Attempt to consolidate reading of XLOG page

Started by Antonin Houskaalmost 7 years ago62 messages

ah@cybertec.at

almost 7 years ago

1 attachment(s)

While working on the instance encryption I found it annoying to apply
decyption of XLOG page to three different functions. Attached is a patch that
tries to merge them all into one function, XLogRead(). The existing
implementations differ in the way new segment is opened. So I added a pointer
to callback function as a new argument. This callback handles the specific
ways to determine segment file name and to open the file.

I can split the patch into multiple diffs to make detailed review easier, but
first I'd like to hear if anything is seriously wrong about this
design. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

reuse_xlogread.patchtext/x-diffDownload

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f9a4960f8a..444b5bf910 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1369,7 +1369,7 @@ ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
 	bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
 }
 
-
+static	XLogReadPos	*readPos = NULL;
 
 /*
  * Reads 2PC data from xlog. During checkpoint this data will be moved to
@@ -1386,8 +1386,17 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
 
-	xlogreader = XLogReaderAllocate(wal_segment_size, &read_local_xlog_page,
-									NULL);
+
+	/* First time through? */
+	if (readPos == NULL)
+		readPos = XLogReadInitPos();
+
+	/*
+	 * read_local_xlog_page() eventually calls XLogRead(), so pass the initial
+	 * position.
+	 */
+	xlogreader = XLogReaderAllocate(wal_segment_size, read_local_xlog_page,
+									readPos);
 	if (!xlogreader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 9196aa3aae..7d0fdfba87 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -26,6 +28,7 @@
 #include "replication/origin.h"
 
 #ifndef FRONTEND
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -1005,6 +1008,191 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Initialize XLOG file position for callers of XLogRead().
+ */
+XLogReadPos *
+XLogReadInitPos(void)
+{
+	XLogReadPos *pos = (XLogReadPos *) palloc(sizeof(XLogReadPos));
+
+	pos->segFile = -1;
+	pos->segNo = 0;
+	pos->segOff = 0;
+	pos->tli = 0;
+	pos->dir = NULL;
+
+	return pos;
+}
+
+#ifdef FRONTEND
+/*
+ * Currently only pg_waldump.c is supposed to set these variables.
+ */
+const char *progname;
+int	WalSegSz;
+
+/*
+ * This is a front-end counterpart of XLogFileNameP.
+ */
+static char *
+XLogFileNameFE(TimeLineID tli, XLogSegNo segno)
+{
+	char	   *result = palloc(MAXFNAMELEN);
+
+	XLogFileName(result, tli, segno, WalSegSz);
+	return result;
+}
+
+/*
+ * XXX pg_waldump.c needs this function. Is there a smart way to put it into
+ * src/common?
+ */
+static void fatal_error(const char *fmt,...) pg_attribute_printf(1, 2);
+
+static void
+fatal_error(const char *fmt,...)
+{
+	va_list		args;
+
+	fflush(stdout);
+
+	fprintf(stderr, _("%s: FATAL:  "), progname);
+	va_start(args, fmt);
+	vfprintf(stderr, _(fmt), args);
+	va_end(args);
+	fputc('\n', stderr);
+
+	exit(EXIT_FAILURE);
+}
+#endif	/* FRONTEND */
+
+/*
+ * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'. If
+ * tli is passed, get the data from timeline *tli. 'pos' is the current
+ * position in the XLOG file and openSegment is a callback that opens the next
+ * segment for reading.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ *
+ * Will open, and keep open, one WAL segment stored in the global file
+ * descriptor sendFile. This means if XLogRead is used once, there will
+ * always be one descriptor left open until the process ends, but never
+ * more than one.
+ */
+void
+XLogRead(char *buf, XLogRecPtr startptr, Size count,
+		 TimeLineID *tli, XLogReadPos *pos, XLogOpenSegment openSegment)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		uint32		startoff;
+		int			segbytes;
+		int			readbytes;
+		int	segsize;
+
+#ifndef FRONTEND
+		segsize = wal_segment_size;
+#else
+		segsize = WalSegSz;
+#endif
+
+		startoff = XLogSegmentOffset(recptr, segsize);
+
+		if (pos->segFile < 0 ||
+			!XLByteInSeg(recptr, pos->segNo, segsize) ||
+			(tli != NULL && *tli != pos->tli))
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (pos->segFile >= 0)
+				close(pos->segFile);
+
+			XLByteToSeg(recptr, nextSegNo, segsize);
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, tli, pos);
+		}
+
+		/* Need to seek in the file? */
+		if (pos->segOff != startoff)
+		{
+			if (lseek(pos->segFile, (off_t) startoff, SEEK_SET) < 0)
+#ifndef FRONTEND
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not seek in log segment %s to offset %u: %m",
+								XLogFileNameP(pos->tli, pos->segNo),
+								startoff)));
+#else
+			fatal_error("could not seek in log segment %s to offset %u",
+						XLogFileNameFE(pos->tli, pos->segNo), startoff);
+#endif
+			pos->segOff = startoff;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (segsize - startoff))
+			segbytes = segsize - startoff;
+		else
+			segbytes = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		readbytes = read(pos->segFile, p, segbytes);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+		if (readbytes < 0)
+		{
+#ifndef FRONTEND
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+							XLogFileNameP(pos->tli, pos->segNo), pos->segOff,
+							(Size) segbytes)));
+#else
+			fatal_error("could not read from log segment %s, offset %u, length %zu",
+						XLogFileNameFE(pos->tli, pos->segNo), pos->segOff,
+						(Size) segbytes);
+#endif
+		}
+		else if (readbytes == 0)
+		{
+#ifndef FRONTEND
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+							XLogFileNameP(pos->tli, pos->segNo), pos->segOff,
+							readbytes, (Size) segbytes)));
+#else
+			fatal_error("could not read from log segment %s, offset %u: read %d of %zu",
+						XLogFileNameFE(pos->tli, pos->segNo), pos->segOff,
+						readbytes, (Size) segbytes);
+#endif
+		}
+
+		/* Update state for read */
+		recptr += readbytes;
+
+		pos->segOff += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+	}
+}
 
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..4f29c89c06 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -639,128 +639,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
@@ -896,6 +774,37 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+read_local_xlog_page_open_segment(XLogSegNo segNo, TimeLineID *tli,
+								  XLogReadPos *pos)
+{
+	char	path[MAXPGPATH];
+
+	XLogFilePath(path, *tli, segNo, wal_segment_size);
+	pos->segFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (pos->segFile < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	pos->segNo = segNo;
+	pos->segOff = 0;
+	pos->tli = *tli;
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -1017,14 +926,16 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
+	Assert(state->wal_segment_size == wal_segment_size);
+
 	/*
 	 * Even though we just determined how much of the page can be validly read
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr,
-			 XLOG_BLCKSZ);
-
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, pageTLI,
+			 (XLogReadPos *) state->private_data,
+			 read_local_xlog_page_open_segment);
 	/* number of valid bytes in the buffer */
 	return count;
 }
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 424fe86a1b..20c1ad4a35 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -124,6 +124,7 @@ StartupDecodingContext(List *output_plugin_options,
 					   bool need_full_snapshot,
 					   bool fast_forward,
 					   XLogPageReadCB read_page,
+					   void *read_page_arg,
 					   LogicalOutputPluginWriterPrepareWrite prepare_write,
 					   LogicalOutputPluginWriterWrite do_write,
 					   LogicalOutputPluginWriterUpdateProgress update_progress)
@@ -172,14 +173,13 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, read_page, ctx);
+	ctx->reader = XLogReaderAllocate(wal_segment_size, read_page,
+									 read_page_arg);
 	if (!ctx->reader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of memory")));
 
-	ctx->reader->private_data = ctx;
-
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
@@ -234,6 +234,7 @@ CreateInitDecodingContext(char *plugin,
 						  bool need_full_snapshot,
 						  XLogRecPtr restart_lsn,
 						  XLogPageReadCB read_page,
+						  void *read_page_arg,
 						  LogicalOutputPluginWriterPrepareWrite prepare_write,
 						  LogicalOutputPluginWriterWrite do_write,
 						  LogicalOutputPluginWriterUpdateProgress update_progress)
@@ -330,8 +331,8 @@ CreateInitDecodingContext(char *plugin,
 
 	ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
 								 need_full_snapshot, false,
-								 read_page, prepare_write, do_write,
-								 update_progress);
+								 read_page, read_page_arg, prepare_write,
+								 do_write, update_progress);
 
 	/* call output plugin initialization callback */
 	old_context = MemoryContextSwitchTo(ctx->context);
@@ -376,6 +377,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 					  List *output_plugin_options,
 					  bool fast_forward,
 					  XLogPageReadCB read_page,
+					  void *read_page_arg,
 					  LogicalOutputPluginWriterPrepareWrite prepare_write,
 					  LogicalOutputPluginWriterWrite do_write,
 					  LogicalOutputPluginWriterUpdateProgress update_progress)
@@ -428,8 +430,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId, false,
-								 fast_forward, read_page, prepare_write,
-								 do_write, update_progress);
+								 fast_forward, read_page, read_page_arg,
+								 prepare_write, do_write, update_progress);
 
 	/* call output plugin initialization callback */
 	old_context = MemoryContextSwitchTo(ctx->context);
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index d974400d6e..b2f30d53f5 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -248,13 +248,20 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 
 	PG_TRY();
 	{
-		/* restart at slot's confirmed_flush */
+		/*
+		 * Restart at slot's confirmed_flush.
+		 *
+		 * logical_read_local_xlog_page() eventually calls XLogRead(), so set
+		 * the initial position.
+		 */
 		ctx = CreateDecodingContext(InvalidXLogRecPtr,
 									options,
 									false,
 									logical_read_local_xlog_page,
+									XLogReadInitPos(),
 									LogicalOutputPrepareWrite,
-									LogicalOutputWrite, NULL);
+									LogicalOutputWrite,
+									NULL);
 
 		MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 182fe5bc82..dbcaa9c1d8 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -14,6 +14,7 @@
 
 #include "access/htup_details.h"
 #include "access/xlog_internal.h"
+#include "access/xlogreader.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "replication/decode.h"
@@ -144,8 +145,9 @@ create_logical_replication_slot(char *name, char *plugin,
 	ctx = CreateInitDecodingContext(plugin, NIL,
 									false,	/* do not build snapshot */
 									restart_lsn,
-									logical_read_local_xlog_page, NULL, NULL,
-									NULL);
+									logical_read_local_xlog_page,
+									XLogReadInitPos(),
+									NULL, NULL, NULL);
 
 	/* build initial snapshot, might take a while */
 	DecodingContextFindStartpoint(ctx);
@@ -401,11 +403,15 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		 * Create our decoding context in fast_forward mode, passing start_lsn
 		 * as InvalidXLogRecPtr, so that we start processing from my slot's
 		 * confirmed_flush.
+		 *
+		 * logical_read_local_xlog_page() eventually calls XLogRead(), so set
+		 * the initial position.
 		 */
 		ctx = CreateDecodingContext(InvalidXLogRecPtr,
 									NIL,
 									true,	/* fast_forward */
 									logical_read_local_xlog_page,
+									XLogReadInitPos(),
 									NULL, NULL, NULL);
 
 		/*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index aae6adc15c..56f9ae88b1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -128,16 +128,7 @@ bool		log_replication_commands = false;
  */
 bool		wake_wal_senders = false;
 
-/*
- * These variables are used similarly to openLogFile/SegNo/Off,
- * but for walsender to read the XLOG.
- */
-static int	sendFile = -1;
-static XLogSegNo sendSegNo = 0;
-static uint32 sendOff = 0;
-
-/* Timeline ID of the currently open file */
-static TimeLineID curFileTimeLine = 0;
+static XLogReadPos	*sendPos= NULL;
 
 /*
  * These variables keep track of the state of the timeline we're currently
@@ -256,7 +247,8 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void WalSndOpenSegment(XLogSegNo segNo, TimeLineID *tli,
+							  XLogReadPos *pos);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -285,6 +277,9 @@ InitWalSender(void)
 
 	/* Initialize empty timestamp buffer for lag tracking. */
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
+
+	/* Make sure we can remember the current read position in XLOG. */
+	sendPos = XLogReadInitPos();
 }
 
 /*
@@ -301,10 +296,10 @@ WalSndErrorCleanup(void)
 	ConditionVariableCancelSleep();
 	pgstat_report_wait_end();
 
-	if (sendFile >= 0)
+	if (sendPos && sendPos->segFile >= 0)
 	{
-		close(sendFile);
-		sendFile = -1;
+		close(sendPos->segFile);
+		sendPos->segFile = -1;
 	}
 
 	if (MyReplicationSlot != NULL)
@@ -787,7 +782,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, NULL, sendPos,
+			 WalSndOpenSegment);
 
 	return count;
 }
@@ -933,9 +929,13 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 			need_full_snapshot = true;
 		}
 
+		/*
+		 * logical_read_xlog_page() eventually calls XLogRead(), so pass the
+		 * initial position.
+		 */
 		ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
 										InvalidXLogRecPtr,
-										logical_read_xlog_page,
+										logical_read_xlog_page, sendPos,
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
@@ -1083,10 +1083,13 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	 * position.
 	 *
 	 * Do this before sending CopyBoth, so that any errors are reported early.
+	 *
+	 * logical_read_xlog_page() eventually calls XLogRead(), so pass the
+	 * initial position.
 	 */
 	logical_decoding_ctx =
 		CreateDecodingContext(cmd->startpoint, cmd->options, false,
-							  logical_read_xlog_page,
+							  logical_read_xlog_page, sendPos,
 							  WalSndPrepareWrite, WalSndWriteData,
 							  WalSndUpdateProgress);
 
@@ -2344,187 +2347,76 @@ WalSndKill(int code, Datum arg)
 }
 
 /*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
+ * Callback for XLogRead() to open the next segment.
  */
-static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+void
+WalSndOpenSegment(XLogSegNo segNo, TimeLineID *tli, XLogReadPos *pos)
 {
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
+	char		path[MAXPGPATH];
 
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
+	/*
+	 * The timeline is determined below, caller should not do anything about
+	 * it.
+	 */
+	Assert(tli == NULL);
 
-	while (nbytes > 0)
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	pos->tli = sendTimeLine;
+	if (sendTimeLineIsHistoric)
 	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, wal_segment_size);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, wal_segment_size))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
+		XLogSegNo	endSegNo;
 
-			XLByteToSeg(recptr, sendSegNo, wal_segment_size);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			curFileTimeLine = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSegNo == endSegNo)
-					curFileTimeLine = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, curFileTimeLine, sendSegNo, wal_segment_size);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendFile < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(curFileTimeLine, sendSegNo))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(curFileTimeLine, sendSegNo),
-								startoff)));
-			sendOff = startoff;
-		}
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
+		if (pos->segNo == endSegNo)
+			pos->tli = sendTimeLineNextTLI;
+	}
 
-		/* How many bytes are within this segment? */
-		if (nbytes > (wal_segment_size - startoff))
-			segbytes = wal_segment_size - startoff;
-		else
-			segbytes = nbytes;
+	XLogFilePath(path, pos->tli, segNo, wal_segment_size);
+	pos->segFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
 
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
+	if (pos->segFile < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(pos->tli, pos->segNo))));
+		else
 			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
 	}
 
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, wal_segment_size);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendFile >= 0)
-		{
-			close(sendFile);
-			sendFile = -1;
-
-			goto retry;
-		}
-	}
+	pos->segNo = segNo;
+	pos->segOff = 0;
 }
 
 /*
@@ -2544,6 +2436,7 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2686,9 +2579,9 @@ XLogSendPhysical(void)
 	if (sendTimeLineIsHistoric && sendTimeLineValidUpto <= sentPtr)
 	{
 		/* close the current file. */
-		if (sendFile >= 0)
-			close(sendFile);
-		sendFile = -1;
+		if (sendPos->segFile >= 0)
+			close(sendPos->segFile);
+		sendPos->segFile = -1;
 
 		/* Send CopyDone */
 		pq_putmessage_noblock('c', NULL, 0);
@@ -2759,7 +2652,48 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	XLogRead(&output_message.data[output_message.len], startptr, nbytes,
+			 NULL,				/* WalSndOpenSegment will determine TLI */
+			 sendPos,
+			 WalSndOpenSegment);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(startptr, segno, wal_segment_size);
+	CheckXLogRemoved(segno, ThisTimeLineID);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendPos->segFile >= 0)
+		{
+			close(sendPos->segFile);
+			sendPos->segFile = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index e106fb2ed1..7dd63dd735 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -26,9 +26,9 @@
 #include "rmgrdesc.h"
 
 
-static const char *progname;
+const char *progname;
 
-static int	WalSegSz;
+int	WalSegSz;
 
 typedef struct XLogDumpPrivate
 {
@@ -37,6 +37,7 @@ typedef struct XLogDumpPrivate
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	bool		endptr_reached;
+	XLogReadPos	*pos;
 } XLogDumpPrivate;
 
 typedef struct XLogDumpConfig
@@ -296,126 +297,45 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 		fatal_error("could not find any WAL file");
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
 static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
+XLogDumpOpenSegment(XLogSegNo segNo, TimeLineID *tli, XLogReadPos *pos)
 {
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
+	char		fname[MAXPGPATH];
+	int	tries;
 
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
+	XLogFileName(fname, *tli, segNo, WalSegSz);
 
-	while (nbytes > 0)
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is
+	 * available. So we loop for 5 seconds looking for the file to appear
+	 * before giving up.
+	 */
+	for (tries = 0; tries < 10; tries++)
 	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
+		pos->segFile = open_file_in_directory(pos->dir, fname);
+		if (pos->segFile >= 0)
+			break;
+		if (errno == ENOENT)
 		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
 			int			save_errno = errno;
 
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
 
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
+			errno = save_errno;
+			continue;
 		}
+		/* Any other error, fall through and fail */
+		break;
+	}
 
-		/* Update state for read */
-		recptr += readbytes;
+	if (pos->segFile < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
 
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
+	pos->segNo = segNo;
+	pos->segOff = 0;
 }
 
 /*
@@ -441,8 +361,8 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(private->inpath, private->timeline, targetPagePtr,
-					 readBuff, count);
+	XLogRead(readBuff, targetPagePtr, count, &private->timeline,
+			 private->pos, XLogDumpOpenSegment);
 
 	return count;
 }
@@ -852,6 +772,7 @@ main(int argc, char **argv)
 	private.startptr = InvalidXLogRecPtr;
 	private.endptr = InvalidXLogRecPtr;
 	private.endptr_reached = false;
+	private.pos = XLogReadInitPos();
 
 	config.bkp_details = false;
 	config.stop_after_records = -1;
@@ -1083,6 +1004,9 @@ main(int argc, char **argv)
 	else
 		identify_target_directory(&private, private.inpath, NULL);
 
+	/* The XLOG position can be used separate from "private". */
+	private.pos->dir = private.inpath;
+
 	/* we don't know what to print */
 	if (XLogRecPtrIsInvalid(private.startptr))
 	{
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f3bae0bf49..9bddbd3042 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -219,6 +219,31 @@ extern void XLogReaderInvalReadState(XLogReaderState *state);
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/*
+ * Position in XLOG file while reading it.
+ */
+typedef struct XLogReadPos
+{
+	int	segFile;				/* segment file descriptor */
+	XLogSegNo	segNo;			/* segment number */
+	uint32 segOff;				/* offset in the segment */
+	TimeLineID tli;		/* timeline ID of the currently open file */
+
+	char	*dir;				/* directory (only needed by frontends) */
+} XLogReadPos;
+
+/*
+ * Callback to open the specified XLOG segment 'segNo' in timeline 'tli' for
+ * reading and update the position accordingly.
+ */
+typedef void (*XLogOpenSegment) (XLogSegNo segNo,  TimeLineID *tli,
+								 XLogReadPos *pos);
+
+extern XLogReadPos *XLogReadInitPos(void);
+extern void XLogRead(char *buf, XLogRecPtr startptr, Size count,
+					 TimeLineID *tli, XLogReadPos *pos,
+					 XLogOpenSegment openSegment);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 0a2a63a48c..59b29433eb 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -99,6 +99,7 @@ extern LogicalDecodingContext *CreateInitDecodingContext(char *plugin,
 						  bool need_full_snapshot,
 						  XLogRecPtr restart_lsn,
 						  XLogPageReadCB read_page,
+														 void *read_page_arg,
 						  LogicalOutputPluginWriterPrepareWrite prepare_write,
 						  LogicalOutputPluginWriterWrite do_write,
 						  LogicalOutputPluginWriterUpdateProgress update_progress);
@@ -107,6 +108,7 @@ extern LogicalDecodingContext *CreateDecodingContext(
 					  List *output_plugin_options,
 					  bool fast_forward,
 					  XLogPageReadCB read_page,
+					  void *read_page_arg,
 					  LogicalOutputPluginWriterPrepareWrite prepare_write,
 					  LogicalOutputPluginWriterWrite do_write,
 					  LogicalOutputPluginWriterUpdateProgress update_progress);

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 7 years ago

In reply to: Antonin Houska (#1)

Re: Attempt to consolidate reading of XLOG page

Hello.

At Thu, 11 Apr 2019 18:05:42 +0200, Antonin Houska <ah@cybertec.at> wrote in <14984.1554998742@spoje.net>

While working on the instance encryption I found it annoying to apply
decyption of XLOG page to three different functions. Attached is a patch that
tries to merge them all into one function, XLogRead(). The existing
implementations differ in the way new segment is opened. So I added a pointer
to callback function as a new argument. This callback handles the specific
ways to determine segment file name and to open the file.

I can split the patch into multiple diffs to make detailed review easier, but
first I'd like to hear if anything is seriously wrong about this
design. Thanks.

This patch changes XLogRead to allow using other than
BasicOpenFile to open a segment, and use XLogReaderState.private
to hold a new struct XLogReadPos for the segment reader. The new
struct is heavily duplicated with XLogReaderState and I'm not
sure the rason why the XLogReadPos is needed.

Anyway, in the first place, such two distinct-but-highly-related
callbacks makes things too complex. Heikki said that the log
reader stuff is better not using callbacks and I agree to that. I
did that once for my own but the code is no longer
applicable. But it seems to be the time to do that.

/messages/by-id/47215279-228d-f30d-35d1-16af695e53f3@iki.fi

That would seems like follows. That refactoring separates log
reader and page reader.

for(;;)
{
rc = XLogReadRecord(reader, startptr, errormsg);

if (rc == XLREAD_SUCCESS)
{
/* great, got record */
}
if (rc == XLREAD_INVALID_PAGE || XLREAD_INVALID_RECORD)
{
elog(ERROR, "invalid record");
}
if (rc == XLREAD_NEED_DATA)
{
/*
* Read a page from disk, and place it into reader->readBuf
*/
XLogPageRead(reader->readPagePtr, /* page to read */
reader->reqLen /* # of bytes to read */ );
/*
* Now that we have read the data that XLogReadRecord()
* requested, call it again.
*/
continue;
}
}

DecodingContextFindStartpoint(ctx)
do
{
read_local_xlog_page(....);
rc = XLogReadRecord (reader);
while (rc == XLREAD_NEED_DATA);

I'm going to do that again.

Any other opinions, or thoughts?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: Kyotaro HORIGUCHI (#2)

Re: Attempt to consolidate reading of XLOG page

On Fri, Apr 12, 2019 at 12:27:11PM +0900, Kyotaro HORIGUCHI wrote:

This patch changes XLogRead to allow using other than
BasicOpenFile to open a segment, and use XLogReaderState.private
to hold a new struct XLogReadPos for the segment reader. The new
struct is heavily duplicated with XLogReaderState and I'm not
sure the rason why the XLogReadPos is needed.
Any other opinions, or thoughts?

The focus is on the stability of v12 for the next couple of months, so
please make sure to register it to the next CF if you want feedback.

Here are some basic thoughts after a very quick lookup.

+/*
+ * Position in XLOG file while reading it.
+ */
+typedef struct XLogReadPos
+{
+   int segFile;                /* segment file descriptor */
+   XLogSegNo   segNo;          /* segment number */
+   uint32 segOff;              /* offset in the segment */
+   TimeLineID tli;     /* timeline ID of the currently open file */
+
+   char    *dir;               /* directory (only needed by
frontends) */
+} XLogReadPos;
Not sure if there is any point to split that with the XLOG reader
status.

+static void fatal_error(const char *fmt,...) pg_attribute_printf(1, 2);
+
+static void
+fatal_error(const char *fmt,...)
In this more confusion accumulates with something which has enough
warts on HEAD when it comes to declare locally equivalents to the
elog() for src/common/.

+/*
+ * This is a front-end counterpart of XLogFileNameP.
+ */
+static char *
+XLogFileNameFE(TimeLineID tli, XLogSegNo segno)
+{
+   char       *result = palloc(MAXFNAMELEN);
+
+   XLogFileName(result, tli, segno, WalSegSz);
+   return result;
+}
We could use a pointer to an allocated area.  Or even better, just a
static variable as this just gets used for error messages to store
temporarily the segment name in a routine part of perhaps
xlogreader.c.
--
Michael

Haribabu Kommi

kommi.haribabu@gmail.com

almost 7 years ago

In reply to: Antonin Houska (#1)

Re: Attempt to consolidate reading of XLOG page

On Fri, Apr 12, 2019 at 2:06 AM Antonin Houska <ah@cybertec.at> wrote:

While working on the instance encryption I found it annoying to apply
decyption of XLOG page to three different functions. Attached is a patch
that
tries to merge them all into one function, XLogRead(). The existing
implementations differ in the way new segment is opened. So I added a
pointer
to callback function as a new argument. This callback handles the specific
ways to determine segment file name and to open the file.

I can split the patch into multiple diffs to make detailed review easier,
but
first I'd like to hear if anything is seriously wrong about this
design. Thanks.

I didn't check the code, but it is good to combine all the 3 page read
functions
into one instead of spreading the logic.

Regards,
Haribabu Kommi
Fujitsu Australia

Alvaro Herrera

alvherre@2ndquadrant.com

almost 7 years ago

In reply to: Antonin Houska (#1)

Re: Attempt to consolidate reading of XLOG page

On 2019-Apr-11, Antonin Houska wrote:

While working on the instance encryption I found it annoying to apply
decyption of XLOG page to three different functions. Attached is a patch that
tries to merge them all into one function, XLogRead(). The existing
implementations differ in the way new segment is opened. So I added a pointer
to callback function as a new argument. This callback handles the specific
ways to determine segment file name and to open the file.

I can split the patch into multiple diffs to make detailed review easier, but
first I'd like to hear if anything is seriously wrong about this
design. Thanks.

I agree that xlog reading is pretty messy.

I think ifdef'ing the way XLogRead reports errors is not great. Maybe
we can pass a function pointer that is to be called in case of errors?
Not sure about the walsize; maybe it can be a member in XLogReadPos, and
given to XLogReadInitPos()? (Maybe rename XLogReadPos as
XLogReadContext or something like that, indicating it's not just the
read position.)

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Kyotaro HORIGUCHI (#2)

Re: Attempt to consolidate reading of XLOG page

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello.

At Thu, 11 Apr 2019 18:05:42 +0200, Antonin Houska <ah@cybertec.at> wrote in <14984.1554998742@spoje.net>

While working on the instance encryption I found it annoying to apply
decyption of XLOG page to three different functions. Attached is a patch that
tries to merge them all into one function, XLogRead(). The existing
implementations differ in the way new segment is opened. So I added a pointer
to callback function as a new argument. This callback handles the specific
ways to determine segment file name and to open the file.

I can split the patch into multiple diffs to make detailed review easier, but
first I'd like to hear if anything is seriously wrong about this
design. Thanks.

This patch changes XLogRead to allow using other than
BasicOpenFile to open a segment,

Good point. The acceptable ways to open file on both frontend and backend side
need to be documented.

and use XLogReaderState.private to hold a new struct XLogReadPos for the
segment reader. The new struct is heavily duplicated with XLogReaderState
and I'm not sure the rason why the XLogReadPos is needed.

ok, I missed the fact that XLogReaderState already contains most of the info
that I put into XLogReadPos. So XLogReadPos is not needed.

Anyway, in the first place, such two distinct-but-highly-related
callbacks makes things too complex. Heikki said that the log
reader stuff is better not using callbacks and I agree to that. I
did that once for my own but the code is no longer
applicable. But it seems to be the time to do that.

/messages/by-id/47215279-228d-f30d-35d1-16af695e53f3@iki.fi

Thanks for the link. My understanding is that the drawback of the
XLogReaderState.read_page callback is that it cannot easily switch between
XLOG sources in order to handle failure because the caller of XLogReadRecord()
usually controls those sources too.

However the callback I pass to XLogRead() is different: if it fails, it simply
raises ERROR. Since this indicates rather low-level problem, there's no reason
for this callback to try to recover from the failure.

That would seems like follows. That refactoring separates log
reader and page reader.

for(;;)
{
rc = XLogReadRecord(reader, startptr, errormsg);

if (rc == XLREAD_SUCCESS)
{
/* great, got record */
}
if (rc == XLREAD_INVALID_PAGE || XLREAD_INVALID_RECORD)
{
elog(ERROR, "invalid record");
}
if (rc == XLREAD_NEED_DATA)
{
/*
* Read a page from disk, and place it into reader->readBuf
*/
XLogPageRead(reader->readPagePtr, /* page to read */
reader->reqLen /* # of bytes to read */ );
/*
* Now that we have read the data that XLogReadRecord()
* requested, call it again.
*/
continue;
}
}

DecodingContextFindStartpoint(ctx)
do
{
read_local_xlog_page(....);
rc = XLogReadRecord (reader);
while (rc == XLREAD_NEED_DATA);

I'm going to do that again.

Any other opinions, or thoughts?

I don't see an overlap between what you do and what I do. It seems that even
if you change the XLOG reader API, you don't care what read_local_xlog_page()
does internally. What I try to fix is XLogRead(), and that is actually a
subroutine of read_local_xlog_page().

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Michael Paquier (#3)

Re: Attempt to consolidate reading of XLOG page

Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Apr 12, 2019 at 12:27:11PM +0900, Kyotaro HORIGUCHI wrote:

This patch changes XLogRead to allow using other than
BasicOpenFile to open a segment, and use XLogReaderState.private
to hold a new struct XLogReadPos for the segment reader. The new
struct is heavily duplicated with XLogReaderState and I'm not
sure the rason why the XLogReadPos is needed.
Any other opinions, or thoughts?

The focus is on the stability of v12 for the next couple of months, so
please make sure to register it to the next CF if you want feedback.

ok, will do. (A link to mailing list is needed for the CF entry, so I had to
post something anyway :-) Since I don't introduce any kind of "cool new
feature" here, I believe it did not disturb much.)

Here are some basic thoughts after a very quick lookup.
...

Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Alvaro Herrera (#5)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I agree that xlog reading is pretty messy.

I think ifdef'ing the way XLogRead reports errors is not great. Maybe
we can pass a function pointer that is to be called in case of errors?

I'll try a bit harder to evaluate the existing approaches to report the same
error on both backend and frontend side.

Not sure about the walsize; maybe it can be a member in XLogReadPos, and
given to XLogReadInitPos()? (Maybe rename XLogReadPos as
XLogReadContext or something like that, indicating it's not just the
read position.)

As pointed out by others, XLogReadPos is not necessary. So if XLogRead()
receives XLogReaderState instead, it can get the segment size from there.

Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Antonin Houska (#8)

5 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Antonin Houska <ah@cybertec.at> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Not sure about the walsize; maybe it can be a member in XLogReadPos, and
given to XLogReadInitPos()? (Maybe rename XLogReadPos as
XLogReadContext or something like that, indicating it's not just the
read position.)

As pointed out by others, XLogReadPos is not necessary. So if XLogRead()
receives XLogReaderState instead, it can get the segment size from there.

Eventually I found out that it's good to have a separate structure for the
read position because walsender calls the XLogRead() function directly, not
via the XLOG reader. Currently the structure name is XLogSegment (maybe
someone can propose better name) and it's a member of XLogReaderState. No
field of the new structure is duplicated now.

The next version of the patch is attached.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v02_001_Unexport_XLogReaderInvalReadState.patchtext/x-diffDownload

This change is not necessary for the XLogRead() refactoring itself, but I
noticed the problem while working on it. Not sure it's worth a separate CF
entry.

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 9196aa3aae..8cb551a837 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -35,6 +35,7 @@ static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
 				XLogRecPtr recptr);
+static void XLogReaderInvalReadState(XLogReaderState *state);
 static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 				 int reqLen);
 static void report_invalid_record(XLogReaderState *state, const char *fmt,...) pg_attribute_printf(2, 3);
@@ -620,7 +621,7 @@ err:
 /*
  * Invalidate the xlogreader's read state to force a re-read.
  */
-void
+static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
 	state->readSegNo = 0;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f3bae0bf49..2d3c067135 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -212,9 +212,6 @@ extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 							 XLogRecPtr recptr, char *phdr);
 
-/* Invalidate read state */
-extern void XLogReaderInvalReadState(XLogReaderState *state);
-
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */

v02_002_Remove_TLI_argument_from_XLR_callback.patchtext/x-diffDownload

The timeline information is available to caller via XLogReaderState. Now that
XLogRead() is gonna be (sometimes) responsible for determining the TLI, it
would have to be added the (TimeLineID *) argument too, just to be consistent
with the current coding style. Since XLogRead() updates also other
position-specific fields of XLogReaderState, it seems simpler if we remove the
output argument from XLogPageReadCB and always report the TLI via
XLogReaderState.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 19b32e21df..5723aa54a7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -886,8 +886,7 @@ static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			 int source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
 static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-			 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-			 TimeLineID *readTLI);
+			 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
@@ -11509,7 +11508,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *readTLI)
+			 XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11626,7 +11625,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	*readTLI = curFileTLI;
+	xlogreader->readPageTLI = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 8cb551a837..244fc7d634 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -558,7 +558,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 
@@ -576,7 +576,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 							   state->currRecPtr,
-							   state->readBuf, &state->readPageTLI);
+							   state->readBuf);
 	if (readLen < 0)
 		goto err;
 
@@ -595,7 +595,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 	}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..cba180912b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -909,12 +909,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
-					 TimeLineID *pageTLI)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	read_upto,
 				loc;
 	int			count;
+	TimeLineID	pageTLI;
 
 	loc = targetPagePtr + reqLen;
 
@@ -934,7 +934,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
 
-		*pageTLI = ThisTimeLineID;
+		pageTLI = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -991,7 +991,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			 * nothing cares so long as the timeline doesn't go backwards.  We
 			 * should read the page header instead; FIXME someday.
 			 */
-			*pageTLI = state->currTLI;
+			pageTLI = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1022,8 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
 			 XLOG_BLCKSZ);
+	state->readPageTLI = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index d974400d6e..d1cf80d441 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -116,10 +116,10 @@ check_permissions(void)
 
 int
 logical_read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	return read_local_xlog_page(state, targetPagePtr, reqLen,
-								targetRecPtr, cur_page, pageTLI);
+								targetRecPtr, cur_page);
 }
 
 /*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 09c8b5a5b3..4296fe8fee 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -763,7 +763,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+					   XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 04a3535dfb..ecb0533436 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -50,8 +50,7 @@ typedef struct XLogPageReadPrivate
 
 static int SimpleXLogPageRead(XLogReaderState *xlogreader,
 				   XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-				   TimeLineID *pageTLI);
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -239,8 +238,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogreader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-				   TimeLineID *pageTLI)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -323,7 +321,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	*pageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index d37e9f0817..2283c553b5 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -423,7 +423,7 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
  */
 static int
 XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				 XLogRecPtr targetPtr, char *readBuff, TimeLineID *curFileTLI)
+				 XLogRecPtr targetPtr, char *readBuff)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 2d3c067135..fade6b2e7d 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -34,8 +34,7 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf,
-							   TimeLineID *pageTLI);
+							   char *readBuf);
 
 typedef struct
 {
@@ -95,9 +94,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set *pageTLI to the TLI of the file the page was
-	 * read from.  It is currently used only for error reporting purposes, to
-	 * reconstruct the name of the WAL file where an error occurred.
+	 * The callback shall set ->readPageTLI to the TLI of the file the page
+	 * was read from.
 	 */
 	XLogPageReadCB read_page;
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..d3aaad839c 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -49,8 +49,7 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 
 extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetPagePtr, int reqLen,
-					 XLogRecPtr targetRecPtr, char *cur_page,
-					 TimeLineID *pageTLI);
+					 XLogRecPtr targetRecPtr, char *cur_page);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 						  XLogRecPtr wantPage, uint32 wantLength);
diff --git a/src/include/replication/logicalfuncs.h b/src/include/replication/logicalfuncs.h
index 3fb7ad5d67..f6fcbc615e 100644
--- a/src/include/replication/logicalfuncs.h
+++ b/src/include/replication/logicalfuncs.h
@@ -14,6 +14,6 @@
 extern int logical_read_local_xlog_page(XLogReaderState *state,
 							 XLogRecPtr targetPagePtr,
 							 int reqLen, XLogRecPtr targetRecPtr,
-							 char *cur_page, TimeLineID *pageTLI);
+							 char *cur_page);
 
 #endif

v02_003_Introduce_XLogSegment.patchtext/x-diffDownload

Introduce XLogSegment structure.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5723aa54a7..59df789674 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4296,7 +4296,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLByteToSeg(xlogreader->latestPagePtr, segno, wal_segment_size);
 			offset = XLogSegmentOffset(xlogreader->latestPagePtr,
 									   wal_segment_size);
-			XLogFileName(fname, xlogreader->readPageTLI, segno,
+			XLogFileName(fname, xlogreader->seg.tli, segno,
 						 wal_segment_size);
 			ereport(emode_for_corrupt_record(emode,
 											 RecPtr ? RecPtr : EndRecPtr),
@@ -7340,7 +7340,7 @@ StartupXLOG(void)
 	 * and we were reading the old WAL from a segment belonging to a higher
 	 * timeline.
 	 */
-	EndOfLogTLI = xlogreader->readPageTLI;
+	EndOfLogTLI = xlogreader->seg.tli;
 
 	/*
 	 * Complain if we did not roll forward far enough to render the backup
@@ -11625,7 +11625,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	xlogreader->readPageTLI = curFileTLI;
+	xlogreader->seg.tli = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 244fc7d634..98cc5d6d9f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -95,7 +95,9 @@ XLogReaderAllocate(int wal_segment_size, XLogPageReadCB pagereadfunc,
 		return NULL;
 	}
 
-	state->wal_segment_size = wal_segment_size;
+	/* Initialize segment pointer. */
+	XLogSegmentInit(&state->seg, wal_segment_size);
+
 	state->read_page = pagereadfunc;
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
@@ -489,8 +491,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->wal_segment_size - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->wal_segment_size);
+		state->EndRecPtr += state->seg.size - 1;
+		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->seg.size);
 	}
 
 	if (DecodeXLogRecord(state, record, errormsg))
@@ -532,12 +534,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 	Assert((pageptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(pageptr, targetSegNo, state->wal_segment_size);
-	targetPageOff = XLogSegmentOffset(pageptr, state->wal_segment_size);
+	XLByteToSeg(pageptr, targetSegNo, state->seg.size);
+	targetPageOff = XLogSegmentOffset(pageptr, state->seg.size);
 
 	/* check whether we have all the requested data already */
-	if (targetSegNo == state->readSegNo && targetPageOff == state->readOff &&
-		reqLen <= state->readLen)
+	if (targetSegNo == state->seg.num &&
+		targetPageOff == state->seg.off && reqLen <= state->readLen)
 		return state->readLen;
 
 	/*
@@ -552,7 +554,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->readSegNo && targetPageOff != 0)
+	if (targetSegNo != state->seg.num && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
@@ -607,8 +609,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		goto err;
 
 	/* update read state information */
-	state->readSegNo = targetSegNo;
-	state->readOff = targetPageOff;
+	state->seg.num = targetSegNo;
+	state->seg.off = targetPageOff;
 	state->readLen = readLen;
 
 	return readLen;
@@ -624,8 +626,8 @@ err:
 static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
-	state->readSegNo = 0;
-	state->readOff = 0;
+	state->seg.num = 0;
+	state->seg.off = 0;
 	state->readLen = 0;
 }
 
@@ -744,16 +746,16 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	Assert((recptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(recptr, segno, state->wal_segment_size);
-	offset = XLogSegmentOffset(recptr, state->wal_segment_size);
+	XLByteToSeg(recptr, segno, state->seg.size);
+	offset = XLogSegmentOffset(recptr, state->seg.size);
 
-	XLogSegNoOffsetToRecPtr(segno, offset, state->wal_segment_size, recaddr);
+	XLogSegNoOffsetToRecPtr(segno, offset, state->seg.size, recaddr);
 
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid magic number %04X in log segment %s, offset %u",
@@ -767,7 +769,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid info bits %04X in log segment %s, offset %u",
@@ -800,7 +802,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 								  fhdrident_str, sysident_str);
 			return false;
 		}
-		else if (longhdr->xlp_seg_size != state->wal_segment_size)
+		else if (longhdr->xlp_seg_size != state->seg.size)
 		{
 			report_invalid_record(state,
 								  "WAL file is from different database system: incorrect segment size in page header");
@@ -817,7 +819,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
@@ -837,7 +839,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "unexpected pageaddr %X/%X in log segment %s, offset %u",
@@ -862,7 +864,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		{
 			char		fname[MAXFNAMELEN];
 
-			XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+			XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 			report_invalid_record(state,
 								  "out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
@@ -1006,6 +1008,19 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Initialize the passed segment pointer.
+ */
+void
+XLogSegmentInit(XLogSegment *seg, int size)
+{
+	seg->file = -1;
+	seg->num = 0;
+	seg->off = 0;
+	seg->tli = 0;
+	seg->dir = NULL;
+	seg->size = size;
+}
 
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index cba180912b..836c2e2927 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -802,8 +802,8 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
 void
 XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 {
-	const XLogRecPtr lastReadPage = state->readSegNo *
-	state->wal_segment_size + state->readOff;
+	const XLogRecPtr lastReadPage = state->seg.num *
+	state->seg.size + state->seg.off;
 
 	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
 	Assert(wantLength <= XLOG_BLCKSZ);
@@ -847,8 +847,8 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
 		state->currTLI != ThisTimeLineID &&
 		state->currTLI != 0 &&
-		((wantPage + wantLength) / state->wal_segment_size) <
-		(state->currTLIValidUntil / state->wal_segment_size))
+		((wantPage + wantLength) / state->seg.size) <
+		(state->currTLIValidUntil / state->seg.size))
 		return;
 
 	/*
@@ -870,11 +870,11 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 		 */
 		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);
 
-		XLogRecPtr	endOfSegment = (((wantPage / state->wal_segment_size) + 1)
-									* state->wal_segment_size) - 1;
+		XLogRecPtr	endOfSegment = (((wantPage / state->seg.size) + 1)
+									* state->seg.size) - 1;
 
-		Assert(wantPage / state->wal_segment_size ==
-			   endOfSegment / state->wal_segment_size);
+		Assert(wantPage / state->seg.size ==
+			   endOfSegment / state->seg.size);
 
 		/*
 		 * Find the timeline of the last LSN on the segment containing
@@ -1022,9 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
 			 XLOG_BLCKSZ);
-	state->readPageTLI = pageTLI;
+	state->seg.tli = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4296fe8fee..6dfb525e1a 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -128,16 +128,7 @@ bool		log_replication_commands = false;
  */
 bool		wake_wal_senders = false;
 
-/*
- * These variables are used similarly to openLogFile/SegNo/Off,
- * but for walsender to read the XLOG.
- */
-static int	sendFile = -1;
-static XLogSegNo sendSegNo = 0;
-static uint32 sendOff = 0;
-
-/* Timeline ID of the currently open file */
-static TimeLineID curFileTimeLine = 0;
+static XLogSegment *sendSeg = NULL;
 
 /*
  * These variables keep track of the state of the timeline we're currently
@@ -285,6 +276,10 @@ InitWalSender(void)
 
 	/* Initialize empty timestamp buffer for lag tracking. */
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
+
+	/* Make sure we can remember the current read position in XLOG. */
+	sendSeg = (XLogSegment *) MemoryContextAlloc(TopMemoryContext, sizeof(XLogSegment));
+	XLogSegmentInit(sendSeg, wal_segment_size);
 }
 
 /*
@@ -301,10 +296,10 @@ WalSndErrorCleanup(void)
 	ConditionVariableCancelSleep();
 	pgstat_report_wait_end();
 
-	if (sendFile >= 0)
+	if (sendSeg->file >= 0)
 	{
-		close(sendFile);
-		sendFile = -1;
+		close(sendSeg->file);
+		sendSeg->file = -1;
 	}
 
 	if (MyReplicationSlot != NULL)
@@ -2378,15 +2373,16 @@ retry:
 
 		startoff = XLogSegmentOffset(recptr, wal_segment_size);
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, wal_segment_size))
+		if (sendSeg->file < 0 ||
+			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
 		{
 			char		path[MAXPGPATH];
 
 			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
+			if (sendSeg->file >= 0)
+				close(sendSeg->file);
 
-			XLByteToSeg(recptr, sendSegNo, wal_segment_size);
+			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
 
 			/*-------
 			 * When reading from a historic timeline, and there is a timeline
@@ -2414,20 +2410,20 @@ retry:
 			 * used portion of the old segment is copied to the new file.
 			 *-------
 			 */
-			curFileTimeLine = sendTimeLine;
+			sendSeg->tli = sendTimeLine;
 			if (sendTimeLineIsHistoric)
 			{
 				XLogSegNo	endSegNo;
 
 				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSegNo == endSegNo)
-					curFileTimeLine = sendTimeLineNextTLI;
+				if (sendSeg->num == endSegNo)
+					sendSeg->tli = sendTimeLineNextTLI;
 			}
 
-			XLogFilePath(path, curFileTimeLine, sendSegNo, wal_segment_size);
+			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
 
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendFile < 0)
+			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+			if (sendSeg->file < 0)
 			{
 				/*
 				 * If the file is not found, assume it's because the standby
@@ -2438,26 +2434,26 @@ retry:
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(curFileTimeLine, sendSegNo))));
+									XLogFileNameP(sendSeg->tli, sendSeg->num))));
 				else
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("could not open file \"%s\": %m",
 									path)));
 			}
-			sendOff = 0;
+			sendSeg->off = 0;
 		}
 
 		/* Need to seek in the file? */
-		if (sendOff != startoff)
+		if (sendSeg->off != startoff)
 		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
+			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(curFileTimeLine, sendSegNo),
+								XLogFileNameP(sendSeg->tli, sendSeg->num),
 								startoff)));
-			sendOff = startoff;
+			sendSeg->off = startoff;
 		}
 
 		/* How many bytes are within this segment? */
@@ -2467,29 +2463,29 @@ retry:
 			segbytes = nbytes;
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
+		readbytes = read(sendSeg->file, p, segbytes);
 		pgstat_report_wait_end();
 		if (readbytes < 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, (Size) segbytes)));
 		}
 		else if (readbytes == 0)
 		{
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, readbytes, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, readbytes, (Size) segbytes)));
 		}
 
 		/* Update state for read */
 		recptr += readbytes;
 
-		sendOff += readbytes;
+		sendSeg->off += readbytes;
 		nbytes -= readbytes;
 		p += readbytes;
 	}
@@ -2520,10 +2516,10 @@ retry:
 		walsnd->needreload = false;
 		SpinLockRelease(&walsnd->mutex);
 
-		if (reload && sendFile >= 0)
+		if (reload && sendSeg->file >= 0)
 		{
-			close(sendFile);
-			sendFile = -1;
+			close(sendSeg->file);
+			sendSeg->file = -1;
 
 			goto retry;
 		}
@@ -2689,9 +2685,9 @@ XLogSendPhysical(void)
 	if (sendTimeLineIsHistoric && sendTimeLineValidUpto <= sentPtr)
 	{
 		/* close the current file. */
-		if (sendFile >= 0)
-			close(sendFile);
-		sendFile = -1;
+		if (sendSeg->file >= 0)
+			close(sendSeg->file);
+		sendSeg->file = -1;
 
 		/* Send CopyDone */
 		pq_putmessage_noblock('c', NULL, 0);
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index ecb0533436..a8a0cc4da2 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -321,7 +321,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->seg.tli = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2283c553b5..b31b6cdcaf 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1098,6 +1098,9 @@ main(int argc, char **argv)
 	if (!xlogreader_state)
 		fatal_error("out of memory");
 
+	/* Finalize the segment pointer. */
+	xlogreader_state->seg.dir = private.inpath;
+
 	/* first find a valid recptr to start from */
 	first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index fade6b2e7d..0d801d5903 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -27,6 +27,20 @@
 
 #include "access/xlogrecord.h"
 
+/*
+ * Position in XLOG file while reading it.
+ */
+typedef struct XLogSegment
+{
+	int			file;			/* segment file descriptor */
+	XLogSegNo	num;			/* segment number */
+	uint32		off;			/* offset in the segment */
+	TimeLineID	tli;			/* timeline ID of the currently open file */
+
+	char	   *dir;			/* directory (only needed by frontends) */
+	int			size;			/* segment size */
+} XLogSegment;
+
 typedef struct XLogReaderState XLogReaderState;
 
 /* Function type definition for the read_page callback */
@@ -73,11 +87,6 @@ struct XLogReaderState
 	 */
 
 	/*
-	 * Segment size of the to-be-parsed data (mandatory).
-	 */
-	int			wal_segment_size;
-
-	/*
 	 * Data input callback (mandatory).
 	 *
 	 * This callback shall read at least reqLen valid bytes of the xlog page
@@ -94,8 +103,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set ->readPageTLI to the TLI of the file the page
-	 * was read from.
+	 * The callback shall set ->seg.tli to the TLI of the file the page was
+	 * read from.
 	 */
 	XLogPageReadCB read_page;
 
@@ -150,10 +159,8 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
-	/* last read segment, segment offset, TLI for data currently in readBuf */
-	XLogSegNo	readSegNo;
-	uint32		readOff;
-	TimeLineID	readPageTLI;
+	/* last read XLOG position for data currently in readBuf */
+	XLogSegment seg;
 
 	/*
 	 * beginning of prior page read, and its TLI.  Doesn't necessarily
@@ -214,6 +221,8 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+extern void XLogSegmentInit(XLogSegment *seg, int size);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,

v02_004_Use_only_one_implementation_of_XLogRead.patchtext/x-diffDownload

Use only xlogreader.c:XLogRead().

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 98cc5d6d9f..1044d4e4dd 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -26,6 +28,7 @@
 #include "replication/origin.h"
 
 #ifndef FRONTEND
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -1022,6 +1025,134 @@ XLogSegmentInit(XLogSegment *seg, int size)
 	seg->size = size;
 }
 
+#ifndef FRONTEND
+/*
+ * Backend should have wal_segment_size variable initialized, segsize is not
+ * used.
+ */
+#define XLogFileNameCommon(tli, num, segsize) XLogFileNameP((tli), (num))
+#define xlr_error(...) ereport(ERROR, (errcode_for_file_access(), errmsg(__VA_ARGS__)))
+#else
+static char xlr_error_msg[MAXFNAMELEN];
+#define XLogFileNameCommon(tli, num, segsize) (XLogFileName(xlr_error_msg, (tli), (num), (segsize)),\
+											   xlr_error_msg)
+#include "fe_utils/logging.h"
+/*
+ * Frontend application (currently only pg_waldump.c) cannot catch and further
+ * process errors, so they simply treat them as fatal.
+ */
+#define xlr_error(...) do {pg_log_fatal(__VA_ARGS__); exit(EXIT_FAILURE); } while(0)
+#endif							/* FRONTEND */
+
+/*
+ * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'. If
+ * tli is passed, get the data from timeline *tli. 'pos' is the current
+ * position in the XLOG file and openSegment is a callback that opens the next
+ * segment for reading.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+void
+XLogRead(char *buf, XLogRecPtr startptr, Size count,
+		 TimeLineID *tli, XLogSegment *seg, XLogOpenSegment openSegment)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		uint32		startoff;
+		int			segbytes;
+		int			readbytes;
+
+		startoff = XLogSegmentOffset(recptr, seg->size);
+
+		if (seg->file < 0 ||
+			!XLByteInSeg(recptr, seg->num, seg->size) ||
+			(tli != NULL && *tli != seg->tli))
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (seg->file >= 0)
+				close(seg->file);
+
+			XLByteToSeg(recptr, nextSegNo, seg->size);
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, tli, seg);
+
+			/*
+			 * If the function is called by the XLOG reader, the reader will
+			 * eventually set both "num" and "off". However we need to care
+			 * about them too because the function can also be used directly,
+			 * see walsender.c.
+			 */
+			seg->num = nextSegNo;
+			seg->off = 0;
+		}
+
+		/* Need to seek in the file? */
+		if (seg->off != startoff)
+		{
+			if (lseek(seg->file, (off_t) startoff, SEEK_SET) < 0)
+				xlr_error("could not seek in log segment %s to offset %u: %m",
+						  XLogFileNameCommon(seg->tli, seg->num, seg->size),
+						  startoff);
+			seg->off = startoff;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (seg->size - startoff))
+			segbytes = seg->size - startoff;
+		else
+			segbytes = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		readbytes = read(seg->file, p, segbytes);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+		if (readbytes < 0)
+		{
+			xlr_error("could not read from log segment %s, offset %u, length %zu: %m",
+					  XLogFileNameCommon(seg->tli, seg->num, seg->size),
+					  seg->off,
+					  (Size) segbytes);
+		}
+		else if (readbytes == 0)
+		{
+			xlr_error("could not read from log segment %s, offset %u: read %d of %zu",
+					  XLogFileNameCommon(seg->tli, seg->num, seg->size),
+					  seg->off,
+					  readbytes,
+					  (Size) segbytes);
+		}
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+
+		/*
+		 * If the function is called by the XLOG reader, the reader will
+		 * eventually set this field. However we need to care about it too
+		 * because the function can also be used directly (see walsender.c).
+		 */
+		seg->off += readbytes;
+	}
+}
+
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 836c2e2927..f4a90a602c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -897,6 +897,35 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 }
 
 /*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+read_local_xlog_page_open_segment(XLogSegNo nextSegNo, TimeLineID *tli,
+								  XLogSegment *seg)
+{
+	char		path[MAXPGPATH];
+
+	XLogFilePath(path, *tli, nextSegNo, seg->size);
+	seg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (seg->file < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	seg->tli = *tli;
+}
+
+/*
  * read_page callback for reading local xlog files
  *
  * Public because it would likely be very helpful for someone writing another
@@ -1022,10 +1051,8 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
-			 XLOG_BLCKSZ);
-	state->seg.tli = pageTLI;
-
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, &pageTLI,
+			 &state->seg, read_local_xlog_page_open_segment);
 	/* number of valid bytes in the buffer */
 	return count;
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6dfb525e1a..afafd4082e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -247,7 +247,9 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli,
+				  XLogSegment *seg);
+static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -782,7 +784,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, NULL, sendSeg,
+			 WalSndOpenSegment);
 
 	return count;
 }
@@ -2353,7 +2356,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2527,6 +2530,76 @@ retry:
 }
 
 /*
+ * Callback for XLogRead() to open the next segment.
+ */
+void
+WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+{
+	char		path[MAXPGPATH];
+
+	/*
+	 * The timeline is determined below, caller should not do anything about
+	 * it.
+	 */
+	Assert(tli == NULL);
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	seg->tli = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, seg->size);
+		if (seg->num == endSegNo)
+			seg->tli = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, seg->tli, nextSegNo, seg->size);
+	seg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (seg->file < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(seg->tli, seg->num))));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+}
+
+/*
  * Send out the WAL in its normal physical/stored form.
  *
  * Read up to MAX_SEND_SIZE bytes of WAL that's been flushed to disk,
@@ -2543,6 +2616,7 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2758,7 +2832,48 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	XLogRead(&output_message.data[output_message.len], startptr, nbytes,
+			 NULL,				/* WalSndOpenSegment will determine TLI */
+			 sendSeg,
+			 WalSndOpenSegment);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(startptr, segno, wal_segment_size);
+	CheckXLogRemoved(segno, ThisTimeLineID);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->file >= 0)
+		{
+			close(sendSeg->file);
+			sendSeg->file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index b31b6cdcaf..caf5533aeb 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -296,6 +296,45 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 		fatal_error("could not find any WAL file");
 }
 
+static void
+XLogDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+{
+	char		fname[MAXPGPATH];
+	int			tries;
+
+	XLogFileName(fname, *tli, nextSegNo, seg->size);
+
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
+	{
+		seg->file = open_file_in_directory(seg->dir, fname);
+		if (seg->file >= 0)
+			break;
+		if (errno == ENOENT)
+		{
+			int			save_errno = errno;
+
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
+			errno = save_errno;
+			continue;
+		}
+		/* Any other error, fall through and fail */
+		break;
+	}
+
+	if (seg->file < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
+	seg->tli = *tli;
+}
+
 /*
  * Read count bytes from a segment file in the specified directory, for the
  * given timeline, containing the specified record pointer; store the data in
@@ -441,8 +480,8 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(private->inpath, private->timeline, targetPagePtr,
-					 readBuff, count);
+	XLogRead(readBuff, targetPagePtr, count, &private->timeline,
+			 &state->seg, XLogDumpOpenSegment);
 
 	return count;
 }
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0d801d5903..5ba89a7035 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -221,7 +221,22 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/*
+ * Callback to open the specified XLOG segment nextSegNo in timeline *tli for
+ * reading, and assign the descriptor to ->file. BasicOpenFile() is the
+ * preferred way to open the segment file in backend code, whereas open(2)
+ * should be used in frontend.
+ *
+ * If NULL is passed for tli, the callback must determine the timeline
+ * itself. In any case it's supposed to eventually set ->tli.
+ */
+typedef void (*XLogOpenSegment) (XLogSegNo nextSegNo, TimeLineID *tli,
+								 XLogSegment *seg);
+
 extern void XLogSegmentInit(XLogSegment *seg, int size);
+extern void XLogRead(char *buf, XLogRecPtr startptr, Size count,
+		 TimeLineID *tli, XLogSegment *seg,
+		 XLogOpenSegment openSegment);
 
 /* Functions for decoding an XLogRecord */

v02_005_Cleanup.patchtext/x-diffDownload

Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the XLogOpenSegment callback) is added
nearby at the same time (the addition and removal of code can get mixed in the diff).

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index f4a90a602c..3e531be9b0 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -17,14 +17,11 @@
  */
 #include "postgres.h"
 
-#include <unistd.h>
-
 #include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -640,128 +637,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 }
 
 /*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
-/*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
  *
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index afafd4082e..c6a1992471 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -249,7 +249,6 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli,
 				  XLogSegment *seg);
-static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -2345,191 +2344,6 @@ WalSndKill(int code, Datum arg)
 }
 
 /*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, wal_segment_size);
-
-		if (sendSeg->file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->file >= 0)
-				close(sendSeg->file);
-
-			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSeg->num == endSegNo)
-					sendSeg->tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
-
-			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->tli, sendSeg->num))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->off != startoff)
-		{
-			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->tli, sendSeg->num),
-								startoff)));
-			sendSeg->off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (wal_segment_size - startoff))
-			segbytes = wal_segment_size - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, wal_segment_size);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->file >= 0)
-		{
-			close(sendSeg->file);
-			sendSeg->file = -1;
-
-			goto retry;
-		}
-	}
-}
-
-/*
  * Callback for XLogRead() to open the next segment.
  */
 void
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index caf5533aeb..557816aff1 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -14,7 +14,6 @@
 
 #include <dirent.h>
 #include <sys/stat.h>
-#include <unistd.h>
 
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
@@ -336,128 +335,6 @@ XLogDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
 }
 
 /*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
-/*
  * XLogReader read_page callback
  */
 static int

#10

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Antonin Houska (#9)

Re: Attempt to consolidate reading of XLOG page

On Thu, May 2, 2019 at 12:18 PM Antonin Houska <ah@cybertec.at> wrote:

The next version of the patch is attached.

I don't think any of this looks acceptable:

+#ifndef FRONTEND
+/*
+ * Backend should have wal_segment_size variable initialized, segsize is not
+ * used.
+ */
+#define XLogFileNameCommon(tli, num, segsize) XLogFileNameP((tli), (num))
+#define xlr_error(...) ereport(ERROR, (errcode_for_file_access(),
errmsg(__VA_ARGS__)))
+#else
+static char xlr_error_msg[MAXFNAMELEN];
+#define XLogFileNameCommon(tli, num, segsize)
(XLogFileName(xlr_error_msg, (tli), (num), (segsize)),\
+    xlr_error_msg)
+#include "fe_utils/logging.h"
+/*
+ * Frontend application (currently only pg_waldump.c) cannot catch and further
+ * process errors, so they simply treat them as fatal.
+ */
+#define xlr_error(...) do {pg_log_fatal(__VA_ARGS__);
exit(EXIT_FAILURE); } while(0)
+#endif

The backend part doesn't look OK because depending on the value of a
global variable instead of getting the information via parameters
seems like a step backward. The frontend part doesn't look OK because
it locks every application that uses the xlogreader stuff into using
pg_log_fatal when an error occurs, which may not be what everybody
wants to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Robert Haas (#10)

Re: Attempt to consolidate reading of XLOG page

On 2019-May-06, Robert Haas wrote:

On Thu, May 2, 2019 at 12:18 PM Antonin Houska <ah@cybertec.at> wrote:

The next version of the patch is attached.

I don't think any of this looks acceptable:

I agree. I inteded to suggest upthread to pass an additional argument
to XLogRead, which is a function that takes a message string and
SQLSTATE; in backend, the function does errstart / errstate / errmsg /
errfinish, and in frontend programs it does pg_log_fatal (and ignores
sqlstate). The message must be sprintf'ed and translated by XLogRead.
(xlogreader.c could itself provide a default error reporting callback,
at least for frontend, to avoid repeating the code). That way, if a
different frontend program wants to do something different, it's fairly
easy to pass a different function pointer.

BTW, having frontend's XLogFileNameCommon use a totally unrelated
variable for its printing is naughty.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Alvaro Herrera (#11)

Re: Attempt to consolidate reading of XLOG page

On Mon, May 6, 2019 at 2:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-May-06, Robert Haas wrote:

On Thu, May 2, 2019 at 12:18 PM Antonin Houska <ah@cybertec.at> wrote:

The next version of the patch is attached.

I don't think any of this looks acceptable:

I agree. I inteded to suggest upthread to pass an additional argument
to XLogRead, which is a function that takes a message string and
SQLSTATE; in backend, the function does errstart / errstate / errmsg /
errfinish, and in frontend programs it does pg_log_fatal (and ignores
sqlstate). The message must be sprintf'ed and translated by XLogRead.
(xlogreader.c could itself provide a default error reporting callback,
at least for frontend, to avoid repeating the code). That way, if a
different frontend program wants to do something different, it's fairly
easy to pass a different function pointer.

It seems to me that it's better to unwind the stack i.e. have the
function return the error information to the caller and let the caller
do as it likes. The other thread to which Horiguchi-san referred
earlier in this thread seems to me to have basically concluded that
the XLogPageReadCB callback to XLogReaderAllocate is a pain to use
because it doesn't unwind the stack, and work is under way over there
to get rid of that callback for just that reason. Adding a new
callback for error-reporting would just be creating a new instance of
the same issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Robert Haas (#12)

5 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Robert Haas <robertmhaas@gmail.com> wrote:

It seems to me that it's better to unwind the stack i.e. have the
function return the error information to the caller and let the caller
do as it likes.

Thanks for a hint. The next version tries to do that.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v03_001_Unexport_XLogReaderInvalReadState.patchtext/x-diffDownload

This change is not necessary for the XLogRead() refactoring itself, but I
noticed the problem while working on it. Not sure it's worth a separate CF
entry.

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 9196aa3aae..8cb551a837 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -35,6 +35,7 @@ static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
 				XLogRecPtr recptr);
+static void XLogReaderInvalReadState(XLogReaderState *state);
 static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 				 int reqLen);
 static void report_invalid_record(XLogReaderState *state, const char *fmt,...) pg_attribute_printf(2, 3);
@@ -620,7 +621,7 @@ err:
 /*
  * Invalidate the xlogreader's read state to force a re-read.
  */
-void
+static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
 	state->readSegNo = 0;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f3bae0bf49..2d3c067135 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -212,9 +212,6 @@ extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 							 XLogRecPtr recptr, char *phdr);
 
-/* Invalidate read state */
-extern void XLogReaderInvalReadState(XLogReaderState *state);
-
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */

v03_002_Remove_TLI_argument_from_XLR_callback.patchtext/x-diffDownload

The timeline information is available to caller via XLogReaderState. Now that
XLogRead() is gonna be (sometimes) responsible for determining the TLI, it
would have to be added the (TimeLineID *) argument too, just to be consistent
with the current coding style. Since XLogRead() updates also other
position-specific fields of XLogReaderState, it seems simpler if we remove the
output argument from XLogPageReadCB and always report the TLI via
XLogReaderState.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 19b32e21df..5723aa54a7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -886,8 +886,7 @@ static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			 int source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
 static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-			 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-			 TimeLineID *readTLI);
+			 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
@@ -11509,7 +11508,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *readTLI)
+			 XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11626,7 +11625,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	*readTLI = curFileTLI;
+	xlogreader->readPageTLI = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 8cb551a837..244fc7d634 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -558,7 +558,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 
@@ -576,7 +576,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 							   state->currRecPtr,
-							   state->readBuf, &state->readPageTLI);
+							   state->readBuf);
 	if (readLen < 0)
 		goto err;
 
@@ -595,7 +595,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 	}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..cba180912b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -909,12 +909,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
-					 TimeLineID *pageTLI)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	read_upto,
 				loc;
 	int			count;
+	TimeLineID	pageTLI;
 
 	loc = targetPagePtr + reqLen;
 
@@ -934,7 +934,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
 
-		*pageTLI = ThisTimeLineID;
+		pageTLI = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -991,7 +991,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			 * nothing cares so long as the timeline doesn't go backwards.  We
 			 * should read the page header instead; FIXME someday.
 			 */
-			*pageTLI = state->currTLI;
+			pageTLI = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1022,8 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
 			 XLOG_BLCKSZ);
+	state->readPageTLI = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index d974400d6e..d1cf80d441 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -116,10 +116,10 @@ check_permissions(void)
 
 int
 logical_read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	return read_local_xlog_page(state, targetPagePtr, reqLen,
-								targetRecPtr, cur_page, pageTLI);
+								targetRecPtr, cur_page);
 }
 
 /*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 09c8b5a5b3..4296fe8fee 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -763,7 +763,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+					   XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 04a3535dfb..ecb0533436 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -50,8 +50,7 @@ typedef struct XLogPageReadPrivate
 
 static int SimpleXLogPageRead(XLogReaderState *xlogreader,
 				   XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-				   TimeLineID *pageTLI);
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -239,8 +238,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogreader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-				   TimeLineID *pageTLI)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -323,7 +321,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	*pageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index d37e9f0817..2283c553b5 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -423,7 +423,7 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
  */
 static int
 XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				 XLogRecPtr targetPtr, char *readBuff, TimeLineID *curFileTLI)
+				 XLogRecPtr targetPtr, char *readBuff)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 2d3c067135..fade6b2e7d 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -34,8 +34,7 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf,
-							   TimeLineID *pageTLI);
+							   char *readBuf);
 
 typedef struct
 {
@@ -95,9 +94,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set *pageTLI to the TLI of the file the page was
-	 * read from.  It is currently used only for error reporting purposes, to
-	 * reconstruct the name of the WAL file where an error occurred.
+	 * The callback shall set ->readPageTLI to the TLI of the file the page
+	 * was read from.
 	 */
 	XLogPageReadCB read_page;
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 0ab5ba62f5..d3aaad839c 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -49,8 +49,7 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 
 extern int read_local_xlog_page(XLogReaderState *state,
 					 XLogRecPtr targetPagePtr, int reqLen,
-					 XLogRecPtr targetRecPtr, char *cur_page,
-					 TimeLineID *pageTLI);
+					 XLogRecPtr targetRecPtr, char *cur_page);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 						  XLogRecPtr wantPage, uint32 wantLength);
diff --git a/src/include/replication/logicalfuncs.h b/src/include/replication/logicalfuncs.h
index 3fb7ad5d67..f6fcbc615e 100644
--- a/src/include/replication/logicalfuncs.h
+++ b/src/include/replication/logicalfuncs.h
@@ -14,6 +14,6 @@
 extern int logical_read_local_xlog_page(XLogReaderState *state,
 							 XLogRecPtr targetPagePtr,
 							 int reqLen, XLogRecPtr targetRecPtr,
-							 char *cur_page, TimeLineID *pageTLI);
+							 char *cur_page);
 
 #endif

v03_003_Introduce_XLogSegment.patchtext/x-diffDownload

Introduce XLogSegment structure.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5723aa54a7..59df789674 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4296,7 +4296,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLByteToSeg(xlogreader->latestPagePtr, segno, wal_segment_size);
 			offset = XLogSegmentOffset(xlogreader->latestPagePtr,
 									   wal_segment_size);
-			XLogFileName(fname, xlogreader->readPageTLI, segno,
+			XLogFileName(fname, xlogreader->seg.tli, segno,
 						 wal_segment_size);
 			ereport(emode_for_corrupt_record(emode,
 											 RecPtr ? RecPtr : EndRecPtr),
@@ -7340,7 +7340,7 @@ StartupXLOG(void)
 	 * and we were reading the old WAL from a segment belonging to a higher
 	 * timeline.
 	 */
-	EndOfLogTLI = xlogreader->readPageTLI;
+	EndOfLogTLI = xlogreader->seg.tli;
 
 	/*
 	 * Complain if we did not roll forward far enough to render the backup
@@ -11625,7 +11625,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	xlogreader->readPageTLI = curFileTLI;
+	xlogreader->seg.tli = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 244fc7d634..98cc5d6d9f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -95,7 +95,9 @@ XLogReaderAllocate(int wal_segment_size, XLogPageReadCB pagereadfunc,
 		return NULL;
 	}
 
-	state->wal_segment_size = wal_segment_size;
+	/* Initialize segment pointer. */
+	XLogSegmentInit(&state->seg, wal_segment_size);
+
 	state->read_page = pagereadfunc;
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
@@ -489,8 +491,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->wal_segment_size - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->wal_segment_size);
+		state->EndRecPtr += state->seg.size - 1;
+		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->seg.size);
 	}
 
 	if (DecodeXLogRecord(state, record, errormsg))
@@ -532,12 +534,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 	Assert((pageptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(pageptr, targetSegNo, state->wal_segment_size);
-	targetPageOff = XLogSegmentOffset(pageptr, state->wal_segment_size);
+	XLByteToSeg(pageptr, targetSegNo, state->seg.size);
+	targetPageOff = XLogSegmentOffset(pageptr, state->seg.size);
 
 	/* check whether we have all the requested data already */
-	if (targetSegNo == state->readSegNo && targetPageOff == state->readOff &&
-		reqLen <= state->readLen)
+	if (targetSegNo == state->seg.num &&
+		targetPageOff == state->seg.off && reqLen <= state->readLen)
 		return state->readLen;
 
 	/*
@@ -552,7 +554,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->readSegNo && targetPageOff != 0)
+	if (targetSegNo != state->seg.num && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
@@ -607,8 +609,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		goto err;
 
 	/* update read state information */
-	state->readSegNo = targetSegNo;
-	state->readOff = targetPageOff;
+	state->seg.num = targetSegNo;
+	state->seg.off = targetPageOff;
 	state->readLen = readLen;
 
 	return readLen;
@@ -624,8 +626,8 @@ err:
 static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
-	state->readSegNo = 0;
-	state->readOff = 0;
+	state->seg.num = 0;
+	state->seg.off = 0;
 	state->readLen = 0;
 }
 
@@ -744,16 +746,16 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	Assert((recptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(recptr, segno, state->wal_segment_size);
-	offset = XLogSegmentOffset(recptr, state->wal_segment_size);
+	XLByteToSeg(recptr, segno, state->seg.size);
+	offset = XLogSegmentOffset(recptr, state->seg.size);
 
-	XLogSegNoOffsetToRecPtr(segno, offset, state->wal_segment_size, recaddr);
+	XLogSegNoOffsetToRecPtr(segno, offset, state->seg.size, recaddr);
 
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid magic number %04X in log segment %s, offset %u",
@@ -767,7 +769,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid info bits %04X in log segment %s, offset %u",
@@ -800,7 +802,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 								  fhdrident_str, sysident_str);
 			return false;
 		}
-		else if (longhdr->xlp_seg_size != state->wal_segment_size)
+		else if (longhdr->xlp_seg_size != state->seg.size)
 		{
 			report_invalid_record(state,
 								  "WAL file is from different database system: incorrect segment size in page header");
@@ -817,7 +819,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
@@ -837,7 +839,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "unexpected pageaddr %X/%X in log segment %s, offset %u",
@@ -862,7 +864,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		{
 			char		fname[MAXFNAMELEN];
 
-			XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+			XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 			report_invalid_record(state,
 								  "out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
@@ -1006,6 +1008,19 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Initialize the passed segment pointer.
+ */
+void
+XLogSegmentInit(XLogSegment *seg, int size)
+{
+	seg->file = -1;
+	seg->num = 0;
+	seg->off = 0;
+	seg->tli = 0;
+	seg->dir = NULL;
+	seg->size = size;
+}
 
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index cba180912b..836c2e2927 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -802,8 +802,8 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
 void
 XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 {
-	const XLogRecPtr lastReadPage = state->readSegNo *
-	state->wal_segment_size + state->readOff;
+	const XLogRecPtr lastReadPage = state->seg.num *
+	state->seg.size + state->seg.off;
 
 	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
 	Assert(wantLength <= XLOG_BLCKSZ);
@@ -847,8 +847,8 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
 		state->currTLI != ThisTimeLineID &&
 		state->currTLI != 0 &&
-		((wantPage + wantLength) / state->wal_segment_size) <
-		(state->currTLIValidUntil / state->wal_segment_size))
+		((wantPage + wantLength) / state->seg.size) <
+		(state->currTLIValidUntil / state->seg.size))
 		return;
 
 	/*
@@ -870,11 +870,11 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 		 */
 		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);
 
-		XLogRecPtr	endOfSegment = (((wantPage / state->wal_segment_size) + 1)
-									* state->wal_segment_size) - 1;
+		XLogRecPtr	endOfSegment = (((wantPage / state->seg.size) + 1)
+									* state->seg.size) - 1;
 
-		Assert(wantPage / state->wal_segment_size ==
-			   endOfSegment / state->wal_segment_size);
+		Assert(wantPage / state->seg.size ==
+			   endOfSegment / state->seg.size);
 
 		/*
 		 * Find the timeline of the last LSN on the segment containing
@@ -1022,9 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
 			 XLOG_BLCKSZ);
-	state->readPageTLI = pageTLI;
+	state->seg.tli = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4296fe8fee..6dfb525e1a 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -128,16 +128,7 @@ bool		log_replication_commands = false;
  */
 bool		wake_wal_senders = false;
 
-/*
- * These variables are used similarly to openLogFile/SegNo/Off,
- * but for walsender to read the XLOG.
- */
-static int	sendFile = -1;
-static XLogSegNo sendSegNo = 0;
-static uint32 sendOff = 0;
-
-/* Timeline ID of the currently open file */
-static TimeLineID curFileTimeLine = 0;
+static XLogSegment *sendSeg = NULL;
 
 /*
  * These variables keep track of the state of the timeline we're currently
@@ -285,6 +276,10 @@ InitWalSender(void)
 
 	/* Initialize empty timestamp buffer for lag tracking. */
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
+
+	/* Make sure we can remember the current read position in XLOG. */
+	sendSeg = (XLogSegment *) MemoryContextAlloc(TopMemoryContext, sizeof(XLogSegment));
+	XLogSegmentInit(sendSeg, wal_segment_size);
 }
 
 /*
@@ -301,10 +296,10 @@ WalSndErrorCleanup(void)
 	ConditionVariableCancelSleep();
 	pgstat_report_wait_end();
 
-	if (sendFile >= 0)
+	if (sendSeg->file >= 0)
 	{
-		close(sendFile);
-		sendFile = -1;
+		close(sendSeg->file);
+		sendSeg->file = -1;
 	}
 
 	if (MyReplicationSlot != NULL)
@@ -2378,15 +2373,16 @@ retry:
 
 		startoff = XLogSegmentOffset(recptr, wal_segment_size);
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, wal_segment_size))
+		if (sendSeg->file < 0 ||
+			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
 		{
 			char		path[MAXPGPATH];
 
 			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
+			if (sendSeg->file >= 0)
+				close(sendSeg->file);
 
-			XLByteToSeg(recptr, sendSegNo, wal_segment_size);
+			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
 
 			/*-------
 			 * When reading from a historic timeline, and there is a timeline
@@ -2414,20 +2410,20 @@ retry:
 			 * used portion of the old segment is copied to the new file.
 			 *-------
 			 */
-			curFileTimeLine = sendTimeLine;
+			sendSeg->tli = sendTimeLine;
 			if (sendTimeLineIsHistoric)
 			{
 				XLogSegNo	endSegNo;
 
 				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSegNo == endSegNo)
-					curFileTimeLine = sendTimeLineNextTLI;
+				if (sendSeg->num == endSegNo)
+					sendSeg->tli = sendTimeLineNextTLI;
 			}
 
-			XLogFilePath(path, curFileTimeLine, sendSegNo, wal_segment_size);
+			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
 
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendFile < 0)
+			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+			if (sendSeg->file < 0)
 			{
 				/*
 				 * If the file is not found, assume it's because the standby
@@ -2438,26 +2434,26 @@ retry:
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(curFileTimeLine, sendSegNo))));
+									XLogFileNameP(sendSeg->tli, sendSeg->num))));
 				else
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("could not open file \"%s\": %m",
 									path)));
 			}
-			sendOff = 0;
+			sendSeg->off = 0;
 		}
 
 		/* Need to seek in the file? */
-		if (sendOff != startoff)
+		if (sendSeg->off != startoff)
 		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
+			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(curFileTimeLine, sendSegNo),
+								XLogFileNameP(sendSeg->tli, sendSeg->num),
 								startoff)));
-			sendOff = startoff;
+			sendSeg->off = startoff;
 		}
 
 		/* How many bytes are within this segment? */
@@ -2467,29 +2463,29 @@ retry:
 			segbytes = nbytes;
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
+		readbytes = read(sendSeg->file, p, segbytes);
 		pgstat_report_wait_end();
 		if (readbytes < 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, (Size) segbytes)));
 		}
 		else if (readbytes == 0)
 		{
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, readbytes, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, readbytes, (Size) segbytes)));
 		}
 
 		/* Update state for read */
 		recptr += readbytes;
 
-		sendOff += readbytes;
+		sendSeg->off += readbytes;
 		nbytes -= readbytes;
 		p += readbytes;
 	}
@@ -2520,10 +2516,10 @@ retry:
 		walsnd->needreload = false;
 		SpinLockRelease(&walsnd->mutex);
 
-		if (reload && sendFile >= 0)
+		if (reload && sendSeg->file >= 0)
 		{
-			close(sendFile);
-			sendFile = -1;
+			close(sendSeg->file);
+			sendSeg->file = -1;
 
 			goto retry;
 		}
@@ -2689,9 +2685,9 @@ XLogSendPhysical(void)
 	if (sendTimeLineIsHistoric && sendTimeLineValidUpto <= sentPtr)
 	{
 		/* close the current file. */
-		if (sendFile >= 0)
-			close(sendFile);
-		sendFile = -1;
+		if (sendSeg->file >= 0)
+			close(sendSeg->file);
+		sendSeg->file = -1;
 
 		/* Send CopyDone */
 		pq_putmessage_noblock('c', NULL, 0);
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index ecb0533436..a8a0cc4da2 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -321,7 +321,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->seg.tli = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2283c553b5..b31b6cdcaf 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1098,6 +1098,9 @@ main(int argc, char **argv)
 	if (!xlogreader_state)
 		fatal_error("out of memory");
 
+	/* Finalize the segment pointer. */
+	xlogreader_state->seg.dir = private.inpath;
+
 	/* first find a valid recptr to start from */
 	first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index fade6b2e7d..0d801d5903 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -27,6 +27,20 @@
 
 #include "access/xlogrecord.h"
 
+/*
+ * Position in XLOG file while reading it.
+ */
+typedef struct XLogSegment
+{
+	int			file;			/* segment file descriptor */
+	XLogSegNo	num;			/* segment number */
+	uint32		off;			/* offset in the segment */
+	TimeLineID	tli;			/* timeline ID of the currently open file */
+
+	char	   *dir;			/* directory (only needed by frontends) */
+	int			size;			/* segment size */
+} XLogSegment;
+
 typedef struct XLogReaderState XLogReaderState;
 
 /* Function type definition for the read_page callback */
@@ -73,11 +87,6 @@ struct XLogReaderState
 	 */
 
 	/*
-	 * Segment size of the to-be-parsed data (mandatory).
-	 */
-	int			wal_segment_size;
-
-	/*
 	 * Data input callback (mandatory).
 	 *
 	 * This callback shall read at least reqLen valid bytes of the xlog page
@@ -94,8 +103,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set ->readPageTLI to the TLI of the file the page
-	 * was read from.
+	 * The callback shall set ->seg.tli to the TLI of the file the page was
+	 * read from.
 	 */
 	XLogPageReadCB read_page;
 
@@ -150,10 +159,8 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
-	/* last read segment, segment offset, TLI for data currently in readBuf */
-	XLogSegNo	readSegNo;
-	uint32		readOff;
-	TimeLineID	readPageTLI;
+	/* last read XLOG position for data currently in readBuf */
+	XLogSegment seg;
 
 	/*
 	 * beginning of prior page read, and its TLI.  Doesn't necessarily
@@ -214,6 +221,8 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+extern void XLogSegmentInit(XLogSegment *seg, int size);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,

v03_004_Use_only_one_implementation_of_XLogRead.patchtext/x-diffDownload

Use only xlogreader.c:XLogRead().

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 98cc5d6d9f..42458424f6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -26,6 +28,7 @@
 #include "replication/origin.h"
 
 #ifndef FRONTEND
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -1020,7 +1023,132 @@ XLogSegmentInit(XLogSegment *seg, int size)
 	seg->tli = 0;
 	seg->dir = NULL;
 	seg->size = size;
+	seg->last_req = 0;
+}
+
+/*
+ * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'. If
+ * tli is passed, get the data from timeline *tli. 'pos' is the current
+ * position in the XLOG file and openSegment is a callback that opens the next
+ * segment for reading.
+ *
+ * Returns true if the call succeeded, false if it failed. Caller should check
+ * errno in the case of failure. seg->last_req might also be useful for error
+ * messages.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+bool
+XLogRead(char *buf, XLogRecPtr startptr, Size count,
+		 TimeLineID *tli, XLogSegment *seg, XLogOpenSegment openSegment)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		int			readbytes;
+
+		seg->off = XLogSegmentOffset(recptr, seg->size);
+
+		if (seg->file < 0 ||
+			!XLByteInSeg(recptr, seg->num, seg->size) ||
+			(tli != NULL && *tli != seg->tli))
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (seg->file >= 0)
+				close(seg->file);
+
+			XLByteToSeg(recptr, nextSegNo, seg->size);
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, tli, seg);
+
+			/*
+			 * If the function is called by the XLOG reader, the reader will
+			 * eventually set both "num" and "off". However we need to care
+			 * about them too because the function can also be used directly,
+			 * see walsender.c.
+			 */
+			seg->num = nextSegNo;
+			seg->off = 0;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (seg->size - seg->off))
+			seg->last_req = seg->size - seg->off;
+		else
+			seg->last_req = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/*
+		 * Failure to read the data does not necessarily imply non-zero errno.
+		 * Set it to zero so that caller can distinguish the failure that does
+		 * not affect errno.
+		 */
+		errno = 0;
+
+		readbytes = pg_pread(seg->file, p, seg->last_req, seg->off);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+			return false;
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+
+		/*
+		 * If the function is called by the XLOG reader, the reader will
+		 * eventually set this field. However we need to care about it too
+		 * because the function can also be used directly (see walsender.c).
+		 */
+		seg->off += readbytes;
+	}
+
+	return true;
+}
+
+#ifndef FRONTEND
+/*
+ * Backend-specific code to handle errors encountered by XLogRead().
+ */
+void
+XLogReadProcessError(XLogSegment *seg)
+{
+	if (errno != 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						XLogFileNameP(seg->tli, seg->num), seg->off,
+						(Size) seg->last_req)));
+	}
+	else
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: length %zu",
+						XLogFileNameP(seg->tli, seg->num), seg->off,
+						(Size) seg->last_req)));
+	}
 }
+#endif
 
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 836c2e2927..899bf1b551 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -896,6 +896,35 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+read_local_xlog_page_open_segment(XLogSegNo nextSegNo, TimeLineID *tli,
+								  XLogSegment *seg)
+{
+	char		path[MAXPGPATH];
+
+	XLogFilePath(path, *tli, nextSegNo, seg->size);
+	seg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (seg->file < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	seg->tli = *tli;
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -1022,10 +1051,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
-			 XLOG_BLCKSZ);
-	state->seg.tli = pageTLI;
-
+	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, &pageTLI,
+				  &state->seg, read_local_xlog_page_open_segment))
+		XLogReadProcessError(&state->seg);
 	/* number of valid bytes in the buffer */
 	return count;
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b03a6d7bb8..116b8b5158 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -247,7 +247,9 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli,
+				  XLogSegment *seg);
+static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -782,7 +784,9 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, NULL, sendSeg,
+				  WalSndOpenSegment))
+		XLogReadProcessError(sendSeg);
 
 	return count;
 }
@@ -2358,7 +2362,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2531,6 +2535,76 @@ retry:
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+void
+WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+{
+	char		path[MAXPGPATH];
+
+	/*
+	 * The timeline is determined below, caller should not do anything about
+	 * it.
+	 */
+	Assert(tli == NULL);
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	seg->tli = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, seg->size);
+		if (seg->num == endSegNo)
+			seg->tli = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, seg->tli, nextSegNo, seg->size);
+	seg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (seg->file < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(seg->tli, seg->num))));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+}
+
 /*
  * Send out the WAL in its normal physical/stored form.
  *
@@ -2548,6 +2622,7 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2763,7 +2838,49 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if (!XLogRead(&output_message.data[output_message.len], startptr, nbytes,
+				  NULL,				/* WalSndOpenSegment will determine TLI */
+				  sendSeg,
+				  WalSndOpenSegment))
+		XLogReadProcessError(sendSeg);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(startptr, segno, wal_segment_size);
+	CheckXLogRemoved(segno, ThisTimeLineID);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->file >= 0)
+		{
+			close(sendSeg->file);
+			sendSeg->file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f9df8468a1..e3656554ef 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -296,6 +296,45 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 		fatal_error("could not find any WAL file");
 }
 
+static void
+XLogDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+{
+	char		fname[MAXPGPATH];
+	int			tries;
+
+	XLogFileName(fname, *tli, nextSegNo, seg->size);
+
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
+	{
+		seg->file = open_file_in_directory(seg->dir, fname);
+		if (seg->file >= 0)
+			break;
+		if (errno == ENOENT)
+		{
+			int			save_errno = errno;
+
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
+			errno = save_errno;
+			continue;
+		}
+		/* Any other error, fall through and fail */
+		break;
+	}
+
+	if (seg->file < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
+	seg->tli = *tli;
+}
+
 /*
  * Read count bytes from a segment file in the specified directory, for the
  * given timeline, containing the specified record pointer; store the data in
@@ -441,8 +480,24 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(private->inpath, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if (!XLogRead(readBuff, targetPagePtr, count, &private->timeline,
+				  &state->seg, XLogDumpOpenSegment))
+	{
+		XLogSegment	*seg = &state->seg;
+		int			err = errno;
+		char		fname[MAXPGPATH];
+		int			save_errno = errno;
+
+		XLogFileName(fname, seg->tli, seg->num, seg->size);
+		errno = save_errno;
+
+		if (errno != 0)
+			fatal_error("could not read from log file %s, offset %u, length %zu: %s",
+						fname, seg->off, (Size) seg->last_req, strerror(err));
+		else
+			fatal_error("could not read from log file %s, offset %u: length: %zu",
+						fname, seg->off, (Size) seg->last_req);
+	}
 
 	return count;
 }
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0d801d5903..7e17909d57 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@ typedef struct XLogSegment
 
 	char	   *dir;			/* directory (only needed by frontends) */
 	int			size;			/* segment size */
+	int			last_req;		/* the amount of data requested last time */
 } XLogSegment;
 
 typedef struct XLogReaderState XLogReaderState;
@@ -221,7 +222,25 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/*
+ * Callback to open the specified XLOG segment nextSegNo in timeline *tli for
+ * reading, and assign the descriptor to ->file. BasicOpenFile() is the
+ * preferred way to open the segment file in backend code, whereas open(2)
+ * should be used in frontend.
+ *
+ * If NULL is passed for tli, the callback must determine the timeline
+ * itself. In any case it's supposed to eventually set ->tli.
+ */
+typedef void (*XLogOpenSegment) (XLogSegNo nextSegNo, TimeLineID *tli,
+								 XLogSegment *seg);
+
 extern void XLogSegmentInit(XLogSegment *seg, int size);
+extern bool XLogRead(char *buf, XLogRecPtr startptr, Size count,
+		 TimeLineID *tli, XLogSegment *seg,
+		 XLogOpenSegment openSegment);
+#ifndef FRONTEND
+void XLogReadProcessError(XLogSegment *seg);
+#endif
 
 /* Functions for decoding an XLogRecord */

v03_005_Cleanup.patchtext/x-diffDownload

Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the XLogOpenSegment callback) is added
nearby at the same time (the addition and removal of code can get mixed in the diff).

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 899bf1b551..c35852d7a1 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -17,14 +17,11 @@
  */
 #include "postgres.h"
 
-#include <unistd.h>
-
 #include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -639,128 +636,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 116b8b5158..bf2e82a179 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -249,7 +249,6 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli,
 				  XLogSegment *seg);
-static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -2350,191 +2349,6 @@ WalSndKill(int code, Datum arg)
 	SpinLockRelease(&walsnd->mutex);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, wal_segment_size);
-
-		if (sendSeg->file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->file >= 0)
-				close(sendSeg->file);
-
-			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSeg->num == endSegNo)
-					sendSeg->tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
-
-			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->tli, sendSeg->num))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->off != startoff)
-		{
-			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->tli, sendSeg->num),
-								startoff)));
-			sendSeg->off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (wal_segment_size - startoff))
-			segbytes = wal_segment_size - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, wal_segment_size);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->file >= 0)
-		{
-			close(sendSeg->file);
-			sendSeg->file = -1;
-
-			goto retry;
-		}
-	}
-}
-
 /*
  * Callback for XLogRead() to open the next segment.
  */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index e3656554ef..1a4577ae7d 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -14,7 +14,6 @@
 
 #include <dirent.h>
 #include <sys/stat.h>
-#include <unistd.h>
 
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
@@ -335,128 +334,6 @@ XLogDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
 	seg->tli = *tli;
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * XLogReader read_page callback
  */

#14

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: Antonin Houska (#13)

Re: Attempt to consolidate reading of XLOG page

On Tue, May 21, 2019 at 9:12 PM Antonin Houska <ah@cybertec.at> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

It seems to me that it's better to unwind the stack i.e. have the
function return the error information to the caller and let the caller
do as it likes.

Thanks for a hint. The next version tries to do that.

Hi Antonin,

Could you please send a fresh rebase for the new Commitfest?

Thanks,

--
Thomas Munro
https://enterprisedb.com

#15

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Thomas Munro (#14)

5 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, May 21, 2019 at 9:12 PM Antonin Houska <ah@cybertec.at> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

It seems to me that it's better to unwind the stack i.e. have the
function return the error information to the caller and let the caller
do as it likes.

Thanks for a hint. The next version tries to do that.

Hi Antonin,

Could you please send a fresh rebase for the new Commitfest?

Rebased.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v04-0001-Make-XLogReaderInvalReadState-static.patchtext/x-diffDownload

From 3d441d7f584182dfbc76c14591fe0c045679b66a Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2019 11:54:15 +0200
Subject: [PATCH 1/5] Make XLogReaderInvalReadState static.

This change is not necessary for the XLogRead() refactoring itself, but I
noticed the problem while working on it. Not sure it's worth a separate CF
entry.
---
 src/backend/access/transam/xlogreader.c | 4 ++--
 src/include/access/xlogreader.h         | 4 ----
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 41dae916b4..08bc9695c1 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -35,10 +35,10 @@ static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
 							XLogRecPtr recptr);
+static void XLogReaderInvalReadState(XLogReaderState *state);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void report_invalid_record(XLogReaderState *state, const char *fmt,...) pg_attribute_printf(2, 3);
-
 static void ResetDecoder(XLogReaderState *state);
 
 /* size of the buffer allocated for error message. */
@@ -620,7 +620,7 @@ err:
 /*
  * Invalidate the xlogreader's read state to force a re-read.
  */
-void
+static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
 	state->readSegNo = 0;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 04228e2a87..a3d3cc1e7b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -212,13 +212,9 @@ extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
 
-/* Invalidate read state */
-extern void XLogReaderInvalReadState(XLogReaderState *state);
-
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
-
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
-- 
2.16.4

v04-0002-Remove-TLI-from-some-argument-lists.patchtext/x-diffDownload

From 4e933993cbf5ad6d8646b064907d3c861ea1536f Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2019 11:54:15 +0200
Subject: [PATCH 2/5] Remove TLI from some argument lists.

The timeline information is available to caller via XLogReaderState. Now that
XLogRead() is gonna be (sometimes) responsible for determining the TLI, it
would have to be added the (TimeLineID *) argument too, just to be consistent
with the current coding style. Since XLogRead() updates also other
position-specific fields of XLogReaderState, it seems simpler if we remove the
output argument from XLogPageReadCB and always report the TLI via
XLogReaderState.
---
 src/backend/access/transam/xlog.c              |  7 +++----
 src/backend/access/transam/xlogreader.c        |  6 +++---
 src/backend/access/transam/xlogutils.c         | 11 ++++++-----
 src/backend/replication/logical/logicalfuncs.c |  4 ++--
 src/backend/replication/walsender.c            |  2 +-
 src/bin/pg_rewind/parsexlog.c                  |  9 ++++-----
 src/bin/pg_waldump/pg_waldump.c                |  2 +-
 src/include/access/xlogreader.h                |  8 +++-----
 src/include/access/xlogutils.h                 |  5 ++---
 src/include/replication/logicalfuncs.h         |  2 +-
 10 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b6c9353cbd..f30c2ce0ce 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -885,8 +885,7 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 int source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-						 TimeLineID *readTLI);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
@@ -11520,7 +11519,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *readTLI)
+			 XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11637,7 +11636,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	*readTLI = curFileTLI;
+	xlogreader->readPageTLI = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 08bc9695c1..ed7a72bc14 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -557,7 +557,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 
@@ -575,7 +575,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 							   state->currRecPtr,
-							   state->readBuf, &state->readPageTLI);
+							   state->readBuf);
 	if (readLen < 0)
 		goto err;
 
@@ -594,7 +594,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 	}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..cba180912b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -909,12 +909,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
-					 TimeLineID *pageTLI)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	read_upto,
 				loc;
 	int			count;
+	TimeLineID	pageTLI;
 
 	loc = targetPagePtr + reqLen;
 
@@ -934,7 +934,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
 
-		*pageTLI = ThisTimeLineID;
+		pageTLI = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -991,7 +991,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			 * nothing cares so long as the timeline doesn't go backwards.  We
 			 * should read the page header instead; FIXME someday.
 			 */
-			*pageTLI = state->currTLI;
+			pageTLI = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1022,8 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
 			 XLOG_BLCKSZ);
+	state->readPageTLI = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index d974400d6e..d1cf80d441 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -116,10 +116,10 @@ check_permissions(void)
 
 int
 logical_read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	return read_local_xlog_page(state, targetPagePtr, reqLen,
-								targetRecPtr, cur_page, pageTLI);
+								targetRecPtr, cur_page);
 }
 
 /*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e7a59b0a92..c8802bbc1f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -763,7 +763,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+					   XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 287af60c4e..00168c27dc 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -47,10 +47,10 @@ typedef struct XLogPageReadPrivate
 	int			tliIndex;
 } XLogPageReadPrivate;
 
+
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-							   TimeLineID *pageTLI);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -238,8 +238,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogreader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-				   TimeLineID *pageTLI)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -322,7 +321,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	*pageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index b95d467805..40c64a0bbf 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -423,7 +423,7 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
  */
 static int
 XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				 XLogRecPtr targetPtr, char *readBuff, TimeLineID *curFileTLI)
+				 XLogRecPtr targetPtr, char *readBuff)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index a3d3cc1e7b..e23223ba5b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -34,8 +34,7 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf,
-							   TimeLineID *pageTLI);
+							   char *readBuf);
 
 typedef struct
 {
@@ -95,9 +94,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set *pageTLI to the TLI of the file the page was
-	 * read from.  It is currently used only for error reporting purposes, to
-	 * reconstruct the name of the WAL file where an error occurred.
+	 * The callback shall set ->readPageTLI to the TLI of the file the page
+	 * was read from.
 	 */
 	XLogPageReadCB read_page;
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 4105b59904..4fb305bafd 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,12 +47,11 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page,
-								 TimeLineID *pageTLI);
+								 XLogRecPtr targetRecPtr, char *cur_page);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 									  XLogRecPtr wantPage, uint32 wantLength);
-
 #endif
diff --git a/src/include/replication/logicalfuncs.h b/src/include/replication/logicalfuncs.h
index a9c178a9e6..012096f183 100644
--- a/src/include/replication/logicalfuncs.h
+++ b/src/include/replication/logicalfuncs.h
@@ -14,6 +14,6 @@
 extern int	logical_read_local_xlog_page(XLogReaderState *state,
 										 XLogRecPtr targetPagePtr,
 										 int reqLen, XLogRecPtr targetRecPtr,
-										 char *cur_page, TimeLineID *pageTLI);
+										 char *cur_page);
 
 #endif
-- 
2.16.4

v04-0003-Introduce-XLogSegment-structure.patchtext/x-diffDownload

From 89c414f99f5548141297c29be244567a607afb87 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2019 11:54:15 +0200
Subject: [PATCH 3/5] Introduce XLogSegment structure.

---
 src/backend/access/transam/xlog.c       |  6 +--
 src/backend/access/transam/xlogreader.c | 57 +++++++++++++++---------
 src/backend/access/transam/xlogutils.c  | 20 ++++-----
 src/backend/replication/walsender.c     | 78 ++++++++++++++++-----------------
 src/bin/pg_rewind/parsexlog.c           |  2 +-
 src/bin/pg_waldump/pg_waldump.c         |  3 ++
 src/include/access/xlogreader.h         | 32 +++++++++-----
 7 files changed, 111 insertions(+), 87 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f30c2ce0ce..d22b6e1938 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4295,7 +4295,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLByteToSeg(xlogreader->latestPagePtr, segno, wal_segment_size);
 			offset = XLogSegmentOffset(xlogreader->latestPagePtr,
 									   wal_segment_size);
-			XLogFileName(fname, xlogreader->readPageTLI, segno,
+			XLogFileName(fname, xlogreader->seg.tli, segno,
 						 wal_segment_size);
 			ereport(emode_for_corrupt_record(emode,
 											 RecPtr ? RecPtr : EndRecPtr),
@@ -7351,7 +7351,7 @@ StartupXLOG(void)
 	 * and we were reading the old WAL from a segment belonging to a higher
 	 * timeline.
 	 */
-	EndOfLogTLI = xlogreader->readPageTLI;
+	EndOfLogTLI = xlogreader->seg.tli;
 
 	/*
 	 * Complain if we did not roll forward far enough to render the backup
@@ -11636,7 +11636,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	xlogreader->readPageTLI = curFileTLI;
+	xlogreader->seg.tli = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ed7a72bc14..bbe383240d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -94,7 +94,9 @@ XLogReaderAllocate(int wal_segment_size, XLogPageReadCB pagereadfunc,
 		return NULL;
 	}
 
-	state->wal_segment_size = wal_segment_size;
+	/* Initialize segment pointer. */
+	XLogSegmentInit(&state->seg, wal_segment_size);
+
 	state->read_page = pagereadfunc;
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
@@ -488,8 +490,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->wal_segment_size - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->wal_segment_size);
+		state->EndRecPtr += state->seg.size - 1;
+		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->seg.size);
 	}
 
 	if (DecodeXLogRecord(state, record, errormsg))
@@ -531,12 +533,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 	Assert((pageptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(pageptr, targetSegNo, state->wal_segment_size);
-	targetPageOff = XLogSegmentOffset(pageptr, state->wal_segment_size);
+	XLByteToSeg(pageptr, targetSegNo, state->seg.size);
+	targetPageOff = XLogSegmentOffset(pageptr, state->seg.size);
 
 	/* check whether we have all the requested data already */
-	if (targetSegNo == state->readSegNo && targetPageOff == state->readOff &&
-		reqLen <= state->readLen)
+	if (targetSegNo == state->seg.num &&
+		targetPageOff == state->seg.off && reqLen <= state->readLen)
 		return state->readLen;
 
 	/*
@@ -551,7 +553,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->readSegNo && targetPageOff != 0)
+	if (targetSegNo != state->seg.num && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
@@ -606,8 +608,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		goto err;
 
 	/* update read state information */
-	state->readSegNo = targetSegNo;
-	state->readOff = targetPageOff;
+	state->seg.num = targetSegNo;
+	state->seg.off = targetPageOff;
 	state->readLen = readLen;
 
 	return readLen;
@@ -623,8 +625,8 @@ err:
 static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
-	state->readSegNo = 0;
-	state->readOff = 0;
+	state->seg.num = 0;
+	state->seg.off = 0;
 	state->readLen = 0;
 }
 
@@ -743,16 +745,16 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	Assert((recptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(recptr, segno, state->wal_segment_size);
-	offset = XLogSegmentOffset(recptr, state->wal_segment_size);
+	XLByteToSeg(recptr, segno, state->seg.size);
+	offset = XLogSegmentOffset(recptr, state->seg.size);
 
-	XLogSegNoOffsetToRecPtr(segno, offset, state->wal_segment_size, recaddr);
+	XLogSegNoOffsetToRecPtr(segno, offset, state->seg.size, recaddr);
 
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid magic number %04X in log segment %s, offset %u",
@@ -766,7 +768,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid info bits %04X in log segment %s, offset %u",
@@ -789,7 +791,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 								  (unsigned long long) state->system_identifier);
 			return false;
 		}
-		else if (longhdr->xlp_seg_size != state->wal_segment_size)
+		else if (longhdr->xlp_seg_size != state->seg.size)
 		{
 			report_invalid_record(state,
 								  "WAL file is from different database system: incorrect segment size in page header");
@@ -806,7 +808,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
@@ -826,7 +828,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "unexpected pageaddr %X/%X in log segment %s, offset %u",
@@ -851,7 +853,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		{
 			char		fname[MAXFNAMELEN];
 
-			XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+			XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 			report_invalid_record(state,
 								  "out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
@@ -995,6 +997,19 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Initialize the passed segment pointer.
+ */
+void
+XLogSegmentInit(XLogSegment *seg, int size)
+{
+	seg->file = -1;
+	seg->num = 0;
+	seg->off = 0;
+	seg->tli = 0;
+	seg->dir = NULL;
+	seg->size = size;
+}
 
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index cba180912b..836c2e2927 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -802,8 +802,8 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
 void
 XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 {
-	const XLogRecPtr lastReadPage = state->readSegNo *
-	state->wal_segment_size + state->readOff;
+	const XLogRecPtr lastReadPage = state->seg.num *
+	state->seg.size + state->seg.off;
 
 	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
 	Assert(wantLength <= XLOG_BLCKSZ);
@@ -847,8 +847,8 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
 		state->currTLI != ThisTimeLineID &&
 		state->currTLI != 0 &&
-		((wantPage + wantLength) / state->wal_segment_size) <
-		(state->currTLIValidUntil / state->wal_segment_size))
+		((wantPage + wantLength) / state->seg.size) <
+		(state->currTLIValidUntil / state->seg.size))
 		return;
 
 	/*
@@ -870,11 +870,11 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 		 */
 		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);
 
-		XLogRecPtr	endOfSegment = (((wantPage / state->wal_segment_size) + 1)
-									* state->wal_segment_size) - 1;
+		XLogRecPtr	endOfSegment = (((wantPage / state->seg.size) + 1)
+									* state->seg.size) - 1;
 
-		Assert(wantPage / state->wal_segment_size ==
-			   endOfSegment / state->wal_segment_size);
+		Assert(wantPage / state->seg.size ==
+			   endOfSegment / state->seg.size);
 
 		/*
 		 * Find the timeline of the last LSN on the segment containing
@@ -1022,9 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
 			 XLOG_BLCKSZ);
-	state->readPageTLI = pageTLI;
+	state->seg.tli = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c8802bbc1f..3b5fe2cb94 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -128,16 +128,7 @@ bool		log_replication_commands = false;
  */
 bool		wake_wal_senders = false;
 
-/*
- * These variables are used similarly to openLogFile/SegNo/Off,
- * but for walsender to read the XLOG.
- */
-static int	sendFile = -1;
-static XLogSegNo sendSegNo = 0;
-static uint32 sendOff = 0;
-
-/* Timeline ID of the currently open file */
-static TimeLineID curFileTimeLine = 0;
+static XLogSegment *sendSeg = NULL;
 
 /*
  * These variables keep track of the state of the timeline we're currently
@@ -285,6 +276,10 @@ InitWalSender(void)
 
 	/* Initialize empty timestamp buffer for lag tracking. */
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
+
+	/* Make sure we can remember the current read position in XLOG. */
+	sendSeg = (XLogSegment *) MemoryContextAlloc(TopMemoryContext, sizeof(XLogSegment));
+	XLogSegmentInit(sendSeg, wal_segment_size);
 }
 
 /*
@@ -301,10 +296,10 @@ WalSndErrorCleanup(void)
 	ConditionVariableCancelSleep();
 	pgstat_report_wait_end();
 
-	if (sendFile >= 0)
+	if (sendSeg->file >= 0)
 	{
-		close(sendFile);
-		sendFile = -1;
+		close(sendSeg->file);
+		sendSeg->file = -1;
 	}
 
 	if (MyReplicationSlot != NULL)
@@ -2384,15 +2379,16 @@ retry:
 
 		startoff = XLogSegmentOffset(recptr, wal_segment_size);
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, wal_segment_size))
+		if (sendSeg->file < 0 ||
+			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
 		{
 			char		path[MAXPGPATH];
 
 			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
+			if (sendSeg->file >= 0)
+				close(sendSeg->file);
 
-			XLByteToSeg(recptr, sendSegNo, wal_segment_size);
+			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
 
 			/*-------
 			 * When reading from a historic timeline, and there is a timeline
@@ -2420,20 +2416,20 @@ retry:
 			 * used portion of the old segment is copied to the new file.
 			 *-------
 			 */
-			curFileTimeLine = sendTimeLine;
+			sendSeg->tli = sendTimeLine;
 			if (sendTimeLineIsHistoric)
 			{
 				XLogSegNo	endSegNo;
 
 				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSegNo == endSegNo)
-					curFileTimeLine = sendTimeLineNextTLI;
+				if (sendSeg->num == endSegNo)
+					sendSeg->tli = sendTimeLineNextTLI;
 			}
 
-			XLogFilePath(path, curFileTimeLine, sendSegNo, wal_segment_size);
+			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
 
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendFile < 0)
+			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+			if (sendSeg->file < 0)
 			{
 				/*
 				 * If the file is not found, assume it's because the standby
@@ -2444,26 +2440,26 @@ retry:
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(curFileTimeLine, sendSegNo))));
+									XLogFileNameP(sendSeg->tli, sendSeg->num))));
 				else
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("could not open file \"%s\": %m",
 									path)));
 			}
-			sendOff = 0;
+			sendSeg->off = 0;
 		}
 
 		/* Need to seek in the file? */
-		if (sendOff != startoff)
+		if (sendSeg->off != startoff)
 		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
+			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(curFileTimeLine, sendSegNo),
+								XLogFileNameP(sendSeg->tli, sendSeg->num),
 								startoff)));
-			sendOff = startoff;
+			sendSeg->off = startoff;
 		}
 
 		/* How many bytes are within this segment? */
@@ -2473,29 +2469,29 @@ retry:
 			segbytes = nbytes;
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
+		readbytes = read(sendSeg->file, p, segbytes);
 		pgstat_report_wait_end();
 		if (readbytes < 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, (Size) segbytes)));
 		}
 		else if (readbytes == 0)
 		{
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, readbytes, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, readbytes, (Size) segbytes)));
 		}
 
 		/* Update state for read */
 		recptr += readbytes;
 
-		sendOff += readbytes;
+		sendSeg->off += readbytes;
 		nbytes -= readbytes;
 		p += readbytes;
 	}
@@ -2526,10 +2522,10 @@ retry:
 		walsnd->needreload = false;
 		SpinLockRelease(&walsnd->mutex);
 
-		if (reload && sendFile >= 0)
+		if (reload && sendSeg->file >= 0)
 		{
-			close(sendFile);
-			sendFile = -1;
+			close(sendSeg->file);
+			sendSeg->file = -1;
 
 			goto retry;
 		}
@@ -2695,9 +2691,9 @@ XLogSendPhysical(void)
 	if (sendTimeLineIsHistoric && sendTimeLineValidUpto <= sentPtr)
 	{
 		/* close the current file. */
-		if (sendFile >= 0)
-			close(sendFile);
-		sendFile = -1;
+		if (sendSeg->file >= 0)
+			close(sendSeg->file);
+		sendSeg->file = -1;
 
 		/* Send CopyDone */
 		pq_putmessage_noblock('c', NULL, 0);
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 00168c27dc..a11b46f2b4 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -321,7 +321,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->seg.tli = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 40c64a0bbf..a16793bb8b 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1105,6 +1105,9 @@ main(int argc, char **argv)
 	if (!xlogreader_state)
 		fatal_error("out of memory");
 
+	/* Finalize the segment pointer. */
+	xlogreader_state->seg.dir = private.inpath;
+
 	/* first find a valid recptr to start from */
 	first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e23223ba5b..26da3613f7 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -27,6 +27,20 @@
 
 #include "access/xlogrecord.h"
 
+/*
+ * Position in XLOG file while reading it.
+ */
+typedef struct XLogSegment
+{
+	int			file;			/* segment file descriptor */
+	XLogSegNo	num;			/* segment number */
+	uint32		off;			/* offset in the segment */
+	TimeLineID	tli;			/* timeline ID of the currently open file */
+
+	char	   *dir;			/* directory (only needed by frontends) */
+	int			size;			/* segment size */
+} XLogSegment;
+
 typedef struct XLogReaderState XLogReaderState;
 
 /* Function type definition for the read_page callback */
@@ -72,11 +86,6 @@ struct XLogReaderState
 	 * ----------------------------------------
 	 */
 
-	/*
-	 * Segment size of the to-be-parsed data (mandatory).
-	 */
-	int			wal_segment_size;
-
 	/*
 	 * Data input callback (mandatory).
 	 *
@@ -94,8 +103,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set ->readPageTLI to the TLI of the file the page
-	 * was read from.
+	 * The callback shall set ->seg.tli to the TLI of the file the page was
+	 * read from.
 	 */
 	XLogPageReadCB read_page;
 
@@ -150,10 +159,8 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
-	/* last read segment, segment offset, TLI for data currently in readBuf */
-	XLogSegNo	readSegNo;
-	uint32		readOff;
-	TimeLineID	readPageTLI;
+	/* last read XLOG position for data currently in readBuf */
+	XLogSegment seg;
 
 	/*
 	 * beginning of prior page read, and its TLI.  Doesn't necessarily
@@ -213,6 +220,9 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+
+extern void XLogSegmentInit(XLogSegment *seg, int size);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
-- 
2.16.4

v04-0004-Use-only-xlogreader.c-XLogRead.patchtext/x-diffDownload

From 2f805cb04d0baad311a41993a7be6caab72d602f Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2019 11:54:16 +0200
Subject: [PATCH 4/5] Use only xlogreader.c:XLogRead()

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.
---
 src/backend/access/transam/xlogreader.c | 128 ++++++++++++++++++++++++++++++++
 src/backend/access/transam/xlogutils.c  |  40 ++++++++--
 src/backend/replication/walsender.c     | 125 ++++++++++++++++++++++++++++++-
 src/bin/pg_waldump/pg_waldump.c         |  59 ++++++++++++++-
 src/include/access/xlogreader.h         |  19 +++++
 5 files changed, 359 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index bbe383240d..3f8db2f25d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -26,6 +28,7 @@
 #include "replication/origin.h"
 
 #ifndef FRONTEND
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -1009,7 +1012,132 @@ XLogSegmentInit(XLogSegment *seg, int size)
 	seg->tli = 0;
 	seg->dir = NULL;
 	seg->size = size;
+	seg->last_req = 0;
+}
+
+/*
+ * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'. If
+ * tli is passed, get the data from timeline *tli. 'pos' is the current
+ * position in the XLOG file and openSegment is a callback that opens the next
+ * segment for reading.
+ *
+ * Returns true if the call succeeded, false if it failed. Caller should check
+ * errno in the case of failure. seg->last_req might also be useful for error
+ * messages.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+bool
+XLogRead(char *buf, XLogRecPtr startptr, Size count,
+		 TimeLineID *tli, XLogSegment *seg, XLogOpenSegment openSegment)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		int			readbytes;
+
+		seg->off = XLogSegmentOffset(recptr, seg->size);
+
+		if (seg->file < 0 ||
+			!XLByteInSeg(recptr, seg->num, seg->size) ||
+			(tli != NULL && *tli != seg->tli))
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (seg->file >= 0)
+				close(seg->file);
+
+			XLByteToSeg(recptr, nextSegNo, seg->size);
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, tli, seg);
+
+			/*
+			 * If the function is called by the XLOG reader, the reader will
+			 * eventually set both "num" and "off". However we need to care
+			 * about them too because the function can also be used directly,
+			 * see walsender.c.
+			 */
+			seg->num = nextSegNo;
+			seg->off = 0;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (seg->size - seg->off))
+			seg->last_req = seg->size - seg->off;
+		else
+			seg->last_req = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/*
+		 * Failure to read the data does not necessarily imply non-zero errno.
+		 * Set it to zero so that caller can distinguish the failure that does
+		 * not affect errno.
+		 */
+		errno = 0;
+
+		readbytes = pg_pread(seg->file, p, seg->last_req, seg->off);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+			return false;
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+
+		/*
+		 * If the function is called by the XLOG reader, the reader will
+		 * eventually set this field. However we need to care about it too
+		 * because the function can also be used directly (see walsender.c).
+		 */
+		seg->off += readbytes;
+	}
+
+	return true;
+}
+
+#ifndef FRONTEND
+/*
+ * Backend-specific code to handle errors encountered by XLogRead().
+ */
+void
+XLogReadProcessError(XLogSegment *seg)
+{
+	if (errno != 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						XLogFileNameP(seg->tli, seg->num), seg->off,
+						(Size) seg->last_req)));
+	}
+	else
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: length %zu",
+						XLogFileNameP(seg->tli, seg->num), seg->off,
+						(Size) seg->last_req)));
+	}
 }
+#endif
 
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 836c2e2927..899bf1b551 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -896,6 +896,35 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+read_local_xlog_page_open_segment(XLogSegNo nextSegNo, TimeLineID *tli,
+								  XLogSegment *seg)
+{
+	char		path[MAXPGPATH];
+
+	XLogFilePath(path, *tli, nextSegNo, seg->size);
+	seg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (seg->file < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	seg->tli = *tli;
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -1022,10 +1051,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
-			 XLOG_BLCKSZ);
-	state->seg.tli = pageTLI;
-
+	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, &pageTLI,
+				  &state->seg, read_local_xlog_page_open_segment))
+		XLogReadProcessError(&state->seg);
 	/* number of valid bytes in the buffer */
 	return count;
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3b5fe2cb94..c64a4bf9f8 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -247,7 +247,9 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli,
+							  XLogSegment *seg);
+static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -782,7 +784,9 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, NULL, sendSeg,
+				  WalSndOpenSegment))
+		XLogReadProcessError(sendSeg);
 
 	return count;
 }
@@ -2359,7 +2363,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2532,6 +2536,76 @@ retry:
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+void
+WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+{
+	char		path[MAXPGPATH];
+
+	/*
+	 * The timeline is determined below, caller should not do anything about
+	 * it.
+	 */
+	Assert(tli == NULL);
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	seg->tli = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, seg->size);
+		if (seg->num == endSegNo)
+			seg->tli = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, seg->tli, nextSegNo, seg->size);
+	seg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (seg->file < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(seg->tli, seg->num))));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+}
+
 /*
  * Send out the WAL in its normal physical/stored form.
  *
@@ -2549,6 +2623,7 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2764,7 +2839,49 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if (!XLogRead(&output_message.data[output_message.len], startptr, nbytes,
+				  NULL,			/* WalSndOpenSegment will determine TLI */
+				  sendSeg,
+				  WalSndOpenSegment))
+		XLogReadProcessError(sendSeg);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(startptr, segno, wal_segment_size);
+	CheckXLogRemoved(segno, ThisTimeLineID);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->file >= 0)
+		{
+			close(sendSeg->file);
+			sendSeg->file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a16793bb8b..3e09519f88 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -296,6 +296,45 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 		fatal_error("could not find any WAL file");
 }
 
+static void
+XLogDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+{
+	char		fname[MAXPGPATH];
+	int			tries;
+
+	XLogFileName(fname, *tli, nextSegNo, seg->size);
+
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
+	{
+		seg->file = open_file_in_directory(seg->dir, fname);
+		if (seg->file >= 0)
+			break;
+		if (errno == ENOENT)
+		{
+			int			save_errno = errno;
+
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
+			errno = save_errno;
+			continue;
+		}
+		/* Any other error, fall through and fail */
+		break;
+	}
+
+	if (seg->file < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
+	seg->tli = *tli;
+}
+
 /*
  * Read count bytes from a segment file in the specified directory, for the
  * given timeline, containing the specified record pointer; store the data in
@@ -441,8 +480,24 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(private->inpath, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if (!XLogRead(readBuff, targetPagePtr, count, &private->timeline,
+				  &state->seg, XLogDumpOpenSegment))
+	{
+		XLogSegment *seg = &state->seg;
+		int			err = errno;
+		char		fname[MAXPGPATH];
+		int			save_errno = errno;
+
+		XLogFileName(fname, seg->tli, seg->num, seg->size);
+		errno = save_errno;
+
+		if (errno != 0)
+			fatal_error("could not read from log file %s, offset %u, length %zu: %s",
+						fname, seg->off, (Size) seg->last_req, strerror(err));
+		else
+			fatal_error("could not read from log file %s, offset %u: length: %zu",
+						fname, seg->off, (Size) seg->last_req);
+	}
 
 	return count;
 }
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 26da3613f7..856e1b2e00 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@ typedef struct XLogSegment
 
 	char	   *dir;			/* directory (only needed by frontends) */
 	int			size;			/* segment size */
+	int			last_req;		/* the amount of data requested last time */
 } XLogSegment;
 
 typedef struct XLogReaderState XLogReaderState;
@@ -221,7 +222,25 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/*
+ * Callback to open the specified XLOG segment nextSegNo in timeline *tli for
+ * reading, and assign the descriptor to ->file. BasicOpenFile() is the
+ * preferred way to open the segment file in backend code, whereas open(2)
+ * should be used in frontend.
+ *
+ * If NULL is passed for tli, the callback must determine the timeline
+ * itself. In any case it's supposed to eventually set ->tli.
+ */
+typedef void (*XLogOpenSegment) (XLogSegNo nextSegNo, TimeLineID *tli,
+								 XLogSegment *seg);
+
 extern void XLogSegmentInit(XLogSegment *seg, int size);
+extern bool XLogRead(char *buf, XLogRecPtr startptr, Size count,
+					 TimeLineID *tli, XLogSegment *seg,
+					 XLogOpenSegment openSegment);
+#ifndef FRONTEND
+void		XLogReadProcessError(XLogSegment *seg);
+#endif
 
 /* Functions for decoding an XLogRecord */
 
-- 
2.16.4

v04-0005-Remove-the-old-implemenations-of-XLogRead.patchtext/x-diffDownload

From dc93e5e49e00738f083ad78f0aa687520fd65dcb Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 9 Jul 2019 11:54:16 +0200
Subject: [PATCH 5/5] Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the XLogOpenSegment callback) is added
nearby at the same time (the addition and removal of code can get mixed in the
diff).
---
 src/backend/access/transam/xlogutils.c | 125 ----------------------
 src/backend/replication/walsender.c    | 186 ---------------------------------
 src/bin/pg_waldump/pg_waldump.c        | 123 ----------------------
 3 files changed, 434 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 899bf1b551..c35852d7a1 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -17,14 +17,11 @@
  */
 #include "postgres.h"
 
-#include <unistd.h>
-
 #include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -639,128 +636,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c64a4bf9f8..68715aed9e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -249,7 +249,6 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli,
 							  XLogSegment *seg);
-static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -2351,191 +2350,6 @@ WalSndKill(int code, Datum arg)
 	SpinLockRelease(&walsnd->mutex);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, wal_segment_size);
-
-		if (sendSeg->file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->file >= 0)
-				close(sendSeg->file);
-
-			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSeg->num == endSegNo)
-					sendSeg->tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
-
-			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->tli, sendSeg->num))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->off != startoff)
-		{
-			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->tli, sendSeg->num),
-								startoff)));
-			sendSeg->off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (wal_segment_size - startoff))
-			segbytes = wal_segment_size - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, wal_segment_size);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->file >= 0)
-		{
-			close(sendSeg->file);
-			sendSeg->file = -1;
-
-			goto retry;
-		}
-	}
-}
-
 /*
  * Callback for XLogRead() to open the next segment.
  */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 3e09519f88..cd44090d36 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -14,7 +14,6 @@
 
 #include <dirent.h>
 #include <sys/stat.h>
-#include <unistd.h>
 
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
@@ -335,128 +334,6 @@ XLogDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
 	seg->tli = *tli;
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * XLogReader read_page callback
  */
-- 
2.16.4

#16

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Antonin Houska (#15)

Re: Attempt to consolidate reading of XLOG page

Hi Antonin, could you please rebase again?

Thanks

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Antonin Houska (#15)

Re: Attempt to consolidate reading of XLOG page

Pushed 0001.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Alvaro Herrera (#16)

4 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Hi Antonin, could you please rebase again?

Attached.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v05-0002-Remove-TLI-from-some-argument-lists.patchtext/x-diffDownload

From 7b15d7bbdae13c743ddeae10b8ff79e9b02d8243 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Sep 2019 11:53:54 +0200
Subject: [PATCH 1/4] Remove TLI from some argument lists.

The timeline information is available to caller via XLogReaderState. Now that
XLogRead() is gonna be (sometimes) responsible for determining the TLI, it
would have to be added the (TimeLineID *) argument too, just to be consistent
with the current coding style. Since XLogRead() updates also other
position-specific fields of XLogReaderState, it seems simpler if we remove the
output argument from XLogPageReadCB and always report the TLI via
XLogReaderState.
---
 src/backend/access/transam/xlog.c              |  7 +++----
 src/backend/access/transam/xlogreader.c        |  6 +++---
 src/backend/access/transam/xlogutils.c         | 11 ++++++-----
 src/backend/replication/logical/logicalfuncs.c |  4 ++--
 src/backend/replication/walsender.c            |  2 +-
 src/bin/pg_rewind/parsexlog.c                  |  9 ++++-----
 src/bin/pg_waldump/pg_waldump.c                |  2 +-
 src/include/access/xlogreader.h                |  8 +++-----
 src/include/access/xlogutils.h                 |  5 ++---
 src/include/replication/logicalfuncs.h         |  2 +-
 10 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6876537b62..cd948dbefc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -885,8 +885,7 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 int source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-						 TimeLineID *readTLI);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
@@ -11523,7 +11522,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *readTLI)
+			 XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11640,7 +11639,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	*readTLI = curFileTLI;
+	xlogreader->readPageTLI = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a66e3324b1..2184f4291d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -559,7 +559,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 
@@ -577,7 +577,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 							   state->currRecPtr,
-							   state->readBuf, &state->readPageTLI);
+							   state->readBuf);
 	if (readLen < 0)
 		goto err;
 
@@ -596,7 +596,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 	}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 1fc39333f1..680bed8278 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -909,12 +909,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
-					 TimeLineID *pageTLI)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	read_upto,
 				loc;
 	int			count;
+	TimeLineID	pageTLI;
 
 	loc = targetPagePtr + reqLen;
 
@@ -934,7 +934,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
 
-		*pageTLI = ThisTimeLineID;
+		pageTLI = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -991,7 +991,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			 * nothing cares so long as the timeline doesn't go backwards.  We
 			 * should read the page header instead; FIXME someday.
 			 */
-			*pageTLI = state->currTLI;
+			pageTLI = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1022,8 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
 			 XLOG_BLCKSZ);
+	state->readPageTLI = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index d974400d6e..d1cf80d441 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -116,10 +116,10 @@ check_permissions(void)
 
 int
 logical_read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	return read_local_xlog_page(state, targetPagePtr, reqLen,
-								targetRecPtr, cur_page, pageTLI);
+								targetRecPtr, cur_page);
 }
 
 /*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 23870a25a5..28d8c31af8 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -763,7 +763,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+					   XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 63c3879ead..0a89f9c02a 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -47,10 +47,10 @@ typedef struct XLogPageReadPrivate
 	int			tliIndex;
 } XLogPageReadPrivate;
 
+
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-							   TimeLineID *pageTLI);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -237,8 +237,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogReader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-				   TimeLineID *pageTLI)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -321,7 +320,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	*pageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index b95d467805..40c64a0bbf 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -423,7 +423,7 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
  */
 static int
 XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				 XLogRecPtr targetPtr, char *readBuff, TimeLineID *curFileTLI)
+				 XLogRecPtr targetPtr, char *readBuff)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 735b1bd2fd..d64a9ad82f 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -38,8 +38,7 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf,
-							   TimeLineID *pageTLI);
+							   char *readBuf);
 
 typedef struct
 {
@@ -99,9 +98,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set *pageTLI to the TLI of the file the page was
-	 * read from.  It is currently used only for error reporting purposes, to
-	 * reconstruct the name of the WAL file where an error occurred.
+	 * The callback shall set ->readPageTLI to the TLI of the file the page
+	 * was read from.
 	 */
 	XLogPageReadCB read_page;
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 4105b59904..4fb305bafd 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,12 +47,11 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page,
-								 TimeLineID *pageTLI);
+								 XLogRecPtr targetRecPtr, char *cur_page);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 									  XLogRecPtr wantPage, uint32 wantLength);
-
 #endif
diff --git a/src/include/replication/logicalfuncs.h b/src/include/replication/logicalfuncs.h
index a9c178a9e6..012096f183 100644
--- a/src/include/replication/logicalfuncs.h
+++ b/src/include/replication/logicalfuncs.h
@@ -14,6 +14,6 @@
 extern int	logical_read_local_xlog_page(XLogReaderState *state,
 										 XLogRecPtr targetPagePtr,
 										 int reqLen, XLogRecPtr targetRecPtr,
-										 char *cur_page, TimeLineID *pageTLI);
+										 char *cur_page);
 
 #endif
-- 
2.22.0

v05-0003-Introduce-XLogSegment-structure.patchtext/x-diffDownload

From a9c5a09eda9c63ab39db7336b04863bead352846 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Sep 2019 11:53:54 +0200
Subject: [PATCH 2/4] Introduce XLogSegment structure.

---
 src/backend/access/transam/xlog.c       |  6 +-
 src/backend/access/transam/xlogreader.c | 57 +++++++++++-------
 src/backend/access/transam/xlogutils.c  | 20 +++----
 src/backend/replication/walsender.c     | 78 ++++++++++++-------------
 src/bin/pg_rewind/parsexlog.c           |  2 +-
 src/bin/pg_waldump/pg_waldump.c         |  3 +
 src/include/access/xlogreader.h         | 32 ++++++----
 7 files changed, 111 insertions(+), 87 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cd948dbefc..c5bfda74f0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4295,7 +4295,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLByteToSeg(xlogreader->latestPagePtr, segno, wal_segment_size);
 			offset = XLogSegmentOffset(xlogreader->latestPagePtr,
 									   wal_segment_size);
-			XLogFileName(fname, xlogreader->readPageTLI, segno,
+			XLogFileName(fname, xlogreader->seg.tli, segno,
 						 wal_segment_size);
 			ereport(emode_for_corrupt_record(emode,
 											 RecPtr ? RecPtr : EndRecPtr),
@@ -7354,7 +7354,7 @@ StartupXLOG(void)
 	 * and we were reading the old WAL from a segment belonging to a higher
 	 * timeline.
 	 */
-	EndOfLogTLI = xlogreader->readPageTLI;
+	EndOfLogTLI = xlogreader->seg.tli;
 
 	/*
 	 * Complain if we did not roll forward far enough to render the backup
@@ -11639,7 +11639,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	xlogreader->readPageTLI = curFileTLI;
+	xlogreader->seg.tli = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 2184f4291d..7b4ec81493 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -96,7 +96,9 @@ XLogReaderAllocate(int wal_segment_size, XLogPageReadCB pagereadfunc,
 		return NULL;
 	}
 
-	state->wal_segment_size = wal_segment_size;
+	/* Initialize segment pointer. */
+	XLogSegmentInit(&state->seg, wal_segment_size);
+
 	state->read_page = pagereadfunc;
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
@@ -490,8 +492,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->wal_segment_size - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->wal_segment_size);
+		state->EndRecPtr += state->seg.size - 1;
+		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->seg.size);
 	}
 
 	if (DecodeXLogRecord(state, record, errormsg))
@@ -533,12 +535,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 	Assert((pageptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(pageptr, targetSegNo, state->wal_segment_size);
-	targetPageOff = XLogSegmentOffset(pageptr, state->wal_segment_size);
+	XLByteToSeg(pageptr, targetSegNo, state->seg.size);
+	targetPageOff = XLogSegmentOffset(pageptr, state->seg.size);
 
 	/* check whether we have all the requested data already */
-	if (targetSegNo == state->readSegNo && targetPageOff == state->readOff &&
-		reqLen <= state->readLen)
+	if (targetSegNo == state->seg.num &&
+		targetPageOff == state->seg.off && reqLen <= state->readLen)
 		return state->readLen;
 
 	/*
@@ -553,7 +555,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->readSegNo && targetPageOff != 0)
+	if (targetSegNo != state->seg.num && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
@@ -608,8 +610,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		goto err;
 
 	/* update read state information */
-	state->readSegNo = targetSegNo;
-	state->readOff = targetPageOff;
+	state->seg.num = targetSegNo;
+	state->seg.off = targetPageOff;
 	state->readLen = readLen;
 
 	return readLen;
@@ -625,8 +627,8 @@ err:
 static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
-	state->readSegNo = 0;
-	state->readOff = 0;
+	state->seg.num = 0;
+	state->seg.off = 0;
 	state->readLen = 0;
 }
 
@@ -745,16 +747,16 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	Assert((recptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(recptr, segno, state->wal_segment_size);
-	offset = XLogSegmentOffset(recptr, state->wal_segment_size);
+	XLByteToSeg(recptr, segno, state->seg.size);
+	offset = XLogSegmentOffset(recptr, state->seg.size);
 
-	XLogSegNoOffsetToRecPtr(segno, offset, state->wal_segment_size, recaddr);
+	XLogSegNoOffsetToRecPtr(segno, offset, state->seg.size, recaddr);
 
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid magic number %04X in log segment %s, offset %u",
@@ -768,7 +770,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid info bits %04X in log segment %s, offset %u",
@@ -791,7 +793,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 								  (unsigned long long) state->system_identifier);
 			return false;
 		}
-		else if (longhdr->xlp_seg_size != state->wal_segment_size)
+		else if (longhdr->xlp_seg_size != state->seg.size)
 		{
 			report_invalid_record(state,
 								  "WAL file is from different database system: incorrect segment size in page header");
@@ -808,7 +810,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
@@ -828,7 +830,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "unexpected pageaddr %X/%X in log segment %s, offset %u",
@@ -853,7 +855,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		{
 			char		fname[MAXFNAMELEN];
 
-			XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+			XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 			report_invalid_record(state,
 								  "out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
@@ -997,6 +999,19 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Initialize the passed segment pointer.
+ */
+void
+XLogSegmentInit(XLogSegment *seg, int size)
+{
+	seg->file = -1;
+	seg->num = 0;
+	seg->off = 0;
+	seg->tli = 0;
+	seg->dir = NULL;
+	seg->size = size;
+}
 
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 680bed8278..424bb06919 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -802,8 +802,8 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
 void
 XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 {
-	const XLogRecPtr lastReadPage = state->readSegNo *
-	state->wal_segment_size + state->readOff;
+	const XLogRecPtr lastReadPage = state->seg.num *
+	state->seg.size + state->seg.off;
 
 	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
 	Assert(wantLength <= XLOG_BLCKSZ);
@@ -847,8 +847,8 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
 		state->currTLI != ThisTimeLineID &&
 		state->currTLI != 0 &&
-		((wantPage + wantLength) / state->wal_segment_size) <
-		(state->currTLIValidUntil / state->wal_segment_size))
+		((wantPage + wantLength) / state->seg.size) <
+		(state->currTLIValidUntil / state->seg.size))
 		return;
 
 	/*
@@ -870,11 +870,11 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 		 */
 		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);
 
-		XLogRecPtr	endOfSegment = (((wantPage / state->wal_segment_size) + 1)
-									* state->wal_segment_size) - 1;
+		XLogRecPtr	endOfSegment = (((wantPage / state->seg.size) + 1)
+									* state->seg.size) - 1;
 
-		Assert(wantPage / state->wal_segment_size ==
-			   endOfSegment / state->wal_segment_size);
+		Assert(wantPage / state->seg.size ==
+			   endOfSegment / state->seg.size);
 
 		/*
 		 * Find the timeline of the last LSN on the segment containing
@@ -1022,9 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
 			 XLOG_BLCKSZ);
-	state->readPageTLI = pageTLI;
+	state->seg.tli = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 28d8c31af8..f5630a63cf 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -128,16 +128,7 @@ bool		log_replication_commands = false;
  */
 bool		wake_wal_senders = false;
 
-/*
- * These variables are used similarly to openLogFile/SegNo/Off,
- * but for walsender to read the XLOG.
- */
-static int	sendFile = -1;
-static XLogSegNo sendSegNo = 0;
-static uint32 sendOff = 0;
-
-/* Timeline ID of the currently open file */
-static TimeLineID curFileTimeLine = 0;
+static XLogSegment *sendSeg = NULL;
 
 /*
  * These variables keep track of the state of the timeline we're currently
@@ -285,6 +276,10 @@ InitWalSender(void)
 
 	/* Initialize empty timestamp buffer for lag tracking. */
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
+
+	/* Make sure we can remember the current read position in XLOG. */
+	sendSeg = (XLogSegment *) MemoryContextAlloc(TopMemoryContext, sizeof(XLogSegment));
+	XLogSegmentInit(sendSeg, wal_segment_size);
 }
 
 /*
@@ -301,10 +296,10 @@ WalSndErrorCleanup(void)
 	ConditionVariableCancelSleep();
 	pgstat_report_wait_end();
 
-	if (sendFile >= 0)
+	if (sendSeg->file >= 0)
 	{
-		close(sendFile);
-		sendFile = -1;
+		close(sendSeg->file);
+		sendSeg->file = -1;
 	}
 
 	if (MyReplicationSlot != NULL)
@@ -2384,15 +2379,16 @@ retry:
 
 		startoff = XLogSegmentOffset(recptr, wal_segment_size);
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, wal_segment_size))
+		if (sendSeg->file < 0 ||
+			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
 		{
 			char		path[MAXPGPATH];
 
 			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
+			if (sendSeg->file >= 0)
+				close(sendSeg->file);
 
-			XLByteToSeg(recptr, sendSegNo, wal_segment_size);
+			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
 
 			/*-------
 			 * When reading from a historic timeline, and there is a timeline
@@ -2420,20 +2416,20 @@ retry:
 			 * used portion of the old segment is copied to the new file.
 			 *-------
 			 */
-			curFileTimeLine = sendTimeLine;
+			sendSeg->tli = sendTimeLine;
 			if (sendTimeLineIsHistoric)
 			{
 				XLogSegNo	endSegNo;
 
 				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSegNo == endSegNo)
-					curFileTimeLine = sendTimeLineNextTLI;
+				if (sendSeg->num == endSegNo)
+					sendSeg->tli = sendTimeLineNextTLI;
 			}
 
-			XLogFilePath(path, curFileTimeLine, sendSegNo, wal_segment_size);
+			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
 
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendFile < 0)
+			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+			if (sendSeg->file < 0)
 			{
 				/*
 				 * If the file is not found, assume it's because the standby
@@ -2444,26 +2440,26 @@ retry:
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(curFileTimeLine, sendSegNo))));
+									XLogFileNameP(sendSeg->tli, sendSeg->num))));
 				else
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("could not open file \"%s\": %m",
 									path)));
 			}
-			sendOff = 0;
+			sendSeg->off = 0;
 		}
 
 		/* Need to seek in the file? */
-		if (sendOff != startoff)
+		if (sendSeg->off != startoff)
 		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
+			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(curFileTimeLine, sendSegNo),
+								XLogFileNameP(sendSeg->tli, sendSeg->num),
 								startoff)));
-			sendOff = startoff;
+			sendSeg->off = startoff;
 		}
 
 		/* How many bytes are within this segment? */
@@ -2473,29 +2469,29 @@ retry:
 			segbytes = nbytes;
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
+		readbytes = read(sendSeg->file, p, segbytes);
 		pgstat_report_wait_end();
 		if (readbytes < 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, (Size) segbytes)));
 		}
 		else if (readbytes == 0)
 		{
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, readbytes, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, readbytes, (Size) segbytes)));
 		}
 
 		/* Update state for read */
 		recptr += readbytes;
 
-		sendOff += readbytes;
+		sendSeg->off += readbytes;
 		nbytes -= readbytes;
 		p += readbytes;
 	}
@@ -2526,10 +2522,10 @@ retry:
 		walsnd->needreload = false;
 		SpinLockRelease(&walsnd->mutex);
 
-		if (reload && sendFile >= 0)
+		if (reload && sendSeg->file >= 0)
 		{
-			close(sendFile);
-			sendFile = -1;
+			close(sendSeg->file);
+			sendSeg->file = -1;
 
 			goto retry;
 		}
@@ -2695,9 +2691,9 @@ XLogSendPhysical(void)
 	if (sendTimeLineIsHistoric && sendTimeLineValidUpto <= sentPtr)
 	{
 		/* close the current file. */
-		if (sendFile >= 0)
-			close(sendFile);
-		sendFile = -1;
+		if (sendSeg->file >= 0)
+			close(sendSeg->file);
+		sendSeg->file = -1;
 
 		/* Send CopyDone */
 		pq_putmessage_noblock('c', NULL, 0);
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 0a89f9c02a..e663fec6a7 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -320,7 +320,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->seg.tli = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 40c64a0bbf..a16793bb8b 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1105,6 +1105,9 @@ main(int argc, char **argv)
 	if (!xlogreader_state)
 		fatal_error("out of memory");
 
+	/* Finalize the segment pointer. */
+	xlogreader_state->seg.dir = private.inpath;
+
 	/* first find a valid recptr to start from */
 	first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d64a9ad82f..c2724fff74 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -31,6 +31,20 @@
 
 #include "access/xlogrecord.h"
 
+/*
+ * Position in XLOG file while reading it.
+ */
+typedef struct XLogSegment
+{
+	int			file;			/* segment file descriptor */
+	XLogSegNo	num;			/* segment number */
+	uint32		off;			/* offset in the segment */
+	TimeLineID	tli;			/* timeline ID of the currently open file */
+
+	char	   *dir;			/* directory (only needed by frontends) */
+	int			size;			/* segment size */
+} XLogSegment;
+
 typedef struct XLogReaderState XLogReaderState;
 
 /* Function type definition for the read_page callback */
@@ -76,11 +90,6 @@ struct XLogReaderState
 	 * ----------------------------------------
 	 */
 
-	/*
-	 * Segment size of the to-be-parsed data (mandatory).
-	 */
-	int			wal_segment_size;
-
 	/*
 	 * Data input callback (mandatory).
 	 *
@@ -98,8 +107,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set ->readPageTLI to the TLI of the file the page
-	 * was read from.
+	 * The callback shall set ->seg.tli to the TLI of the file the page was
+	 * read from.
 	 */
 	XLogPageReadCB read_page;
 
@@ -154,10 +163,8 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
-	/* last read segment, segment offset, TLI for data currently in readBuf */
-	XLogSegNo	readSegNo;
-	uint32		readOff;
-	TimeLineID	readPageTLI;
+	/* last read XLOG position for data currently in readBuf */
+	XLogSegment seg;
 
 	/*
 	 * beginning of prior page read, and its TLI.  Doesn't necessarily
@@ -217,6 +224,9 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+
+extern void XLogSegmentInit(XLogSegment *seg, int size);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
-- 
2.22.0

v05-0004-Use-only-xlogreader.c-XLogRead.patchtext/x-diffDownload

From 01716ffe400dc61cbd37acdca4c03a79b0b4e117 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Sep 2019 11:53:55 +0200
Subject: [PATCH 3/4] Use only xlogreader.c:XLogRead()

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.
---
 src/backend/access/transam/xlogreader.c | 128 ++++++++++++++++++++++++
 src/backend/access/transam/xlogutils.c  |  40 ++++++--
 src/backend/replication/walsender.c     | 125 ++++++++++++++++++++++-
 src/bin/pg_waldump/pg_waldump.c         |  59 ++++++++++-
 src/include/access/xlogreader.h         |  19 ++++
 5 files changed, 359 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7b4ec81493..2a80bc5823 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -27,6 +29,7 @@
 
 #ifndef FRONTEND
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -1011,7 +1014,132 @@ XLogSegmentInit(XLogSegment *seg, int size)
 	seg->tli = 0;
 	seg->dir = NULL;
 	seg->size = size;
+	seg->last_req = 0;
+}
+
+/*
+ * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'. If
+ * tli is passed, get the data from timeline *tli. 'pos' is the current
+ * position in the XLOG file and openSegment is a callback that opens the next
+ * segment for reading.
+ *
+ * Returns true if the call succeeded, false if it failed. Caller should check
+ * errno in the case of failure. seg->last_req might also be useful for error
+ * messages.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+bool
+XLogRead(char *buf, XLogRecPtr startptr, Size count,
+		 TimeLineID *tli, XLogSegment *seg, XLogOpenSegment openSegment)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		int			readbytes;
+
+		seg->off = XLogSegmentOffset(recptr, seg->size);
+
+		if (seg->file < 0 ||
+			!XLByteInSeg(recptr, seg->num, seg->size) ||
+			(tli != NULL && *tli != seg->tli))
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (seg->file >= 0)
+				close(seg->file);
+
+			XLByteToSeg(recptr, nextSegNo, seg->size);
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, tli, seg);
+
+			/*
+			 * If the function is called by the XLOG reader, the reader will
+			 * eventually set both "num" and "off". However we need to care
+			 * about them too because the function can also be used directly,
+			 * see walsender.c.
+			 */
+			seg->num = nextSegNo;
+			seg->off = 0;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (seg->size - seg->off))
+			seg->last_req = seg->size - seg->off;
+		else
+			seg->last_req = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/*
+		 * Failure to read the data does not necessarily imply non-zero errno.
+		 * Set it to zero so that caller can distinguish the failure that does
+		 * not affect errno.
+		 */
+		errno = 0;
+
+		readbytes = pg_pread(seg->file, p, seg->last_req, seg->off);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+			return false;
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+
+		/*
+		 * If the function is called by the XLOG reader, the reader will
+		 * eventually set this field. However we need to care about it too
+		 * because the function can also be used directly (see walsender.c).
+		 */
+		seg->off += readbytes;
+	}
+
+	return true;
+}
+
+#ifndef FRONTEND
+/*
+ * Backend-specific code to handle errors encountered by XLogRead().
+ */
+void
+XLogReadProcessError(XLogSegment *seg)
+{
+	if (errno != 0)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						XLogFileNameP(seg->tli, seg->num), seg->off,
+						(Size) seg->last_req)));
+	}
+	else
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: length %zu",
+						XLogFileNameP(seg->tli, seg->num), seg->off,
+						(Size) seg->last_req)));
+	}
 }
+#endif
 
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 424bb06919..83b014e04c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -896,6 +896,35 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+read_local_xlog_page_open_segment(XLogSegNo nextSegNo, TimeLineID *tli,
+								  XLogSegment *seg)
+{
+	char		path[MAXPGPATH];
+
+	XLogFilePath(path, *tli, nextSegNo, seg->size);
+	seg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (seg->file < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	seg->tli = *tli;
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -1022,10 +1051,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
-			 XLOG_BLCKSZ);
-	state->seg.tli = pageTLI;
-
+	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, &pageTLI,
+				  &state->seg, read_local_xlog_page_open_segment))
+		XLogReadProcessError(&state->seg);
 	/* number of valid bytes in the buffer */
 	return count;
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f5630a63cf..0685c320b4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -247,7 +247,9 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli,
+							  XLogSegment *seg);
+static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -782,7 +784,9 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, NULL, sendSeg,
+				  WalSndOpenSegment))
+		XLogReadProcessError(sendSeg);
 
 	return count;
 }
@@ -2359,7 +2363,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2532,6 +2536,76 @@ retry:
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+void
+WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+{
+	char		path[MAXPGPATH];
+
+	/*
+	 * The timeline is determined below, caller should not do anything about
+	 * it.
+	 */
+	Assert(tli == NULL);
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	seg->tli = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, seg->size);
+		if (seg->num == endSegNo)
+			seg->tli = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, seg->tli, nextSegNo, seg->size);
+	seg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (seg->file < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(seg->tli, seg->num))));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+}
+
 /*
  * Send out the WAL in its normal physical/stored form.
  *
@@ -2549,6 +2623,7 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2764,7 +2839,49 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if (!XLogRead(&output_message.data[output_message.len], startptr, nbytes,
+				  NULL,			/* WalSndOpenSegment will determine TLI */
+				  sendSeg,
+				  WalSndOpenSegment))
+		XLogReadProcessError(sendSeg);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(startptr, segno, wal_segment_size);
+	CheckXLogRemoved(segno, ThisTimeLineID);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->file >= 0)
+		{
+			close(sendSeg->file);
+			sendSeg->file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a16793bb8b..3e09519f88 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -296,6 +296,45 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 		fatal_error("could not find any WAL file");
 }
 
+static void
+XLogDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+{
+	char		fname[MAXPGPATH];
+	int			tries;
+
+	XLogFileName(fname, *tli, nextSegNo, seg->size);
+
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
+	{
+		seg->file = open_file_in_directory(seg->dir, fname);
+		if (seg->file >= 0)
+			break;
+		if (errno == ENOENT)
+		{
+			int			save_errno = errno;
+
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
+			errno = save_errno;
+			continue;
+		}
+		/* Any other error, fall through and fail */
+		break;
+	}
+
+	if (seg->file < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
+	seg->tli = *tli;
+}
+
 /*
  * Read count bytes from a segment file in the specified directory, for the
  * given timeline, containing the specified record pointer; store the data in
@@ -441,8 +480,24 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(private->inpath, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if (!XLogRead(readBuff, targetPagePtr, count, &private->timeline,
+				  &state->seg, XLogDumpOpenSegment))
+	{
+		XLogSegment *seg = &state->seg;
+		int			err = errno;
+		char		fname[MAXPGPATH];
+		int			save_errno = errno;
+
+		XLogFileName(fname, seg->tli, seg->num, seg->size);
+		errno = save_errno;
+
+		if (errno != 0)
+			fatal_error("could not read from log file %s, offset %u, length %zu: %s",
+						fname, seg->off, (Size) seg->last_req, strerror(err));
+		else
+			fatal_error("could not read from log file %s, offset %u: length: %zu",
+						fname, seg->off, (Size) seg->last_req);
+	}
 
 	return count;
 }
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index c2724fff74..4731023ccc 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -43,6 +43,7 @@ typedef struct XLogSegment
 
 	char	   *dir;			/* directory (only needed by frontends) */
 	int			size;			/* segment size */
+	int			last_req;		/* the amount of data requested last time */
 } XLogSegment;
 
 typedef struct XLogReaderState XLogReaderState;
@@ -225,7 +226,25 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/*
+ * Callback to open the specified XLOG segment nextSegNo in timeline *tli for
+ * reading, and assign the descriptor to ->file. BasicOpenFile() is the
+ * preferred way to open the segment file in backend code, whereas open(2)
+ * should be used in frontend.
+ *
+ * If NULL is passed for tli, the callback must determine the timeline
+ * itself. In any case it's supposed to eventually set ->tli.
+ */
+typedef void (*XLogOpenSegment) (XLogSegNo nextSegNo, TimeLineID *tli,
+								 XLogSegment *seg);
+
 extern void XLogSegmentInit(XLogSegment *seg, int size);
+extern bool XLogRead(char *buf, XLogRecPtr startptr, Size count,
+					 TimeLineID *tli, XLogSegment *seg,
+					 XLogOpenSegment openSegment);
+#ifndef FRONTEND
+void		XLogReadProcessError(XLogSegment *seg);
+#endif
 
 /* Functions for decoding an XLogRecord */
 
-- 
2.22.0

v05-0005-Remove-the-old-implemenations-of-XLogRead.patchtext/x-diffDownload

From e926fb9d4b28c630693eb80123a82d5cefb06920 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 9 Sep 2019 11:53:55 +0200
Subject: [PATCH 4/4] Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the XLogOpenSegment callback) is added
nearby at the same time (the addition and removal of code can get mixed in the
diff).
---
 src/backend/access/transam/xlogutils.c | 125 -----------------
 src/backend/replication/walsender.c    | 186 -------------------------
 src/bin/pg_waldump/pg_waldump.c        | 123 ----------------
 3 files changed, 434 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 83b014e04c..5b0ba39f17 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -17,14 +17,11 @@
  */
 #include "postgres.h"
 
-#include <unistd.h>
-
 #include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -639,128 +636,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0685c320b4..072d46d853 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -249,7 +249,6 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli,
 							  XLogSegment *seg);
-static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -2351,191 +2350,6 @@ WalSndKill(int code, Datum arg)
 	SpinLockRelease(&walsnd->mutex);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, wal_segment_size);
-
-		if (sendSeg->file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->file >= 0)
-				close(sendSeg->file);
-
-			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSeg->num == endSegNo)
-					sendSeg->tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
-
-			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->tli, sendSeg->num))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->off != startoff)
-		{
-			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->tli, sendSeg->num),
-								startoff)));
-			sendSeg->off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (wal_segment_size - startoff))
-			segbytes = wal_segment_size - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, wal_segment_size);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->file >= 0)
-		{
-			close(sendSeg->file);
-			sendSeg->file = -1;
-
-			goto retry;
-		}
-	}
-}
-
 /*
  * Callback for XLogRead() to open the next segment.
  */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 3e09519f88..cd44090d36 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -14,7 +14,6 @@
 
 #include <dirent.h>
 #include <sys/stat.h>
-#include <unistd.h>
 
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
@@ -335,128 +334,6 @@ XLogDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
 	seg->tli = *tli;
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * XLogReader read_page callback
  */
-- 
2.22.0

#19

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Antonin Houska (#18)

1 attachment(s)

Re: Attempt to consolidate reading of XLOG page

I was confused by the struct name XLogSegment -- the struct is used to
represent a WAL segment while it's kept open, rather than just a WAL
segment in abstract. Also, now that we've renamed everything to use the
term WAL, it seems wrong to use the name XLog for new structs. I
propose the name WALOpenSegment for the struct, which solves both
problems. (Its initializer function would get the name
WALOpenSegmentInit.)

Now, the patch introduces a callback for XLogRead, the type of which is
called XLogOpenSegment. If we rename it from XLog to WAL, both names
end up the same. I propose to rename the function type to
WALSegmentOpen, which in a "noun-verb" view of the world, represents the
action of opening a WAL segment.

I attach a patch for all this renaming, on top of your series.

I wonder if each of those WALSegmentOpen callbacks should reset [at
least some members of] the struct; they're already in charge of setting
->file, and apparently we're leaving the responsibility of setting the
rest of the members to XLogRead. That seems weird. Maybe we should say
that the CB should only open the segment and not touch the struct at all
and XLogRead is in charge of everything. Perhaps the other way around
-- the CB should set everything correctly ... I'm not sure which is
best. But having half here and half there seems a recipe for confusion
and bugs.

Another thing I didn't like much is that everything seems to assume that
the only error possible from XLogRead is a read error. Maybe that's
okay, because it seems to be the current reality, but it seemed odd.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

xlogopen-rename.patchtext/x-diff; charset=us-asciiDownload

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 2a80bc5823..3e03d72a18 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -100,7 +100,7 @@ XLogReaderAllocate(int wal_segment_size, XLogPageReadCB pagereadfunc,
 	}
 
 	/* Initialize segment pointer. */
-	XLogSegmentInit(&state->seg, wal_segment_size);
+	WALOpenSegmentInit(&state->seg, wal_segment_size);
 
 	state->read_page = pagereadfunc;
 	/* system_identifier initialized to zeroes above */
@@ -1006,7 +1006,7 @@ out:
  * Initialize the passed segment pointer.
  */
 void
-XLogSegmentInit(XLogSegment *seg, int size)
+WALOpenSegmentInit(WALOpenSegment *seg, int size)
 {
 	seg->file = -1;
 	seg->num = 0;
@@ -1032,7 +1032,7 @@ XLogSegmentInit(XLogSegment *seg, int size)
  */
 bool
 XLogRead(char *buf, XLogRecPtr startptr, Size count,
-		 TimeLineID *tli, XLogSegment *seg, XLogOpenSegment openSegment)
+		 TimeLineID *tli, WALOpenSegment *seg, WALSegmentOpen openSegment)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -1120,7 +1120,7 @@ XLogRead(char *buf, XLogRecPtr startptr, Size count,
  * Backend-specific code to handle errors encountered by XLogRead().
  */
 void
-XLogReadProcessError(XLogSegment *seg)
+XLogReadProcessError(WALOpenSegment *seg)
 {
 	if (errno != 0)
 	{
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5b0ba39f17..2b9758c3ce 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -776,13 +776,15 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
  */
 static void
 read_local_xlog_page_open_segment(XLogSegNo nextSegNo, TimeLineID *tli,
-								  XLogSegment *seg)
+								  WALOpenSegment *seg)
 {
 	char		path[MAXPGPATH];
 
 	XLogFilePath(path, *tli, nextSegNo, seg->size);
 	seg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
 
+	/* XXX should we clear the other fields of WALOpenSegment?? */
+
 	if (seg->file < 0)
 	{
 		if (errno == ENOENT)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 072d46d853..a5ca834020 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -128,7 +128,7 @@ bool		log_replication_commands = false;
  */
 bool		wake_wal_senders = false;
 
-static XLogSegment *sendSeg = NULL;
+static WALOpenSegment *sendSeg = NULL;
 
 /*
  * These variables keep track of the state of the timeline we're currently
@@ -248,7 +248,7 @@ static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli,
-							  XLogSegment *seg);
+							  WALOpenSegment *seg);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -279,8 +279,9 @@ InitWalSender(void)
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
 
 	/* Make sure we can remember the current read position in XLOG. */
-	sendSeg = (XLogSegment *) MemoryContextAlloc(TopMemoryContext, sizeof(XLogSegment));
-	XLogSegmentInit(sendSeg, wal_segment_size);
+	sendSeg = (WALOpenSegment *)
+		MemoryContextAlloc(TopMemoryContext, sizeof(WALOpenSegment));
+	WALOpenSegmentInit(sendSeg, wal_segment_size);
 }
 
 /*
@@ -2354,7 +2355,7 @@ WalSndKill(int code, Datum arg)
  * Callback for XLogRead() to open the next segment.
  */
 void
-WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+WalSndOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, WALOpenSegment *seg)
 {
 	char		path[MAXPGPATH];
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index cd44090d36..8466890748 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -296,7 +296,7 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 }
 
 static void
-XLogDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, XLogSegment *seg)
+WALDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli, WALOpenSegment *seg)
 {
 	char		fname[MAXPGPATH];
 	int			tries;
@@ -358,9 +358,9 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	}
 
 	if (!XLogRead(readBuff, targetPagePtr, count, &private->timeline,
-				  &state->seg, XLogDumpOpenSegment))
+				  &state->seg, WALDumpOpenSegment))
 	{
-		XLogSegment *seg = &state->seg;
+		WALOpenSegment *seg = &state->seg;
 		int			err = errno;
 		char		fname[MAXPGPATH];
 		int			save_errno = errno;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4731023ccc..c93b255d84 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -32,9 +32,9 @@
 #include "access/xlogrecord.h"
 
 /*
- * Position in XLOG file while reading it.
+ * WALOpenSegment represents a WAL segment being read.
  */
-typedef struct XLogSegment
+typedef struct WALOpenSegment
 {
 	int			file;			/* segment file descriptor */
 	XLogSegNo	num;			/* segment number */
@@ -44,7 +44,7 @@ typedef struct XLogSegment
 	char	   *dir;			/* directory (only needed by frontends) */
 	int			size;			/* segment size */
 	int			last_req;		/* the amount of data requested last time */
-} XLogSegment;
+} WALOpenSegment;
 
 typedef struct XLogReaderState XLogReaderState;
 
@@ -165,7 +165,7 @@ struct XLogReaderState
 	uint32		readLen;
 
 	/* last read XLOG position for data currently in readBuf */
-	XLogSegment seg;
+	WALOpenSegment seg;
 
 	/*
 	 * beginning of prior page read, and its TLI.  Doesn't necessarily
@@ -227,7 +227,7 @@ extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
 /*
- * Callback to open the specified XLOG segment nextSegNo in timeline *tli for
+ * Callback to open the specified WAL segment nextSegNo in timeline *tli for
  * reading, and assign the descriptor to ->file. BasicOpenFile() is the
  * preferred way to open the segment file in backend code, whereas open(2)
  * should be used in frontend.
@@ -235,15 +235,15 @@ extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
  * If NULL is passed for tli, the callback must determine the timeline
  * itself. In any case it's supposed to eventually set ->tli.
  */
-typedef void (*XLogOpenSegment) (XLogSegNo nextSegNo, TimeLineID *tli,
-								 XLogSegment *seg);
+typedef void (*WALSegmentOpen) (XLogSegNo nextSegNo, TimeLineID *tli,
+								WALOpenSegment *seg);
 
-extern void XLogSegmentInit(XLogSegment *seg, int size);
+extern void WALOpenSegmentInit(WALOpenSegment *seg, int size);
 extern bool XLogRead(char *buf, XLogRecPtr startptr, Size count,
-					 TimeLineID *tli, XLogSegment *seg,
-					 XLogOpenSegment openSegment);
+					 TimeLineID *tli, WALOpenSegment *seg,
+					 WALSegmentOpen openSegment);
 #ifndef FRONTEND
-void		XLogReadProcessError(XLogSegment *seg);
+void		XLogReadProcessError(WALOpenSegment *seg);
 #endif
 
 /* Functions for decoding an XLogRecord */

#20

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Alvaro Herrera (#19)

5 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I was confused by the struct name XLogSegment -- the struct is used to
represent a WAL segment while it's kept open, rather than just a WAL
segment in abstract. Also, now that we've renamed everything to use the
term WAL, it seems wrong to use the name XLog for new structs. I
propose the name WALOpenSegment for the struct, which solves both
problems. (Its initializer function would get the name
WALOpenSegmentInit.)

Now, the patch introduces a callback for XLogRead, the type of which is
called XLogOpenSegment. If we rename it from XLog to WAL, both names
end up the same. I propose to rename the function type to
WALSegmentOpen, which in a "noun-verb" view of the world, represents the
action of opening a WAL segment.

I attach a patch for all this renaming, on top of your series.

ok, thanks.

In addition I renamed WalSndOpenSegment() to WalSndSegmentOpen() and
read_local_xlog_page_open_segment() to read_local_xlog_page_segment_open()

I wonder if each of those WALSegmentOpen callbacks should reset [at
least some members of] the struct; they're already in charge of setting
->file, and apparently we're leaving the responsibility of setting the
rest of the members to XLogRead. That seems weird. Maybe we should say
that the CB should only open the segment and not touch the struct at all
and XLogRead is in charge of everything. Perhaps the other way around
-- the CB should set everything correctly ... I'm not sure which is
best. But having half here and half there seems a recipe for confusion
and bugs.

ok, I've changed the CB signature. Now it receives poiners to the two
variables that it can change while the "seg" argument is documented as
read-only. To indicate that the CB should determine timeline itself, I
introduced a new constant InvalidTimeLineID, see the 0004 part.

Another thing I didn't like much is that everything seems to assume that
the only error possible from XLogRead is a read error. Maybe that's
okay, because it seems to be the current reality, but it seemed odd.

In this case I only moved the ereport() code from XLogRead() away (so that
both backend and frontend can call the function). Given that the code to open
WAL segment is now in the callbacks, the only thing that XLogRead() can
ereport is that read() failed. BTW, I introduced one more structure,
XLogReadError, in this patch version. I think it's better than adding
error-specific fields to the WALOpenSegment structure.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v06-0002-Remove-TLI-from-some-argument-lists.patchtext/x-diffDownload

From 674fa97ef9df8b1fe875139aa7b43f605255d8cd Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 23 Sep 2019 07:40:49 +0200
Subject: [PATCH 2/6] Remove TLI from some argument lists.

The timeline information is available to caller via XLogReaderState. Now that
XLogRead() is gonna be (sometimes) responsible for determining the TLI, it
would have to be added the (TimeLineID *) argument too, just to be consistent
with the current coding style. Since XLogRead() updates also other
position-specific fields of XLogReaderState, it seems simpler if we remove the
output argument from XLogPageReadCB and always report the TLI via
XLogReaderState.
---
 src/backend/access/transam/xlog.c              |  7 +++----
 src/backend/access/transam/xlogreader.c        |  6 +++---
 src/backend/access/transam/xlogutils.c         | 11 ++++++-----
 src/backend/replication/logical/logicalfuncs.c |  4 ++--
 src/backend/replication/walsender.c            |  2 +-
 src/bin/pg_rewind/parsexlog.c                  |  8 +++-----
 src/bin/pg_waldump/pg_waldump.c                |  2 +-
 src/include/access/xlogreader.h                |  8 +++-----
 src/include/access/xlogutils.h                 |  5 ++---
 src/include/replication/logicalfuncs.h         |  2 +-
 10 files changed, 25 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b7ff004234..7a89dfed7f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -885,8 +885,7 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 int source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-						 TimeLineID *readTLI);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
@@ -11523,7 +11522,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *readTLI)
+			 XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11640,7 +11639,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	*readTLI = curFileTLI;
+	xlogreader->readPageTLI = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a66e3324b1..2184f4291d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -559,7 +559,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 
@@ -577,7 +577,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 							   state->currRecPtr,
-							   state->readBuf, &state->readPageTLI);
+							   state->readBuf);
 	if (readLen < 0)
 		goto err;
 
@@ -596,7 +596,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 	}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 1fc39333f1..680bed8278 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -909,12 +909,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
-					 TimeLineID *pageTLI)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	read_upto,
 				loc;
 	int			count;
+	TimeLineID	pageTLI;
 
 	loc = targetPagePtr + reqLen;
 
@@ -934,7 +934,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
 
-		*pageTLI = ThisTimeLineID;
+		pageTLI = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -991,7 +991,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			 * nothing cares so long as the timeline doesn't go backwards.  We
 			 * should read the page header instead; FIXME someday.
 			 */
-			*pageTLI = state->currTLI;
+			pageTLI = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1022,8 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
 			 XLOG_BLCKSZ);
+	state->readPageTLI = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index d974400d6e..d1cf80d441 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -116,10 +116,10 @@ check_permissions(void)
 
 int
 logical_read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	return read_local_xlog_page(state, targetPagePtr, reqLen,
-								targetRecPtr, cur_page, pageTLI);
+								targetRecPtr, cur_page);
 }
 
 /*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 23870a25a5..28d8c31af8 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -763,7 +763,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+					   XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 63c3879ead..33e2ba2a03 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -49,8 +49,7 @@ typedef struct XLogPageReadPrivate
 
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-							   TimeLineID *pageTLI);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -237,8 +236,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogReader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-				   TimeLineID *pageTLI)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -321,7 +319,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	*pageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index b95d467805..40c64a0bbf 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -423,7 +423,7 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
  */
 static int
 XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				 XLogRecPtr targetPtr, char *readBuff, TimeLineID *curFileTLI)
+				 XLogRecPtr targetPtr, char *readBuff)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 735b1bd2fd..d64a9ad82f 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -38,8 +38,7 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf,
-							   TimeLineID *pageTLI);
+							   char *readBuf);
 
 typedef struct
 {
@@ -99,9 +98,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set *pageTLI to the TLI of the file the page was
-	 * read from.  It is currently used only for error reporting purposes, to
-	 * reconstruct the name of the WAL file where an error occurred.
+	 * The callback shall set ->readPageTLI to the TLI of the file the page
+	 * was read from.
 	 */
 	XLogPageReadCB read_page;
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 4105b59904..4fb305bafd 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,12 +47,11 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page,
-								 TimeLineID *pageTLI);
+								 XLogRecPtr targetRecPtr, char *cur_page);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 									  XLogRecPtr wantPage, uint32 wantLength);
-
 #endif
diff --git a/src/include/replication/logicalfuncs.h b/src/include/replication/logicalfuncs.h
index a9c178a9e6..012096f183 100644
--- a/src/include/replication/logicalfuncs.h
+++ b/src/include/replication/logicalfuncs.h
@@ -14,6 +14,6 @@
 extern int	logical_read_local_xlog_page(XLogReaderState *state,
 										 XLogRecPtr targetPagePtr,
 										 int reqLen, XLogRecPtr targetRecPtr,
-										 char *cur_page, TimeLineID *pageTLI);
+										 char *cur_page);
 
 #endif
-- 
2.20.1

v06-0003-Introduce-WALOpenSegment-structure.patchtext/x-diffDownload

From 0022a47048ee65b04b2a3f5f53715e04459baa83 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 23 Sep 2019 07:40:49 +0200
Subject: [PATCH 3/6] Introduce WALOpenSegment structure.

---
 src/backend/access/transam/xlog.c       |  6 +-
 src/backend/access/transam/xlogreader.c | 59 +++++++++++-------
 src/backend/access/transam/xlogutils.c  | 20 +++----
 src/backend/replication/walsender.c     | 79 ++++++++++++-------------
 src/bin/pg_rewind/parsexlog.c           |  2 +-
 src/bin/pg_waldump/pg_waldump.c         |  3 +
 src/include/access/xlogreader.h         | 34 +++++++----
 7 files changed, 116 insertions(+), 87 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7a89dfed7f..031733aba8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4295,7 +4295,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLByteToSeg(xlogreader->latestPagePtr, segno, wal_segment_size);
 			offset = XLogSegmentOffset(xlogreader->latestPagePtr,
 									   wal_segment_size);
-			XLogFileName(fname, xlogreader->readPageTLI, segno,
+			XLogFileName(fname, xlogreader->seg.tli, segno,
 						 wal_segment_size);
 			ereport(emode_for_corrupt_record(emode,
 											 RecPtr ? RecPtr : EndRecPtr),
@@ -7354,7 +7354,7 @@ StartupXLOG(void)
 	 * and we were reading the old WAL from a segment belonging to a higher
 	 * timeline.
 	 */
-	EndOfLogTLI = xlogreader->readPageTLI;
+	EndOfLogTLI = xlogreader->seg.tli;
 
 	/*
 	 * Complain if we did not roll forward far enough to render the backup
@@ -11639,7 +11639,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	xlogreader->readPageTLI = curFileTLI;
+	xlogreader->seg.tli = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 2184f4291d..4de5530b3e 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -96,7 +96,9 @@ XLogReaderAllocate(int wal_segment_size, XLogPageReadCB pagereadfunc,
 		return NULL;
 	}
 
-	state->wal_segment_size = wal_segment_size;
+	/* Initialize segment pointer. */
+	WALOpenSegmentInit(&state->seg, wal_segment_size);
+
 	state->read_page = pagereadfunc;
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
@@ -490,8 +492,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->wal_segment_size - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->wal_segment_size);
+		state->EndRecPtr += state->seg.size - 1;
+		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->seg.size);
 	}
 
 	if (DecodeXLogRecord(state, record, errormsg))
@@ -533,12 +535,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 	Assert((pageptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(pageptr, targetSegNo, state->wal_segment_size);
-	targetPageOff = XLogSegmentOffset(pageptr, state->wal_segment_size);
+	XLByteToSeg(pageptr, targetSegNo, state->seg.size);
+	targetPageOff = XLogSegmentOffset(pageptr, state->seg.size);
 
 	/* check whether we have all the requested data already */
-	if (targetSegNo == state->readSegNo && targetPageOff == state->readOff &&
-		reqLen <= state->readLen)
+	if (targetSegNo == state->seg.num &&
+		targetPageOff == state->seg.off && reqLen <= state->readLen)
 		return state->readLen;
 
 	/*
@@ -553,7 +555,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->readSegNo && targetPageOff != 0)
+	if (targetSegNo != state->seg.num && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
@@ -608,8 +610,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		goto err;
 
 	/* update read state information */
-	state->readSegNo = targetSegNo;
-	state->readOff = targetPageOff;
+	state->seg.num = targetSegNo;
+	state->seg.off = targetPageOff;
 	state->readLen = readLen;
 
 	return readLen;
@@ -625,8 +627,8 @@ err:
 static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
-	state->readSegNo = 0;
-	state->readOff = 0;
+	state->seg.num = 0;
+	state->seg.off = 0;
 	state->readLen = 0;
 }
 
@@ -745,16 +747,16 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	Assert((recptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(recptr, segno, state->wal_segment_size);
-	offset = XLogSegmentOffset(recptr, state->wal_segment_size);
+	XLByteToSeg(recptr, segno, state->seg.size);
+	offset = XLogSegmentOffset(recptr, state->seg.size);
 
-	XLogSegNoOffsetToRecPtr(segno, offset, state->wal_segment_size, recaddr);
+	XLogSegNoOffsetToRecPtr(segno, offset, state->seg.size, recaddr);
 
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid magic number %04X in log segment %s, offset %u",
@@ -768,7 +770,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "invalid info bits %04X in log segment %s, offset %u",
@@ -791,7 +793,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 								  (unsigned long long) state->system_identifier);
 			return false;
 		}
-		else if (longhdr->xlp_seg_size != state->wal_segment_size)
+		else if (longhdr->xlp_seg_size != state->seg.size)
 		{
 			report_invalid_record(state,
 								  "WAL file is from different database system: incorrect segment size in page header");
@@ -808,7 +810,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
@@ -828,7 +830,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 		report_invalid_record(state,
 							  "unexpected pageaddr %X/%X in log segment %s, offset %u",
@@ -853,7 +855,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		{
 			char		fname[MAXFNAMELEN];
 
-			XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+			XLogFileName(fname, state->seg.tli, segno, state->seg.size);
 
 			report_invalid_record(state,
 								  "out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
@@ -997,6 +999,21 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Initialize the passed segment pointer.
+ */
+void
+WALOpenSegmentInit(WALOpenSegment *seg, int size)
+{
+	seg->file = -1;
+	seg->num = 0;
+	seg->off = 0;
+	seg->tli = 0;
+	seg->size = size;
+#ifdef FRONTEND
+	seg->dir = NULL;
+#endif
+}
 
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 680bed8278..424bb06919 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -802,8 +802,8 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
 void
 XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 {
-	const XLogRecPtr lastReadPage = state->readSegNo *
-	state->wal_segment_size + state->readOff;
+	const XLogRecPtr lastReadPage = state->seg.num *
+	state->seg.size + state->seg.off;
 
 	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
 	Assert(wantLength <= XLOG_BLCKSZ);
@@ -847,8 +847,8 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
 		state->currTLI != ThisTimeLineID &&
 		state->currTLI != 0 &&
-		((wantPage + wantLength) / state->wal_segment_size) <
-		(state->currTLIValidUntil / state->wal_segment_size))
+		((wantPage + wantLength) / state->seg.size) <
+		(state->currTLIValidUntil / state->seg.size))
 		return;
 
 	/*
@@ -870,11 +870,11 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 		 */
 		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);
 
-		XLogRecPtr	endOfSegment = (((wantPage / state->wal_segment_size) + 1)
-									* state->wal_segment_size) - 1;
+		XLogRecPtr	endOfSegment = (((wantPage / state->seg.size) + 1)
+									* state->seg.size) - 1;
 
-		Assert(wantPage / state->wal_segment_size ==
-			   endOfSegment / state->wal_segment_size);
+		Assert(wantPage / state->seg.size ==
+			   endOfSegment / state->seg.size);
 
 		/*
 		 * Find the timeline of the last LSN on the segment containing
@@ -1022,9 +1022,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, pageTLI, targetPagePtr,
+	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
 			 XLOG_BLCKSZ);
-	state->readPageTLI = pageTLI;
+	state->seg.tli = pageTLI;
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 28d8c31af8..a617e20ab6 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -128,16 +128,7 @@ bool		log_replication_commands = false;
  */
 bool		wake_wal_senders = false;
 
-/*
- * These variables are used similarly to openLogFile/SegNo/Off,
- * but for walsender to read the XLOG.
- */
-static int	sendFile = -1;
-static XLogSegNo sendSegNo = 0;
-static uint32 sendOff = 0;
-
-/* Timeline ID of the currently open file */
-static TimeLineID curFileTimeLine = 0;
+static WALOpenSegment *sendSeg = NULL;
 
 /*
  * These variables keep track of the state of the timeline we're currently
@@ -285,6 +276,11 @@ InitWalSender(void)
 
 	/* Initialize empty timestamp buffer for lag tracking. */
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
+
+	/* Make sure we can remember the current read position in XLOG. */
+	sendSeg = (WALOpenSegment *)
+		MemoryContextAlloc(TopMemoryContext, sizeof(WALOpenSegment));
+	WALOpenSegmentInit(sendSeg, wal_segment_size);
 }
 
 /*
@@ -301,10 +297,10 @@ WalSndErrorCleanup(void)
 	ConditionVariableCancelSleep();
 	pgstat_report_wait_end();
 
-	if (sendFile >= 0)
+	if (sendSeg->file >= 0)
 	{
-		close(sendFile);
-		sendFile = -1;
+		close(sendSeg->file);
+		sendSeg->file = -1;
 	}
 
 	if (MyReplicationSlot != NULL)
@@ -2384,15 +2380,16 @@ retry:
 
 		startoff = XLogSegmentOffset(recptr, wal_segment_size);
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, wal_segment_size))
+		if (sendSeg->file < 0 ||
+			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
 		{
 			char		path[MAXPGPATH];
 
 			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
+			if (sendSeg->file >= 0)
+				close(sendSeg->file);
 
-			XLByteToSeg(recptr, sendSegNo, wal_segment_size);
+			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
 
 			/*-------
 			 * When reading from a historic timeline, and there is a timeline
@@ -2420,20 +2417,20 @@ retry:
 			 * used portion of the old segment is copied to the new file.
 			 *-------
 			 */
-			curFileTimeLine = sendTimeLine;
+			sendSeg->tli = sendTimeLine;
 			if (sendTimeLineIsHistoric)
 			{
 				XLogSegNo	endSegNo;
 
 				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSegNo == endSegNo)
-					curFileTimeLine = sendTimeLineNextTLI;
+				if (sendSeg->num == endSegNo)
+					sendSeg->tli = sendTimeLineNextTLI;
 			}
 
-			XLogFilePath(path, curFileTimeLine, sendSegNo, wal_segment_size);
+			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
 
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendFile < 0)
+			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+			if (sendSeg->file < 0)
 			{
 				/*
 				 * If the file is not found, assume it's because the standby
@@ -2444,26 +2441,26 @@ retry:
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(curFileTimeLine, sendSegNo))));
+									XLogFileNameP(sendSeg->tli, sendSeg->num))));
 				else
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("could not open file \"%s\": %m",
 									path)));
 			}
-			sendOff = 0;
+			sendSeg->off = 0;
 		}
 
 		/* Need to seek in the file? */
-		if (sendOff != startoff)
+		if (sendSeg->off != startoff)
 		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
+			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(curFileTimeLine, sendSegNo),
+								XLogFileNameP(sendSeg->tli, sendSeg->num),
 								startoff)));
-			sendOff = startoff;
+			sendSeg->off = startoff;
 		}
 
 		/* How many bytes are within this segment? */
@@ -2473,29 +2470,29 @@ retry:
 			segbytes = nbytes;
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
+		readbytes = read(sendSeg->file, p, segbytes);
 		pgstat_report_wait_end();
 		if (readbytes < 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, (Size) segbytes)));
 		}
 		else if (readbytes == 0)
 		{
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, readbytes, (Size) segbytes)));
+							XLogFileNameP(sendSeg->tli, sendSeg->num),
+							sendSeg->off, readbytes, (Size) segbytes)));
 		}
 
 		/* Update state for read */
 		recptr += readbytes;
 
-		sendOff += readbytes;
+		sendSeg->off += readbytes;
 		nbytes -= readbytes;
 		p += readbytes;
 	}
@@ -2526,10 +2523,10 @@ retry:
 		walsnd->needreload = false;
 		SpinLockRelease(&walsnd->mutex);
 
-		if (reload && sendFile >= 0)
+		if (reload && sendSeg->file >= 0)
 		{
-			close(sendFile);
-			sendFile = -1;
+			close(sendSeg->file);
+			sendSeg->file = -1;
 
 			goto retry;
 		}
@@ -2695,9 +2692,9 @@ XLogSendPhysical(void)
 	if (sendTimeLineIsHistoric && sendTimeLineValidUpto <= sentPtr)
 	{
 		/* close the current file. */
-		if (sendFile >= 0)
-			close(sendFile);
-		sendFile = -1;
+		if (sendSeg->file >= 0)
+			close(sendSeg->file);
+		sendSeg->file = -1;
 
 		/* Send CopyDone */
 		pq_putmessage_noblock('c', NULL, 0);
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 33e2ba2a03..6cea9342d6 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -319,7 +319,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	xlogreader->readPageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->seg.tli = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 40c64a0bbf..a16793bb8b 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1105,6 +1105,9 @@ main(int argc, char **argv)
 	if (!xlogreader_state)
 		fatal_error("out of memory");
 
+	/* Finalize the segment pointer. */
+	xlogreader_state->seg.dir = private.inpath;
+
 	/* first find a valid recptr to start from */
 	first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d64a9ad82f..b9d99d524e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -31,6 +31,22 @@
 
 #include "access/xlogrecord.h"
 
+/*
+ * WALOpenSegment represents a WAL segment being read.
+ */
+typedef struct WALOpenSegment
+{
+	int			file;			/* segment file descriptor */
+	XLogSegNo	num;			/* segment number */
+	uint32		off;			/* offset in the segment */
+	TimeLineID	tli;			/* timeline ID of the currently open file */
+	int			size;			/* segment size */
+
+#ifdef FRONTEND
+	char	   *dir;			/* WAL directory */
+#endif
+} WALOpenSegment;
+
 typedef struct XLogReaderState XLogReaderState;
 
 /* Function type definition for the read_page callback */
@@ -76,11 +92,6 @@ struct XLogReaderState
 	 * ----------------------------------------
 	 */
 
-	/*
-	 * Segment size of the to-be-parsed data (mandatory).
-	 */
-	int			wal_segment_size;
-
 	/*
 	 * Data input callback (mandatory).
 	 *
@@ -98,8 +109,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set ->readPageTLI to the TLI of the file the page
-	 * was read from.
+	 * The callback shall set ->seg.tli to the TLI of the file the page was
+	 * read from.
 	 */
 	XLogPageReadCB read_page;
 
@@ -154,10 +165,8 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
-	/* last read segment, segment offset, TLI for data currently in readBuf */
-	XLogSegNo	readSegNo;
-	uint32		readOff;
-	TimeLineID	readPageTLI;
+	/* last read XLOG position for data currently in readBuf */
+	WALOpenSegment seg;
 
 	/*
 	 * beginning of prior page read, and its TLI.  Doesn't necessarily
@@ -217,6 +226,9 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+
+extern void WALOpenSegmentInit(WALOpenSegment *seg, int size);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
-- 
2.20.1

v06-0004-Introduce-InvalidTimeLineID.patchtext/x-diffDownload

From 29d7e8c46fddf2c95b6dc0ff702912be5516e881 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 23 Sep 2019 07:40:49 +0200
Subject: [PATCH 4/6] Introduce InvalidTimeLineID.

This is useful to tell the WALSegmentOpen() callback that it should determine
TLI itself. We can't pass NULL pointer because the function should return the
TLI via the argument in such a case.
---
 src/backend/access/transam/timeline.c | 10 ++++++++--
 src/backend/access/transam/xlog.c     |  1 +
 src/include/access/xlogdefs.h         |  4 ++++
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c2ba480c70..7cb6046835 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -254,6 +254,8 @@ findNewestTimeLine(TimeLineID startTLI)
 	/*
 	 * The algorithm is just to probe for the existence of timeline history
 	 * files.  XXX is it useful to allow gaps in the sequence?
+	 *
+	 * XXX Should we check whether all possible timelines are in use?
 	 */
 	newestTLI = startTLI;
 
@@ -263,9 +265,13 @@ findNewestTimeLine(TimeLineID startTLI)
 		{
 			newestTLI = probeTLI;	/* probeTLI exists */
 		}
-		else
+		else if (probeTLI != MaxTimeLineID)
 		{
-			/* doesn't exist, assume we're done */
+			/*
+			 * Doesn't exist, assume we're done, but do not return
+			 * MaxTimeLineID - if caller incremented that, it'd become
+			 * InvalidTimeLineID.
+			 */
 			break;
 		}
 	}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 031733aba8..c9e01fe82d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7421,6 +7421,7 @@ StartupXLOG(void)
 		Assert(InArchiveRecovery);
 
 		ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
+		Assert(ThisTimeLineID != InvalidTimeLineID);
 		ereport(LOG,
 				(errmsg("selected new timeline ID: %u", ThisTimeLineID)));
 
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index daded3dca0..459dc99d3e 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -51,6 +51,10 @@ typedef uint64 XLogSegNo;
  */
 typedef uint32 TimeLineID;
 
+#define InvalidTimeLineID		((TimeLineID) 0xFFFFFFFF)
+
+#define MaxTimeLineID			((TimeLineID) 0xFFFFFFFE)
+
 /*
  * Replication origin id - this is located in this file to avoid having to
  * include origin.h in a bunch of xlog related places.
-- 
2.20.1

v06-0005-Use-only-xlogreader.c-XLogRead.patchtext/x-diffDownload

From 0c6ae68ccab9507a7294ccd63c1b40561e889e9f Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 23 Sep 2019 07:40:49 +0200
Subject: [PATCH 5/6] Use only xlogreader.c:XLogRead()

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.
---
 src/backend/access/transam/xlogreader.c | 157 ++++++++++++++++++++++++
 src/backend/access/transam/xlogutils.c  |  46 ++++++-
 src/backend/replication/walsender.c     | 139 ++++++++++++++++++++-
 src/bin/pg_waldump/pg_waldump.c         |  64 +++++++++-
 src/include/access/xlogreader.h         |  42 +++++++
 5 files changed, 436 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 4de5530b3e..7f77fa95cb 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -27,6 +29,7 @@
 
 #ifndef FRONTEND
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -1015,6 +1018,160 @@ WALOpenSegmentInit(WALOpenSegment *seg, int size)
 #endif
 }
 
+/*
+ * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'. If
+ * tli_p is passed, get the data from timeline *tli_p. 'pos' is the current
+ * position in the XLOG file and openSegment is a callback that opens the next
+ * segment for reading.
+ *
+ * Returns error information if the data could not be read or NULL if
+ * succeeded.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+XLogReadError *
+XLogRead(char *buf, XLogRecPtr startptr, Size count,
+		 TimeLineID *tli_p, WALOpenSegment *seg, WALSegmentOpen openSegment)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		int			segbytes;
+		int			readbytes;
+
+		seg->off = XLogSegmentOffset(recptr, seg->size);
+
+		if (seg->file < 0 ||
+			!XLByteInSeg(recptr, seg->num, seg->size) ||
+			(tli_p != NULL && *tli_p != seg->tli))
+		{
+			XLogSegNo	nextSegNo;
+			TimeLineID	tli = InvalidTimeLineID;
+			int			file;
+
+			/* Switch to another logfile segment */
+			if (seg->file >= 0)
+				close(seg->file);
+
+			XLByteToSeg(recptr, nextSegNo, seg->size);
+
+			/* If we have the TLI, let's pass it to the callback. */
+			if (tli_p != NULL)
+				tli = *tli_p;
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, &tli, &file, seg);
+
+			/*
+			 * If we passed InvalidTimeLineID, the callback should have
+			 * determined the correct TLI and returned it.
+			 */
+			Assert(tli != InvalidTimeLineID);
+
+			/* Update the open segment info. */
+			seg->tli = tli;
+			seg->file = file;
+
+			/*
+			 * If the function is called by the XLOG reader, the reader will
+			 * eventually set both "num" and "off". However we need to care
+			 * about them too because the function can also be used directly,
+			 * see walsender.c.
+			 */
+			seg->num = nextSegNo;
+			seg->off = 0;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (seg->size - seg->off))
+			segbytes = seg->size - seg->off;
+		else
+			segbytes = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/*
+		 * Failure to read the data does not necessarily imply non-zero errno.
+		 * Set it to zero so that caller can distinguish the failure that does
+		 * not affect errno.
+		 */
+		errno = 0;
+
+		readbytes = pg_pread(seg->file, p, segbytes, seg->off);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+		{
+			XLogReadError *errinfo;
+
+			errinfo = (XLogReadError *) palloc(sizeof(XLogReadError));
+			errinfo->read_errno = errno;
+			errinfo->readbytes = readbytes;
+			errinfo->reqbytes = segbytes;
+			errinfo->seg = seg;
+
+			return errinfo;
+		}
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+
+		/*
+		 * If the function is called by the XLOG reader, the reader will
+		 * eventually set this field. However we need to care about it too
+		 * because the function can also be used directly (see walsender.c).
+		 */
+		seg->off += readbytes;
+	}
+
+	return NULL;
+}
+
+#ifndef FRONTEND
+/*
+ * Backend-specific convenience code to handle read errors encountered by
+ * XLogRead().
+ */
+void
+XLogReadProcessError(XLogReadError *errinfo)
+{
+	WALOpenSegment *seg = errinfo->seg;
+
+	if (errinfo->readbytes < 0)
+	{
+		errno = errinfo->read_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						XLogFileNameP(seg->tli, seg->num), seg->off,
+						(Size) errinfo->reqbytes)));
+	}
+	else
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: length %zu",
+						XLogFileNameP(seg->tli, seg->num), seg->off,
+						(Size) errinfo->reqbytes)));
+	}
+}
+#endif
+
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 424bb06919..38c3196168 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -896,6 +896,39 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+read_local_xlog_page_segment_open(XLogSegNo nextSegNo, TimeLineID *tli_p,
+								  int *file_p, WALOpenSegment *seg)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			file;
+
+	Assert(tli != InvalidTimeLineID);
+
+	XLogFilePath(path, tli, nextSegNo, seg->size);
+	file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (file < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	*file_p = file;
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -915,6 +948,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 				loc;
 	int			count;
 	TimeLineID	pageTLI;
+	XLogReadError *errinfo;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1022,10 +1056,10 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
-			 XLOG_BLCKSZ);
-	state->seg.tli = pageTLI;
-
+	if ((errinfo = XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, &pageTLI,
+							&state->seg,
+							read_local_xlog_page_segment_open)) != NULL)
+		XLogReadProcessError(errinfo);
 	/* number of valid bytes in the buffer */
 	return count;
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a617e20ab6..a7b3e0ecbe 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -247,9 +247,12 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void WalSndSegmentOpen(XLogSegNo nextSegNo, TimeLineID *tli_p,
+							  int *file_p, WALOpenSegment *seg);
 
 
+static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
+
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -763,6 +766,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 {
 	XLogRecPtr	flushptr;
 	int			count;
+	XLogReadError *errinfo;
 
 	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
 	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
@@ -783,7 +787,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if ((errinfo = XLogRead(cur_page,
+							targetPagePtr,
+							XLOG_BLCKSZ,
+							NULL,	/* WalSndSegmentOpen will determine TLI */
+							sendSeg,
+							WalSndSegmentOpen)) != NULL)
+		XLogReadProcessError(errinfo);
 
 	return count;
 }
@@ -2360,7 +2370,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2533,6 +2543,81 @@ retry:
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+void
+WalSndSegmentOpen(XLogSegNo nextSegNo, TimeLineID *tli_p, int *file_p,
+				  WALOpenSegment *seg)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			file;
+
+	/*
+	 * The timeline is determined below, caller should not pass it.
+	 */
+	Assert(tli == InvalidTimeLineID);
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	tli = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, seg->size);
+		if (seg->num == endSegNo)
+			tli = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, tli, nextSegNo, seg->size);
+	file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (file < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(tli, nextSegNo))));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	*file_p = file;
+	*tli_p = tli;
+}
+
 /*
  * Send out the WAL in its normal physical/stored form.
  *
@@ -2550,6 +2635,8 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
+	XLogReadError *errinfo;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2765,7 +2852,51 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if ((errinfo = XLogRead(&output_message.data[output_message.len],
+							startptr,
+							nbytes,
+							NULL,	/* WalSndSegmentOpen will determine TLI */
+							sendSeg,
+							WalSndSegmentOpen)) != NULL)
+		XLogReadProcessError(errinfo);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(startptr, segno, wal_segment_size);
+	CheckXLogRemoved(segno, ThisTimeLineID);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->file >= 0)
+		{
+			close(sendSeg->file);
+			sendSeg->file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a16793bb8b..cd5f589f03 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -296,6 +296,51 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 		fatal_error("could not find any WAL file");
 }
 
+static void
+WALDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli_p, int *file_p,
+				   WALOpenSegment *seg)
+{
+	TimeLineID	tli = *tli_p;
+	char		fname[MAXPGPATH];
+	int			file;
+	int			tries;
+
+	Assert(tli != InvalidTimeLineID);
+
+	XLogFileName(fname, tli, nextSegNo, seg->size);
+
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
+	{
+		file = open_file_in_directory(seg->dir, fname);
+		if (file >= 0)
+			break;
+		if (errno == ENOENT)
+		{
+			int			save_errno = errno;
+
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
+			errno = save_errno;
+			continue;
+		}
+		/* Any other error, fall through and fail */
+		break;
+	}
+
+	if (file < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
+
+	*file_p = file;
+}
+
 /*
  * Read count bytes from a segment file in the specified directory, for the
  * given timeline, containing the specified record pointer; store the data in
@@ -427,6 +472,7 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
+	XLogReadError *errinfo;
 
 	if (private->endptr != InvalidXLogRecPtr)
 	{
@@ -441,8 +487,22 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(private->inpath, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if ((errinfo = XLogRead(readBuff, targetPagePtr, count, &private->timeline,
+							&state->seg, WALDumpOpenSegment)) != NULL)
+	{
+		WALOpenSegment *seg = errinfo->seg;
+		char		fname[MAXPGPATH];
+
+		XLogFileName(fname, seg->tli, seg->num, seg->size);
+
+		if (errno != 0)
+			fatal_error("could not read from log file %s, offset %u, length %zu: %s",
+						fname, seg->off, (Size) errinfo->reqbytes,
+						strerror(errinfo->read_errno));
+		else
+			fatal_error("could not read from log file %s, offset %u: length: %zu",
+						fname, seg->off, (Size) errinfo->reqbytes);
+	}
 
 	return count;
 }
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b9d99d524e..3d9742b81b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -227,8 +227,50 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/*
+ * Callback to open the specified WAL segment for reading.
+ *
+ * "nextSegNo" is the number of the segment to be opened.
+ *
+ * "tli_p" is an input/output argument. If *tli_p is valid, it's the timeline
+ * the new segment should be in. If *tli_p==InvalidTimeLineID, the callback
+ * needs to determine the timeline itself and put the result into *tli_p.
+ *
+ * "file_p" points to an address the segment file descriptor should be stored
+ * at.
+ *
+ * "seg" provides information on the currently open segment. The callback is
+ * not supposed to change this info.
+ *
+ * BasicOpenFile() is the preferred way to open the segment file in backend
+ * code, whereas open(2) should be used in frontend.
+ */
+typedef void (*WALSegmentOpen) (XLogSegNo nextSegNo, TimeLineID *tli_p,
+								int *file_p, WALOpenSegment *seg);
+
 extern void WALOpenSegmentInit(WALOpenSegment *seg, int size);
 
+/*
+ * Error information that both backend and frontend caller can process.
+ *
+ * XXX Should the name be WALReadError? If so, we probably need to rename
+ * XLogRead() and XLogReadProcessError() too.
+ */
+typedef struct XLogReadError
+{
+	int			read_errno;		/* errno set by the last read(). */
+	int			readbytes;		/* Bytes read by the last read(). */
+	int			reqbytes;		/* Bytes requested to be read. */
+	WALOpenSegment *seg;		/* Segment we tried to read from. */
+} XLogReadError;
+
+extern XLogReadError *XLogRead(char *buf, XLogRecPtr startptr, Size count,
+							   TimeLineID *tli_p, WALOpenSegment *seg,
+							   WALSegmentOpen openSegment);
+#ifndef FRONTEND
+void		XLogReadProcessError(XLogReadError *errinfo);
+#endif
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
-- 
2.20.1

v06-0006-Remove-the-old-implemenations-of-XLogRead.patchtext/x-diffDownload

From 0a0dc5fcf6fb570eb6ec4d145fe931beb5d21d93 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 23 Sep 2019 07:40:49 +0200
Subject: [PATCH 6/6] Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the WALSegmentOpen callback) is added
nearby at the same time (the addition and removal of code can get mixed in the
diff).
---
 src/backend/access/transam/xlogutils.c | 125 -----------------
 src/backend/replication/walsender.c    | 187 -------------------------
 src/bin/pg_waldump/pg_waldump.c        | 123 ----------------
 3 files changed, 435 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 38c3196168..078d58e34c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -17,14 +17,11 @@
  */
 #include "postgres.h"
 
-#include <unistd.h>
-
 #include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -639,128 +636,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a7b3e0ecbe..3ea05add53 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -251,8 +251,6 @@ static void WalSndSegmentOpen(XLogSegNo nextSegNo, TimeLineID *tli_p,
 							  int *file_p, WALOpenSegment *seg);
 
 
-static void XLogReadOld(char *buf, XLogRecPtr startptr, Size count);
-
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -2358,191 +2356,6 @@ WalSndKill(int code, Datum arg)
 	SpinLockRelease(&walsnd->mutex);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, wal_segment_size);
-
-		if (sendSeg->file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->num, sendSeg->size))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->file >= 0)
-				close(sendSeg->file);
-
-			XLByteToSeg(recptr, sendSeg->num, sendSeg->size);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSeg->num == endSegNo)
-					sendSeg->tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->tli, sendSeg->num, wal_segment_size);
-
-			sendSeg->file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->tli, sendSeg->num))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->off != startoff)
-		{
-			if (lseek(sendSeg->file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->tli, sendSeg->num),
-								startoff)));
-			sendSeg->off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (wal_segment_size - startoff))
-			segbytes = wal_segment_size - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->tli, sendSeg->num),
-							sendSeg->off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, wal_segment_size);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->file >= 0)
-		{
-			close(sendSeg->file);
-			sendSeg->file = -1;
-
-			goto retry;
-		}
-	}
-}
-
 /*
  * Callback for XLogRead() to open the next segment.
  */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index cd5f589f03..1de8eb350e 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -14,7 +14,6 @@
 
 #include <dirent.h>
 #include <sys/stat.h>
-#include <unistd.h>
 
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
@@ -341,128 +340,6 @@ WALDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli_p, int *file_p,
 	*file_p = file;
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * XLogReader read_page callback
  */
-- 
2.20.1

#21

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Antonin Houska (#20)

Re: Attempt to consolidate reading of XLOG page

I spent a couple of hours on this patchset today. I merged 0001 and
0002, and decided the result was still messier than I would have liked,
so I played with it a bit more -- see attached. I think this is
committable, but I'm afraid it'll cause quite a few conflicts with the
rest of your series.

I had two gripes, which I feel solved with my changes:

1. I didn't like that "dir" and "wal segment size" were part of the
"currently open segment" supporting struct. It seemed that those two
were slightly higher-level, since they apply to every segment that's
going to be opened, not just the current one.

My first thought was to put those as members of XLogReaderState, but
that doesn't work because the physical walsender.c code does not use
xlogreader at all, even though it is reading WAL. Anyway my solution
was to create yet another struct, which for everything that uses
xlogreader is just part of that state struct; and for walsender, it's
just a separate one alongside sendSeg. All in all, this seems pretty
clean.

2. Having the wal dir be #ifdef FRONTEND seemed out of place. I know
the backend code does not use that, but eliding it is more "noisy" than
just setting it to NULL. Also, the "Finalize the segment pointer"
thingy seemed out of place. So my code passes the dir as an argument to
XLogReaderAllocate, and if it's null then we just don't allocate it.
Everybody else can use it to guide things. This results in cleaner
code, because we don't have to handle it externally, which was causing
quite some pain to pg_waldump.

Note that ws_dir member is a char array in the struct, not just a
pointer. This saves trouble trying to allocate it (I mainly did it this
way because we don't have pstrdup_extended(MCXT_ALLOC_NO_OOM) ... yes,
this could be made with palloc+snprintf, but eh, that doesn't seem worth
the trouble.)

Separately from those two API-wise points, there was one bug which meant
that with your 0002+0003 the recovery tests did not pass -- code
placement bug. I suppose the bug disappears with later patches in your
series, which probably is why you didn't notice. This is the fix for that:

-   XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
-   state->seg.tli = pageTLI;
+   state->seg.ws_tli = pageTLI;
+   XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
             XLOG_BLCKSZ);

... Also, yes, I renamed all the struct members.

If you don't have any strong dislikes for these changes, I'll push this
part and let you rebase the remains on top.

Regarding the other patches:

1. I think trying to do palloc(XLogReadError) is a bad idea ... for
example, if the read fails because of system pressure, we might return
"out of memory" during that palloc instead of the real read error. This
particular problem you could forestall by changing to ErrorContext, but
I have the impression that it might be better to have the error struct
by stack-allocated in the caller stack. This forces you to limit the
message string to a maximum size (say 128 bytes or maybe even 1000 bytes
like MAX_ERRORMSG_LEN) but I don't have a problem with that.

2. Not a fan of the InvalidTimeLineID stuff offhand. Maybe it's okay ...
not convinced yet either way.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#22

Alvaro Herrera from 2ndQuadrant

alvherre@alvh.no-ip.org

over 6 years ago

In reply to: Alvaro Herrera (#21)

1 attachment(s)

Re: Attempt to consolidate reading of XLOG page

On 2019-Sep-23, Alvaro Herrera wrote:

I spent a couple of hours on this patchset today. I merged 0001 and
0002, and decided the result was still messier than I would have liked,
so I played with it a bit more -- see attached.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0002-0003.patchtext/x-diff; charset=us-asciiDownload

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 477709bbc2..0635a17eae 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1386,8 +1386,8 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
 
-	xlogreader = XLogReaderAllocate(wal_segment_size, &read_local_xlog_page,
-									NULL);
+	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+									&read_local_xlog_page, NULL);
 	if (!xlogreader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 501f46fd52..6c69eb6dd7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -885,8 +885,7 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 int source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-						 TimeLineID *readTLI);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
@@ -1195,7 +1194,8 @@ XLogInsertRecord(XLogRecData *rdata,
 			appendBinaryStringInfo(&recordBuf, rdata->data, rdata->len);
 
 		if (!debug_reader)
-			debug_reader = XLogReaderAllocate(wal_segment_size, NULL, NULL);
+			debug_reader = XLogReaderAllocate(wal_segment_size, NULL,
+											  NULL, NULL);
 
 		if (!debug_reader)
 		{
@@ -4296,7 +4296,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLByteToSeg(xlogreader->latestPagePtr, segno, wal_segment_size);
 			offset = XLogSegmentOffset(xlogreader->latestPagePtr,
 									   wal_segment_size);
-			XLogFileName(fname, xlogreader->readPageTLI, segno,
+			XLogFileName(fname, xlogreader->seg.ws_tli, segno,
 						 wal_segment_size);
 			ereport(emode_for_corrupt_record(emode,
 											 RecPtr ? RecPtr : EndRecPtr),
@@ -6353,7 +6353,8 @@ StartupXLOG(void)
 
 	/* Set up XLOG reader facility */
 	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
-	xlogreader = XLogReaderAllocate(wal_segment_size, &XLogPageRead, &private);
+	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+									&XLogPageRead, &private);
 	if (!xlogreader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -7355,7 +7356,7 @@ StartupXLOG(void)
 	 * and we were reading the old WAL from a segment belonging to a higher
 	 * timeline.
 	 */
-	EndOfLogTLI = xlogreader->readPageTLI;
+	EndOfLogTLI = xlogreader->seg.ws_tli;
 
 	/*
 	 * Complain if we did not roll forward far enough to render the backup
@@ -11523,7 +11524,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *readTLI)
+			 XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11640,7 +11641,7 @@ retry:
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	*readTLI = curFileTLI;
+	xlogreader->seg.ws_tli = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a66e3324b1..27c27303d6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -68,8 +68,8 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
  * Returns NULL if the xlogreader couldn't be allocated.
  */
 XLogReaderState *
-XLogReaderAllocate(int wal_segment_size, XLogPageReadCB pagereadfunc,
-				   void *private_data)
+XLogReaderAllocate(int wal_segment_size, const char *waldir,
+				   XLogPageReadCB pagereadfunc, void *private_data)
 {
 	XLogReaderState *state;
 
@@ -96,7 +96,10 @@ XLogReaderAllocate(int wal_segment_size, XLogPageReadCB pagereadfunc,
 		return NULL;
 	}
 
-	state->wal_segment_size = wal_segment_size;
+	/* Initialize segment info. */
+	WALOpenSegmentInit(&state->seg, &state->segcxt, wal_segment_size,
+					   waldir);
+
 	state->read_page = pagereadfunc;
 	/* system_identifier initialized to zeroes above */
 	state->private_data = private_data;
@@ -198,6 +201,23 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 	return true;
 }
 
+/*
+ * Initialize the passed segment structs.
+ */
+void
+WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
+				   int segsize, const char *waldir)
+{
+	seg->ws_file = -1;
+	seg->ws_segno = 0;
+	seg->ws_off = 0;
+	seg->ws_tli = 0;
+
+	segcxt->ws_segsize = segsize;
+	if (waldir)
+		snprintf(segcxt->ws_dir, MAXPGPATH, "%s", waldir);
+}
+
 /*
  * Attempt to read an XLOG record.
  *
@@ -490,8 +510,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->wal_segment_size - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->wal_segment_size);
+		state->EndRecPtr += state->segcxt.ws_segsize - 1;
+		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
 	}
 
 	if (DecodeXLogRecord(state, record, errormsg))
@@ -533,12 +553,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 	Assert((pageptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(pageptr, targetSegNo, state->wal_segment_size);
-	targetPageOff = XLogSegmentOffset(pageptr, state->wal_segment_size);
+	XLByteToSeg(pageptr, targetSegNo, state->segcxt.ws_segsize);
+	targetPageOff = XLogSegmentOffset(pageptr, state->segcxt.ws_segsize);
 
 	/* check whether we have all the requested data already */
-	if (targetSegNo == state->readSegNo && targetPageOff == state->readOff &&
-		reqLen <= state->readLen)
+	if (targetSegNo == state->seg.ws_segno &&
+		targetPageOff == state->seg.ws_off && reqLen <= state->readLen)
 		return state->readLen;
 
 	/*
@@ -553,13 +573,13 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->readSegNo && targetPageOff != 0)
+	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
 		readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 
@@ -577,7 +597,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 							   state->currRecPtr,
-							   state->readBuf, &state->readPageTLI);
+							   state->readBuf);
 	if (readLen < 0)
 		goto err;
 
@@ -596,7 +616,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
 								   state->currRecPtr,
-								   state->readBuf, &state->readPageTLI);
+								   state->readBuf);
 		if (readLen < 0)
 			goto err;
 	}
@@ -608,8 +628,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		goto err;
 
 	/* update read state information */
-	state->readSegNo = targetSegNo;
-	state->readOff = targetPageOff;
+	state->seg.ws_segno = targetSegNo;
+	state->seg.ws_off = targetPageOff;
 	state->readLen = readLen;
 
 	return readLen;
@@ -625,8 +645,8 @@ err:
 static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
-	state->readSegNo = 0;
-	state->readOff = 0;
+	state->seg.ws_segno = 0;
+	state->seg.ws_off = 0;
 	state->readLen = 0;
 }
 
@@ -745,16 +765,16 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	Assert((recptr % XLOG_BLCKSZ) == 0);
 
-	XLByteToSeg(recptr, segno, state->wal_segment_size);
-	offset = XLogSegmentOffset(recptr, state->wal_segment_size);
+	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
+	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
-	XLogSegNoOffsetToRecPtr(segno, offset, state->wal_segment_size, recaddr);
+	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
 							  "invalid magic number %04X in log segment %s, offset %u",
@@ -768,7 +788,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
 							  "invalid info bits %04X in log segment %s, offset %u",
@@ -791,7 +811,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 								  (unsigned long long) state->system_identifier);
 			return false;
 		}
-		else if (longhdr->xlp_seg_size != state->wal_segment_size)
+		else if (longhdr->xlp_seg_size != state->segcxt.ws_segsize)
 		{
 			report_invalid_record(state,
 								  "WAL file is from different database system: incorrect segment size in page header");
@@ -808,7 +828,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
@@ -828,7 +848,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	{
 		char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
 							  "unexpected pageaddr %X/%X in log segment %s, offset %u",
@@ -853,7 +873,7 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		{
 			char		fname[MAXFNAMELEN];
 
-			XLogFileName(fname, state->readPageTLI, segno, state->wal_segment_size);
+			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 			report_invalid_record(state,
 								  "out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
@@ -997,7 +1017,6 @@ out:
 
 #endif							/* FRONTEND */
 
-
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 1fc39333f1..d3967c64b2 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -802,8 +802,8 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
 void
 XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 {
-	const XLogRecPtr lastReadPage = state->readSegNo *
-	state->wal_segment_size + state->readOff;
+	const XLogRecPtr lastReadPage = state->seg.ws_segno *
+	state->segcxt.ws_segsize + state->seg.ws_off;
 
 	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
 	Assert(wantLength <= XLOG_BLCKSZ);
@@ -847,8 +847,8 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	if (state->currTLIValidUntil != InvalidXLogRecPtr &&
 		state->currTLI != ThisTimeLineID &&
 		state->currTLI != 0 &&
-		((wantPage + wantLength) / state->wal_segment_size) <
-		(state->currTLIValidUntil / state->wal_segment_size))
+		((wantPage + wantLength) / state->segcxt.ws_segsize) <
+		(state->currTLIValidUntil / state->segcxt.ws_segsize))
 		return;
 
 	/*
@@ -869,12 +869,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 		 * by a promotion or replay from a cascaded replica.
 		 */
 		List	   *timelineHistory = readTimeLineHistory(ThisTimeLineID);
+		XLogRecPtr	endOfSegment;
 
-		XLogRecPtr	endOfSegment = (((wantPage / state->wal_segment_size) + 1)
-									* state->wal_segment_size) - 1;
-
-		Assert(wantPage / state->wal_segment_size ==
-			   endOfSegment / state->wal_segment_size);
+		endOfSegment = ((wantPage / state->segcxt.ws_segsize) + 1) *
+			state->segcxt.ws_segsize - 1;
+		Assert(wantPage / state->segcxt.ws_segsize ==
+			   endOfSegment / state->segcxt.ws_segsize);
 
 		/*
 		 * Find the timeline of the last LSN on the segment containing
@@ -909,12 +909,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
-					 TimeLineID *pageTLI)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	read_upto,
 				loc;
 	int			count;
+	TimeLineID	pageTLI;
 
 	loc = targetPagePtr + reqLen;
 
@@ -934,7 +934,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
 
-		*pageTLI = ThisTimeLineID;
+		pageTLI = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -991,7 +991,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			 * nothing cares so long as the timeline doesn't go backwards.  We
 			 * should read the page header instead; FIXME someday.
 			 */
-			*pageTLI = state->currTLI;
+			pageTLI = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1022,7 +1022,8 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr,
+	state->seg.ws_tli = pageTLI;
+	XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
 			 XLOG_BLCKSZ);
 
 	/* number of valid bytes in the buffer */
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f8b9020081..da265f5294 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -173,7 +173,7 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, read_page, ctx);
+	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
 	if (!ctx->reader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index d974400d6e..d1cf80d441 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -116,10 +116,10 @@ check_permissions(void)
 
 int
 logical_read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+							 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
 	return read_local_xlog_page(state, targetPagePtr, reqLen,
-								targetRecPtr, cur_page, pageTLI);
+								targetRecPtr, cur_page);
 }
 
 /*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 23870a25a5..eb4a98cc91 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -128,16 +128,8 @@ bool		log_replication_commands = false;
  */
 bool		wake_wal_senders = false;
 
-/*
- * These variables are used similarly to openLogFile/SegNo/Off,
- * but for walsender to read the XLOG.
- */
-static int	sendFile = -1;
-static XLogSegNo sendSegNo = 0;
-static uint32 sendOff = 0;
-
-/* Timeline ID of the currently open file */
-static TimeLineID curFileTimeLine = 0;
+static WALOpenSegment *sendSeg = NULL;
+static WALSegmentContext *sendCxt = NULL;
 
 /*
  * These variables keep track of the state of the timeline we're currently
@@ -256,7 +248,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -285,6 +277,13 @@ InitWalSender(void)
 
 	/* Initialize empty timestamp buffer for lag tracking. */
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
+
+	/* Make sure we can remember the current read position in XLOG. */
+	sendSeg = (WALOpenSegment *)
+		MemoryContextAlloc(TopMemoryContext, sizeof(WALOpenSegment));
+	sendCxt = (WALSegmentContext *)
+		MemoryContextAlloc(TopMemoryContext, sizeof(WALSegmentContext));
+	WALOpenSegmentInit(sendSeg, sendCxt, wal_segment_size, NULL);
 }
 
 /*
@@ -301,10 +300,10 @@ WalSndErrorCleanup(void)
 	ConditionVariableCancelSleep();
 	pgstat_report_wait_end();
 
-	if (sendFile >= 0)
+	if (sendSeg->ws_file >= 0)
 	{
-		close(sendFile);
-		sendFile = -1;
+		close(sendSeg->ws_file);
+		sendSeg->ws_file = -1;
 	}
 
 	if (MyReplicationSlot != NULL)
@@ -763,7 +762,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
+					   XLogRecPtr targetRecPtr, char *cur_page)
 {
 	XLogRecPtr	flushptr;
 	int			count;
@@ -787,7 +786,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+	XLogRead(sendCxt, cur_page, targetPagePtr, XLOG_BLCKSZ);
 
 	return count;
 }
@@ -2364,7 +2363,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2382,17 +2381,18 @@ retry:
 		int			segbytes;
 		int			readbytes;
 
-		startoff = XLogSegmentOffset(recptr, wal_segment_size);
+		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, wal_segment_size))
+		if (sendSeg->ws_file < 0 ||
+			!XLByteInSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize))
 		{
 			char		path[MAXPGPATH];
 
 			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
+			if (sendSeg->ws_file >= 0)
+				close(sendSeg->ws_file);
 
-			XLByteToSeg(recptr, sendSegNo, wal_segment_size);
+			XLByteToSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize);
 
 			/*-------
 			 * When reading from a historic timeline, and there is a timeline
@@ -2420,20 +2420,20 @@ retry:
 			 * used portion of the old segment is copied to the new file.
 			 *-------
 			 */
-			curFileTimeLine = sendTimeLine;
+			sendSeg->ws_tli = sendTimeLine;
 			if (sendTimeLineIsHistoric)
 			{
 				XLogSegNo	endSegNo;
 
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, wal_segment_size);
-				if (sendSegNo == endSegNo)
-					curFileTimeLine = sendTimeLineNextTLI;
+				XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
+				if (sendSeg->ws_segno == endSegNo)
+					sendSeg->ws_tli = sendTimeLineNextTLI;
 			}
 
-			XLogFilePath(path, curFileTimeLine, sendSegNo, wal_segment_size);
+			XLogFilePath(path, sendSeg->ws_tli, sendSeg->ws_segno, segcxt->ws_segsize);
 
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendFile < 0)
+			sendSeg->ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+			if (sendSeg->ws_file < 0)
 			{
 				/*
 				 * If the file is not found, assume it's because the standby
@@ -2444,58 +2444,58 @@ retry:
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(curFileTimeLine, sendSegNo))));
+									XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno))));
 				else
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("could not open file \"%s\": %m",
 									path)));
 			}
-			sendOff = 0;
+			sendSeg->ws_off = 0;
 		}
 
 		/* Need to seek in the file? */
-		if (sendOff != startoff)
+		if (sendSeg->ws_off != startoff)
 		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
+			if (lseek(sendSeg->ws_file, (off_t) startoff, SEEK_SET) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(curFileTimeLine, sendSegNo),
+								XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
 								startoff)));
-			sendOff = startoff;
+			sendSeg->ws_off = startoff;
 		}
 
 		/* How many bytes are within this segment? */
-		if (nbytes > (wal_segment_size - startoff))
-			segbytes = wal_segment_size - startoff;
+		if (nbytes > (segcxt->ws_segsize - startoff))
+			segbytes = segcxt->ws_segsize - startoff;
 		else
 			segbytes = nbytes;
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
+		readbytes = read(sendSeg->ws_file, p, segbytes);
 		pgstat_report_wait_end();
 		if (readbytes < 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, (Size) segbytes)));
+							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
+							sendSeg->ws_off, (Size) segbytes)));
 		}
 		else if (readbytes == 0)
 		{
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(curFileTimeLine, sendSegNo),
-							sendOff, readbytes, (Size) segbytes)));
+							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
+							sendSeg->ws_off, readbytes, (Size) segbytes)));
 		}
 
 		/* Update state for read */
 		recptr += readbytes;
 
-		sendOff += readbytes;
+		sendSeg->ws_off += readbytes;
 		nbytes -= readbytes;
 		p += readbytes;
 	}
@@ -2507,7 +2507,7 @@ retry:
 	 * read() succeeds in that case, but the data we tried to read might
 	 * already have been overwritten with new WAL records.
 	 */
-	XLByteToSeg(startptr, segno, wal_segment_size);
+	XLByteToSeg(startptr, segno, segcxt->ws_segsize);
 	CheckXLogRemoved(segno, ThisTimeLineID);
 
 	/*
@@ -2526,10 +2526,10 @@ retry:
 		walsnd->needreload = false;
 		SpinLockRelease(&walsnd->mutex);
 
-		if (reload && sendFile >= 0)
+		if (reload && sendSeg->ws_file >= 0)
 		{
-			close(sendFile);
-			sendFile = -1;
+			close(sendSeg->ws_file);
+			sendSeg->ws_file = -1;
 
 			goto retry;
 		}
@@ -2695,9 +2695,9 @@ XLogSendPhysical(void)
 	if (sendTimeLineIsHistoric && sendTimeLineValidUpto <= sentPtr)
 	{
 		/* close the current file. */
-		if (sendFile >= 0)
-			close(sendFile);
-		sendFile = -1;
+		if (sendSeg->ws_file >= 0)
+			close(sendSeg->ws_file);
+		sendSeg->ws_file = -1;
 
 		/* Send CopyDone */
 		pq_putmessage_noblock('c', NULL, 0);
@@ -2768,7 +2768,7 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+	XLogRead(sendCxt, &output_message.data[output_message.len], startptr, nbytes);
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 63c3879ead..264a8f4db5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -43,14 +43,12 @@ static char xlogfpath[MAXPGPATH];
 
 typedef struct XLogPageReadPrivate
 {
-	const char *datadir;
 	int			tliIndex;
 } XLogPageReadPrivate;
 
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-							   TimeLineID *pageTLI);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -66,9 +64,8 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 	char	   *errormsg;
 	XLogPageReadPrivate private;
 
-	private.datadir = datadir;
 	private.tliIndex = tliIndex;
-	xlogreader = XLogReaderAllocate(WalSegSz, &SimpleXLogPageRead,
+	xlogreader = XLogReaderAllocate(WalSegSz, datadir, &SimpleXLogPageRead,
 									&private);
 	if (xlogreader == NULL)
 		pg_fatal("out of memory");
@@ -119,9 +116,8 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex)
 	XLogPageReadPrivate private;
 	XLogRecPtr	endptr;
 
-	private.datadir = datadir;
 	private.tliIndex = tliIndex;
-	xlogreader = XLogReaderAllocate(WalSegSz, &SimpleXLogPageRead,
+	xlogreader = XLogReaderAllocate(WalSegSz, datadir, &SimpleXLogPageRead,
 									&private);
 	if (xlogreader == NULL)
 		pg_fatal("out of memory");
@@ -177,9 +173,8 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 			forkptr += SizeOfXLogShortPHD;
 	}
 
-	private.datadir = datadir;
 	private.tliIndex = tliIndex;
-	xlogreader = XLogReaderAllocate(WalSegSz, &SimpleXLogPageRead,
+	xlogreader = XLogReaderAllocate(WalSegSz, datadir, &SimpleXLogPageRead,
 									&private);
 	if (xlogreader == NULL)
 		pg_fatal("out of memory");
@@ -237,8 +232,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogReader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
-				   TimeLineID *pageTLI)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -283,7 +277,8 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 		XLogFileName(xlogfname, targetHistory[private->tliIndex].tli,
 					 xlogreadsegno, WalSegSz);
 
-		snprintf(xlogfpath, MAXPGPATH, "%s/" XLOGDIR "/%s", private->datadir, xlogfname);
+		snprintf(xlogfpath, MAXPGPATH, "%s/" XLOGDIR "/%s",
+				 xlogreader->segcxt.ws_dir, xlogfname);
 
 		xlogreadfd = open(xlogfpath, O_RDONLY | PG_BINARY, 0);
 
@@ -321,7 +316,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	*pageTLI = targetHistory[private->tliIndex].tli;
+	xlogreader->seg.ws_tli = targetHistory[private->tliIndex].tli;
 	return XLOG_BLCKSZ;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index b95d467805..b79208cd73 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -33,7 +33,6 @@ static int	WalSegSz;
 typedef struct XLogDumpPrivate
 {
 	TimeLineID	timeline;
-	char	   *inpath;
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	bool		endptr_reached;
@@ -224,7 +223,7 @@ search_directory(const char *directory, const char *fname)
 }
 
 /*
- * Identify the target directory and set WalSegSz.
+ * Identify the target directory.
  *
  * Try to find the file in several places:
  * if directory != NULL:
@@ -235,29 +234,22 @@ search_directory(const char *directory, const char *fname)
  *	 XLOGDIR /
  *	 $PGDATA / XLOGDIR /
  *
- * Set the valid target directory in private->inpath.
+ * The valid target directory is returned.
  */
-static void
-identify_target_directory(XLogDumpPrivate *private, char *directory,
-						  char *fname)
+static char *
+identify_target_directory(char *directory, char *fname)
 {
 	char		fpath[MAXPGPATH];
 
 	if (directory != NULL)
 	{
 		if (search_directory(directory, fname))
-		{
-			private->inpath = pg_strdup(directory);
-			return;
-		}
+			return pg_strdup(directory);
 
 		/* directory / XLOGDIR */
 		snprintf(fpath, MAXPGPATH, "%s/%s", directory, XLOGDIR);
 		if (search_directory(fpath, fname))
-		{
-			private->inpath = pg_strdup(fpath);
-			return;
-		}
+			return pg_strdup(fpath);
 	}
 	else
 	{
@@ -265,16 +257,10 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 
 		/* current directory */
 		if (search_directory(".", fname))
-		{
-			private->inpath = pg_strdup(".");
-			return;
-		}
+			return pg_strdup(".");
 		/* XLOGDIR */
 		if (search_directory(XLOGDIR, fname))
-		{
-			private->inpath = pg_strdup(XLOGDIR);
-			return;
-		}
+			return pg_strdup(XLOGDIR);
 
 		datadir = getenv("PGDATA");
 		/* $PGDATA / XLOGDIR */
@@ -282,10 +268,7 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 		{
 			snprintf(fpath, MAXPGPATH, "%s/%s", datadir, XLOGDIR);
 			if (search_directory(fpath, fname))
-			{
-				private->inpath = pg_strdup(fpath);
-				return;
-			}
+				return pg_strdup(fpath);
 		}
 	}
 
@@ -294,6 +277,8 @@ identify_target_directory(XLogDumpPrivate *private, char *directory,
 		fatal_error("could not locate WAL file \"%s\"", fname);
 	else
 		fatal_error("could not find any WAL file");
+
+	return NULL;				/* not reached */
 }
 
 /*
@@ -423,7 +408,7 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
  */
 static int
 XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				 XLogRecPtr targetPtr, char *readBuff, TimeLineID *curFileTLI)
+				 XLogRecPtr targetPtr, char *readBuff)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
@@ -441,7 +426,7 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(private->inpath, private->timeline, targetPagePtr,
+	XLogDumpXLogRead(state->segcxt.ws_dir, private->timeline, targetPagePtr,
 					 readBuff, count);
 
 	return count;
@@ -820,6 +805,7 @@ main(int argc, char **argv)
 	XLogDumpStats stats;
 	XLogRecord *record;
 	XLogRecPtr	first_record;
+	char	   *waldir = NULL;
 	char	   *errormsg;
 
 	static struct option long_options[] = {
@@ -912,7 +898,7 @@ main(int argc, char **argv)
 				}
 				break;
 			case 'p':
-				private.inpath = pg_strdup(optarg);
+				waldir = pg_strdup(optarg);
 				break;
 			case 'r':
 				{
@@ -994,13 +980,13 @@ main(int argc, char **argv)
 		goto bad_argument;
 	}
 
-	if (private.inpath != NULL)
+	if (waldir != NULL)
 	{
 		/* validate path points to directory */
-		if (!verify_directory(private.inpath))
+		if (!verify_directory(waldir))
 		{
 			pg_log_error("path \"%s\" could not be opened: %s",
-						 private.inpath, strerror(errno));
+						 waldir, strerror(errno));
 			goto bad_argument;
 		}
 	}
@@ -1015,17 +1001,17 @@ main(int argc, char **argv)
 
 		split_path(argv[optind], &directory, &fname);
 
-		if (private.inpath == NULL && directory != NULL)
+		if (waldir == NULL && directory != NULL)
 		{
-			private.inpath = directory;
+			waldir = directory;
 
-			if (!verify_directory(private.inpath))
+			if (!verify_directory(waldir))
 				fatal_error("could not open directory \"%s\": %s",
-							private.inpath, strerror(errno));
+							waldir, strerror(errno));
 		}
 
-		identify_target_directory(&private, private.inpath, fname);
-		fd = open_file_in_directory(private.inpath, fname);
+		waldir = identify_target_directory(waldir, fname);
+		fd = open_file_in_directory(waldir, fname);
 		if (fd < 0)
 			fatal_error("could not open file \"%s\"", fname);
 		close(fd);
@@ -1056,7 +1042,7 @@ main(int argc, char **argv)
 			/* ignore directory, already have that */
 			split_path(argv[optind + 1], &directory, &fname);
 
-			fd = open_file_in_directory(private.inpath, fname);
+			fd = open_file_in_directory(waldir, fname);
 			if (fd < 0)
 				fatal_error("could not open file \"%s\"", fname);
 			close(fd);
@@ -1088,7 +1074,7 @@ main(int argc, char **argv)
 		}
 	}
 	else
-		identify_target_directory(&private, private.inpath, NULL);
+		waldir = identify_target_directory(waldir, NULL);
 
 	/* we don't know what to print */
 	if (XLogRecPtrIsInvalid(private.startptr))
@@ -1100,7 +1086,7 @@ main(int argc, char **argv)
 	/* done with argument parsing, do the actual work */
 
 	/* we have everything we need, start reading */
-	xlogreader_state = XLogReaderAllocate(WalSegSz, XLogDumpReadPage,
+	xlogreader_state = XLogReaderAllocate(WalSegSz, waldir, XLogDumpReadPage,
 										  &private);
 	if (!xlogreader_state)
 		fatal_error("out of memory");
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 735b1bd2fd..c5604f6841 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -31,6 +31,22 @@
 
 #include "access/xlogrecord.h"
 
+/* WALOpenSegment represents a WAL segment being read. */
+typedef struct WALOpenSegment
+{
+	int			ws_file;		/* segment file descriptor */
+	XLogSegNo	ws_segno;		/* segment number */
+	uint32		ws_off;			/* offset in the segment */
+	TimeLineID	ws_tli;			/* timeline ID of the currently open file */
+} WALOpenSegment;
+
+/* WALSegmentContext carries context information about WAL segments to read */
+typedef struct WALSegmentContext
+{
+	char		ws_dir[MAXPGPATH];
+	int			ws_segsize;
+} WALSegmentContext;
+
 typedef struct XLogReaderState XLogReaderState;
 
 /* Function type definition for the read_page callback */
@@ -38,8 +54,7 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf,
-							   TimeLineID *pageTLI);
+							   char *readBuf);
 
 typedef struct
 {
@@ -77,11 +92,6 @@ struct XLogReaderState
 	 * ----------------------------------------
 	 */
 
-	/*
-	 * Segment size of the to-be-parsed data (mandatory).
-	 */
-	int			wal_segment_size;
-
 	/*
 	 * Data input callback (mandatory).
 	 *
@@ -99,9 +109,8 @@ struct XLogReaderState
 	 * actual WAL record it's interested in.  In that case, targetRecPtr can
 	 * be used to determine which timeline to read the page from.
 	 *
-	 * The callback shall set *pageTLI to the TLI of the file the page was
-	 * read from.  It is currently used only for error reporting purposes, to
-	 * reconstruct the name of the WAL file where an error occurred.
+	 * The callback shall set ->seg.ws_tli to the TLI of the file the page was
+	 * read from.
 	 */
 	XLogPageReadCB read_page;
 
@@ -156,10 +165,9 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
-	/* last read segment, segment offset, TLI for data currently in readBuf */
-	XLogSegNo	readSegNo;
-	uint32		readOff;
-	TimeLineID	readPageTLI;
+	/* last read XLOG position for data currently in readBuf */
+	WALSegmentContext segcxt;
+	WALOpenSegment seg;
 
 	/*
 	 * beginning of prior page read, and its TLI.  Doesn't necessarily
@@ -202,12 +210,17 @@ struct XLogReaderState
 
 /* Get a new XLogReader */
 extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
+										   const char *waldir,
 										   XLogPageReadCB pagereadfunc,
 										   void *private_data);
 
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Initialize supporting structures */
+extern void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
+							   int segsize, const char *waldir);
+
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 										 XLogRecPtr recptr, char **errormsg);
@@ -219,6 +232,7 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 4105b59904..e85aa0bdc8 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -49,10 +49,8 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page,
-								 TimeLineID *pageTLI);
+								 XLogRecPtr targetRecPtr, char *cur_page);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 									  XLogRecPtr wantPage, uint32 wantLength);
-
 #endif
diff --git a/src/include/replication/logicalfuncs.h b/src/include/replication/logicalfuncs.h
index a9c178a9e6..012096f183 100644
--- a/src/include/replication/logicalfuncs.h
+++ b/src/include/replication/logicalfuncs.h
@@ -14,6 +14,6 @@
 extern int	logical_read_local_xlog_page(XLogReaderState *state,
 										 XLogRecPtr targetPagePtr,
 										 int reqLen, XLogRecPtr targetRecPtr,
-										 char *cur_page, TimeLineID *pageTLI);
+										 char *cur_page);
 
 #endif

#23

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Alvaro Herrera (#21)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I spent a couple of hours on this patchset today. I merged 0001 and
0002, and decided the result was still messier than I would have liked,
so I played with it a bit more -- see attached. I think this is
committable, but I'm afraid it'll cause quite a few conflicts with the
rest of your series.

I had two gripes, which I feel solved with my changes:

1. I didn't like that "dir" and "wal segment size" were part of the
"currently open segment" supporting struct. It seemed that those two
were slightly higher-level, since they apply to every segment that's
going to be opened, not just the current one.

My first thought was to put those as members of XLogReaderState, but
that doesn't work because the physical walsender.c code does not use
xlogreader at all, even though it is reading WAL.

`I don't remember clearly but I think that this was the reason I tried to move
"wal_segment_size" away from XLogReaderState.

Separately from those two API-wise points, there was one bug which meant
that with your 0002+0003 the recovery tests did not pass -- code
placement bug. I suppose the bug disappears with later patches in your
series, which probably is why you didn't notice. This is the fix for that:
-   XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
-   state->seg.tli = pageTLI;
+   state->seg.ws_tli = pageTLI;
+   XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
XLOG_BLCKSZ);

Yes, it seems so - the following parts ensure that XLogRead() adjusts the
timeline itself. I only checked that the each part of the series keeps the
source tree compilable. Thanks for fixing.

... Also, yes, I renamed all the struct members.

If you don't have any strong dislikes for these changes, I'll push this
part and let you rebase the remains on top.

No objections here.

2. Not a fan of the InvalidTimeLineID stuff offhand. Maybe it's okay ...
not convinced yet either way.

Well, it seems that the individual callbacks only use this constant in
Assert() statements. I'll consider if we really need it. The argument value
should not determine whether the callback derives the TLI or not.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#24

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Antonin Houska (#23)

Re: Attempt to consolidate reading of XLOG page

On 2019-Sep-24, Antonin Houska wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

If you don't have any strong dislikes for these changes, I'll push this
part and let you rebase the remains on top.

No objections here.

oK, pushed. Please rebase the other parts.

I made one small adjustment: in read_local_xlog_page() there was one
*readTLI output parameter that was being changed to a local variable
plus later assigment to the output struct member; I changed the code to
continue to assign directly to the output variable instead. There was
an error case in which the TLI was not assigned to; I suppose this
doesn't really change things (we don't examine the TLI in that case, do
we?), but it seemed dangerous to leave like that.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#25

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Alvaro Herrera (#24)

2 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Sep-24, Antonin Houska wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

If you don't have any strong dislikes for these changes, I'll push this
part and let you rebase the remains on top.

No objections here.

oK, pushed. Please rebase the other parts.

Thanks!

I made one small adjustment: in read_local_xlog_page() there was one
*readTLI output parameter that was being changed to a local variable
plus later assigment to the output struct member; I changed the code to
continue to assign directly to the output variable instead. There was
an error case in which the TLI was not assigned to; I suppose this
doesn't really change things (we don't examine the TLI in that case, do
we?), but it seemed dangerous to leave like that.

I used the local variable to make some expressions simpler, but missed the
fact that this way I can leave the ws_tli field unassigned if the function
returns prematurely. Now that I look closer, I see that it can be a problem -
in the case of ERROR, XLogReadRecord() does reset the state, but it does not
reset the TLI:

err:
/*
* Invalidate the read state. We might read from a different source after
* failure.
*/
XLogReaderInvalReadState(state);

Thus the TLI appears to be important even on ERROR, and what you've done is
correct. Thanks for fixing that.

One comment on the remaining part of the series:

Before this refactoring, the walsender.c:XLogRead() function contained these
lines

/*
* After reading into the buffer, check that what we read was valid. We do
* this after reading, because even though the segment was present when we
* opened it, it might get recycled or removed while we read it. The
* read() succeeds in that case, but the data we tried to read might
* already have been overwritten with new WAL records.
*/
XLByteToSeg(startptr, segno, segcxt->ws_segsize);
CheckXLogRemoved(segno, ThisTimeLineID);

but they don't fit into the new, generic implementation, so I copied these
lines to the two places right after the call of the new XLogRead(). However I
was not sure if ThisTimeLineID was ever correct here. It seems the original
walsender.c:XLogRead() implementation did not update ThisTimeLineID (and
therefore neither the new callback WalSndSegmentOpen() does), so both
logical_read_xlog_page() and XLogSendPhysical() could read the data from
another (historic) timeline. I think we should check the segment we really
read data from:

CheckXLogRemoved(segno, sendSeg->ws_tli);

The rebased code is attached.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v07-0005-Use-only-xlogreader.c-XLogRead.patchtext/x-diffDownload

From c8017f48a186be9e0f879573b0c0880e6cc7df44 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Thu, 26 Sep 2019 13:51:53 +0200
Subject: [PATCH 1/2] Use only xlogreader.c:XLogRead()

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.
---
 src/backend/access/transam/xlogreader.c | 153 ++++++++++++++++++++++++
 src/backend/access/transam/xlogutils.c  |  45 ++++++-
 src/backend/replication/walsender.c     | 138 ++++++++++++++++++++-
 src/bin/pg_waldump/pg_waldump.c         |  60 +++++++++-
 src/include/access/xlogreader.h         |  47 ++++++++
 5 files changed, 433 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c8b0d2303d..a58c5d1633 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -27,6 +29,7 @@
 
 #ifndef FRONTEND
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -1016,6 +1019,156 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'. If
+ * tli_p is passed, get the data from timeline *tli_p. 'pos' is the current
+ * position in the XLOG file and openSegment is a callback that opens the next
+ * segment for reading.
+ *
+ * Returns error information if the data could not be read or NULL if
+ * succeeded.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+XLogReadError *
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
+		 WALOpenSegment *seg, WALSegmentContext *segcxt,
+		 WALSegmentOpen openSegment)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+	static XLogReadError errinfo;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		int			segbytes;
+		int			readbytes;
+
+		seg->ws_off = XLogSegmentOffset(recptr, segcxt->ws_segsize);
+
+		if (seg->ws_file < 0 ||
+			!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
+			(tli_p != NULL && *tli_p != seg->ws_tli))
+		{
+			XLogSegNo	nextSegNo;
+			TimeLineID	tli;
+			int			file;
+
+			/* Switch to another logfile segment */
+			if (seg->ws_file >= 0)
+				close(seg->ws_file);
+
+			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
+
+			/*
+			 * If we have the TLI, let's pass it to the callback. If NULL is
+			 * passed, the callback has to find the TLI itself.
+			 */
+			if (tli_p != NULL)
+				tli = *tli_p;
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, &tli, &file, seg, segcxt);
+
+			/* Update the open segment info. */
+			seg->ws_tli = tli;
+			seg->ws_file = file;
+
+			/*
+			 * If the function is called by the XLOG reader, the reader will
+			 * eventually set both "ws_segno" and "ws_off", however the XLOG
+			 * reader is not necessarily involved. Furthermore, we need to set
+			 * the current values for this function to work.
+			 */
+			seg->ws_segno = nextSegNo;
+			seg->ws_off = 0;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (segcxt->ws_segsize - seg->ws_off))
+			segbytes = segcxt->ws_segsize - seg->ws_off;
+		else
+			segbytes = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/*
+		 * Failure to read the data does not necessarily imply non-zero errno.
+		 * Set it to zero so that caller can distinguish the failure that does
+		 * not affect errno.
+		 */
+		errno = 0;
+
+		readbytes = pg_pread(seg->ws_file, p, segbytes, seg->ws_off);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+		{
+			errinfo.read_errno = errno;
+			errinfo.readbytes = readbytes;
+			errinfo.reqbytes = segbytes;
+			errinfo.seg = seg;
+			return &errinfo;
+		}
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+
+		/*
+		 * If the function is called by the XLOG reader, the reader will
+		 * eventually set this field. However we need to care about it too
+		 * because the function can also be used directly (see walsender.c).
+		 */
+		seg->ws_off += readbytes;
+	}
+
+	return NULL;
+}
+
+#ifndef FRONTEND
+/*
+ * Backend-specific convenience code to handle read errors encountered by
+ * XLogRead().
+ */
+void
+XLogReadProcessError(XLogReadError *errinfo)
+{
+	WALOpenSegment *seg = errinfo->seg;
+
+	if (errinfo->readbytes < 0)
+	{
+		errno = errinfo->read_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						XLogFileNameP(seg->ws_tli, seg->ws_segno),
+						seg->ws_off, (Size) errinfo->reqbytes)));
+	}
+	else
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: length %zu",
+						XLogFileNameP(seg->ws_tli, seg->ws_segno),
+						seg->ws_off,
+						(Size) errinfo->reqbytes)));
+	}
+}
+#endif
+
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba75d..09d42d3112 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -896,6 +896,38 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+read_local_xlog_page_segment_open(XLogSegNo nextSegNo, TimeLineID *tli_p,
+								  int *file_p, WALOpenSegment *seg,
+								  WALSegmentContext *segcxt)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			file;
+
+	XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+	file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (file < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	*file_p = file;
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -914,6 +946,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	XLogRecPtr	read_upto,
 				loc;
 	int			count;
+	XLogReadError *errinfo;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1020,8 +1053,12 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
-			 XLOG_BLCKSZ);
+	if ((errinfo = XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ,
+							&state->seg.ws_tli,
+							&state->seg,
+							&state->segcxt,
+							read_local_xlog_page_segment_open)) != NULL)
+		XLogReadProcessError(errinfo);
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index eb4a98cc91..dcb84693fc 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,9 +248,14 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
+static void WalSndSegmentOpen(XLogSegNo nextSegNo, TimeLineID *tli_p,
+							  int *file_p, WALOpenSegment *seg,
+							  WALSegmentContext *segcxt);
 
 
+static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
+						XLogRecPtr startptr, Size count);
+
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -766,6 +771,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 {
 	XLogRecPtr	flushptr;
 	int			count;
+	XLogReadError *errinfo;
+	XLogSegNo	segno;
 
 	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
 	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
@@ -786,7 +793,24 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(sendCxt, cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if ((errinfo = XLogRead(cur_page,
+							targetPagePtr,
+							XLOG_BLCKSZ,
+							NULL,	/* WalSndSegmentOpen will determine TLI */
+							sendSeg,
+							sendCxt,
+							WalSndSegmentOpen)) != NULL)
+		XLogReadProcessError(errinfo);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(targetPagePtr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
 
 	return count;
 }
@@ -2363,7 +2387,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2536,6 +2560,71 @@ retry:
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+void
+WalSndSegmentOpen(XLogSegNo nextSegNo, TimeLineID *tli_p, int *file_p,
+				  WALOpenSegment *seg, WALSegmentContext *segcxt)
+{
+	char		path[MAXPGPATH];
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	*tli_p = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
+		if (seg->ws_segno == endSegNo)
+			*tli_p = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, *tli_p, nextSegNo, segcxt->ws_segsize);
+	*file_p = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (*file_p < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(*tli_p, nextSegNo))));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+}
+
 /*
  * Send out the WAL in its normal physical/stored form.
  *
@@ -2553,6 +2642,8 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
+	XLogReadError *errinfo;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2768,7 +2859,46 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(sendCxt, &output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if ((errinfo = XLogRead(&output_message.data[output_message.len],
+							startptr,
+							nbytes,
+							NULL,	/* WalSndSegmentOpen will determine TLI */
+							sendSeg,
+							sendCxt,
+							WalSndSegmentOpen)) != NULL)
+		XLogReadProcessError(errinfo);
+
+	/* See logical_read_xlog_page(). */
+	XLByteToSeg(startptr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->ws_file >= 0)
+		{
+			close(sendSeg->ws_file);
+			sendSeg->ws_file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index b79208cd73..c97376101c 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -281,6 +281,46 @@ identify_target_directory(char *directory, char *fname)
 	return NULL;				/* not reached */
 }
 
+static void
+WALDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli_p, int *file_p,
+				   WALOpenSegment *seg, WALSegmentContext *segcxt)
+{
+	TimeLineID	tli = *tli_p;
+	char		fname[MAXPGPATH];
+	int			tries;
+
+	XLogFileName(fname, tli, nextSegNo, segcxt->ws_segsize);
+
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
+	{
+		*file_p = open_file_in_directory(segcxt->ws_dir, fname);
+		if (*file_p >= 0)
+			break;
+		if (errno == ENOENT)
+		{
+			int			save_errno = errno;
+
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
+			errno = save_errno;
+			continue;
+		}
+		/* Any other error, fall through and fail */
+		break;
+	}
+
+	if (*file_p < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
+}
+
 /*
  * Read count bytes from a segment file in the specified directory, for the
  * given timeline, containing the specified record pointer; store the data in
@@ -412,6 +452,7 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
+	XLogReadError *errinfo;
 
 	if (private->endptr != InvalidXLogRecPtr)
 	{
@@ -426,8 +467,23 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(state->segcxt.ws_dir, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if ((errinfo = XLogRead(readBuff, targetPagePtr, count, &private->timeline,
+							&state->seg, &state->segcxt, WALDumpOpenSegment)) != NULL)
+	{
+		WALOpenSegment *seg = errinfo->seg;
+		char		fname[MAXPGPATH];
+
+		XLogFileName(fname, seg->ws_tli, seg->ws_segno,
+					 state->segcxt.ws_segsize);
+
+		if (errno != 0)
+			fatal_error("could not read from log file %s, offset %u, length %zu: %s",
+						fname, seg->ws_off, (Size) errinfo->reqbytes,
+						strerror(errinfo->read_errno));
+		else
+			fatal_error("could not read from log file %s, offset %u: length: %zu",
+						fname, seg->ws_off, (Size) errinfo->reqbytes);
+	}
 
 	return count;
 }
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1bbee386e8..bf4d105de3 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -218,6 +218,31 @@ extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 extern void XLogReaderFree(XLogReaderState *state);
 
 /* Initialize supporting structures */
+/*
+ * Callback to open the specified WAL segment for reading.
+ *
+ * "nextSegNo" is the number of the segment to be opened.
+ *
+ * "tli_p" is an input/output argument. If *tli_p is valid, it's the timeline
+ * the new segment should be in. If *tli_p==InvalidTimeLineID, the callback
+ * needs to determine the timeline itself and put the result into *tli_p.
+ *
+ * "file_p" points to an address the segment file descriptor should be stored
+ * at.
+ *
+ * "seg" provides information on the currently open segment. The callback is
+ * not supposed to change this info.
+ *
+ * "segcxt" is additional information about the segment, which logically does
+ * not fit into "seg".
+ *
+ * BasicOpenFile() is the preferred way to open the segment file in backend
+ * code, whereas open(2) should be used in frontend.
+ */
+typedef void (*WALSegmentOpen) (XLogSegNo nextSegNo, TimeLineID *tli_p,
+								int *file_p, WALOpenSegment *seg,
+								WALSegmentContext *segcxt);
+
 extern void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 							   int segsize, const char *waldir);
 
@@ -232,6 +257,28 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+/*
+ * Error information that both backend and frontend caller can process.
+ *
+ * XXX Should the name be WALReadError? If so, we probably need to rename
+ * XLogRead() and XLogReadProcessError() too.
+ */
+typedef struct XLogReadError
+{
+	int			read_errno;		/* errno set by the last read(). */
+	int			readbytes;		/* Bytes read by the last read(). */
+	int			reqbytes;		/* Bytes requested to be read. */
+	WALOpenSegment *seg;		/* Segment we tried to read from. */
+} XLogReadError;
+
+extern XLogReadError *XLogRead(char *buf, XLogRecPtr startptr,
+							   Size count, TimeLineID *tli_p,
+							   WALOpenSegment *seg, WALSegmentContext *segcxt,
+							   WALSegmentOpen openSegment);
+#ifndef FRONTEND
+void		XLogReadProcessError(XLogReadError *errinfo);
+#endif
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
-- 
2.20.1

v07-0006-Remove-the-old-implemenations-of-XLogRead.patchtext/x-diffDownload

From cfc22166e79c9c68108e06f00a6c4125870d1c3a Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Thu, 26 Sep 2019 13:51:53 +0200
Subject: [PATCH 2/2] Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the WALSegmentOpen callback) is added
nearby at the same time (the addition and removal of code can get mixed in the
diff).
---
 src/backend/access/transam/xlogutils.c | 122 ----------------
 src/backend/replication/walsender.c    | 188 -------------------------
 src/bin/pg_waldump/pg_waldump.c        | 122 ----------------
 3 files changed, 432 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 09d42d3112..002b8f72a9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -639,128 +639,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dcb84693fc..d7df72fb87 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -253,9 +253,6 @@ static void WalSndSegmentOpen(XLogSegNo nextSegNo, TimeLineID *tli_p,
 							  WALSegmentContext *segcxt);
 
 
-static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
-						XLogRecPtr startptr, Size count);
-
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -2375,191 +2372,6 @@ WalSndKill(int code, Datum arg)
 	SpinLockRelease(&walsnd->mutex);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
-
-		if (sendSeg->ws_file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->ws_file >= 0)
-				close(sendSeg->ws_file);
-
-			XLByteToSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->ws_tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
-				if (sendSeg->ws_segno == endSegNo)
-					sendSeg->ws_tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->ws_tli, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			sendSeg->ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->ws_file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->ws_off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->ws_off != startoff)
-		{
-			if (lseek(sendSeg->ws_file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-								startoff)));
-			sendSeg->ws_off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segcxt->ws_segsize - startoff))
-			segbytes = segcxt->ws_segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->ws_file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->ws_off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, segcxt->ws_segsize);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->ws_file >= 0)
-		{
-			close(sendSeg->ws_file);
-			sendSeg->ws_file = -1;
-
-			goto retry;
-		}
-	}
-}
-
 /*
  * Callback for XLogRead() to open the next segment.
  */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index c97376101c..57dd7df56f 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -321,128 +321,6 @@ WALDumpOpenSegment(XLogSegNo nextSegNo, TimeLineID *tli_p, int *file_p,
 					fname, strerror(errno));
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * XLogReader read_page callback
  */
-- 
2.20.1

#26

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Antonin Houska (#25)

Re: Attempt to consolidate reading of XLOG page

On 2019-Sep-26, Antonin Houska wrote:

One comment on the remaining part of the series:

Before this refactoring, the walsender.c:XLogRead() function contained these
lines

/*
* After reading into the buffer, check that what we read was valid. We do
* this after reading, because even though the segment was present when we
* opened it, it might get recycled or removed while we read it. The
* read() succeeds in that case, but the data we tried to read might
* already have been overwritten with new WAL records.
*/
XLByteToSeg(startptr, segno, segcxt->ws_segsize);
CheckXLogRemoved(segno, ThisTimeLineID);

but they don't fit into the new, generic implementation, so I copied these
lines to the two places right after the call of the new XLogRead(). However I
was not sure if ThisTimeLineID was ever correct here. It seems the original
walsender.c:XLogRead() implementation did not update ThisTimeLineID (and
therefore neither the new callback WalSndSegmentOpen() does), so both
logical_read_xlog_page() and XLogSendPhysical() could read the data from
another (historic) timeline. I think we should check the segment we really
read data from:

CheckXLogRemoved(segno, sendSeg->ws_tli);

Hmm, okay. I hope we can get rid of ThisTimeLineID one day.

You placed the errinfo in XLogRead's stack rather than its callers' ...
I don't think that works, because as soon as XLogRead returns that
memory is no longer guaranteed to exist. You need to allocate the
struct in the callers stacks and pass its address to XLogRead. XLogRead
can return NULL if everything's okay or the pointer to the errinfo
struct.

I've been wondering if it's really necessary to pass 'seg' to the
openSegment() callback. Only walsender wants that, and it seems ...
weird. Maybe that's not something for this patch series to fix, but it
would be good to find a more decent way to do the TLI switch at some
point.

+			/*
+			 * If the function is called by the XLOG reader, the reader will
+			 * eventually set both "ws_segno" and "ws_off", however the XLOG
+			 * reader is not necessarily involved. Furthermore, we need to set
+			 * the current values for this function to work.
+			 */
+			seg->ws_segno = nextSegNo;
+			seg->ws_off = 0;

Why do we leave this responsibility to ReadPageInternal? Wouldn't it
make more sense to leave XLogRead be always responsible for setting
these correctly, and remove those lines from ReadPageInternal? (BTW "is
called by the XLOG reader" is a bit strange in code that appears in
xlogreader.c).

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#27

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Alvaro Herrera (#26)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Sep-26, Antonin Houska wrote:

You placed the errinfo in XLogRead's stack rather than its callers' ...
I don't think that works, because as soon as XLogRead returns that
memory is no longer guaranteed to exist.

I was aware of this problem, therefore I defined the field as static:

+XLogReadError *
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
+                WALOpenSegment *seg, WALSegmentContext *segcxt,
+                WALSegmentOpen openSegment)
+{
+       char       *p;
+       XLogRecPtr      recptr;
+       Size            nbytes;
+       static XLogReadError errinfo;

You need to allocate the struct in the callers stacks and pass its address
to XLogRead. XLogRead can return NULL if everything's okay or the pointer
to the errinfo struct.

I didn't choose this approach because that would add one more argument to the
function.

I've been wondering if it's really necessary to pass 'seg' to the
openSegment() callback. Only walsender wants that, and it seems ...
weird. Maybe that's not something for this patch series to fix, but it
would be good to find a more decent way to do the TLI switch at some
point.

Good point. Since walsender.c already has the "sendSeg" global variable, maybe
we can let WalSndSegmentOpen() use this one, and remove the "seg" argument
from the callback.

+			/*
+			 * If the function is called by the XLOG reader, the reader will
+			 * eventually set both "ws_segno" and "ws_off", however the XLOG
+			 * reader is not necessarily involved. Furthermore, we need to set
+			 * the current values for this function to work.
+			 */
+			seg->ws_segno = nextSegNo;
+			seg->ws_off = 0;
Why do we leave this responsibility to ReadPageInternal? Wouldn't it
make more sense to leave XLogRead be always responsible for setting
these correctly, and remove those lines from ReadPageInternal?

I think there's no rule that ReadPageInternal() must use XLogRead(). If we do
what you suggest, we need make this responsibility documented. I'll consider
that.

(BTW "is called by the XLOG reader" is a bit strange in code that appears in
xlogreader.c).

ok, "called by XLogPageReadCB callback" would be more accurate. Not sure if
we'll eventually need this phrase in the comment at all.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#28

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Antonin Houska (#27)

Re: Attempt to consolidate reading of XLOG page

On 2019-Sep-27, Antonin Houska wrote:

You placed the errinfo in XLogRead's stack rather than its callers' ...
I don't think that works, because as soon as XLogRead returns that
memory is no longer guaranteed to exist.

I was aware of this problem, therefore I defined the field as static:
+XLogReadError *
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
+                WALOpenSegment *seg, WALSegmentContext *segcxt,
+                WALSegmentOpen openSegment)
+{
+       char       *p;
+       XLogRecPtr      recptr;
+       Size            nbytes;
+       static XLogReadError errinfo;

I see.

You need to allocate the struct in the callers stacks and pass its address
to XLogRead. XLogRead can return NULL if everything's okay or the pointer
to the errinfo struct.

I didn't choose this approach because that would add one more argument to the
function.

Yeah, the signature does seem a bit unwieldy. But I wonder if that's
too terrible a problem, considering that this code is incurring a bunch
of syscalls in the best case anyway.

BTW that tli_p business to the openSegment callback is horribly
inconsistent. Some callers accept a NULL tli_p, others will outright
crash, even though the API docs say that the callback must determine the
timeline. This is made more complicated by us having the TLI in "seg"
also. Unless I misread, the problem is again that the walsender code is
doing nasty stuff with globals (endSegNo). As a very minor stylistic
point, we prefer to have out params at the end of the signature.

Why do we leave this responsibility to ReadPageInternal? Wouldn't it
make more sense to leave XLogRead be always responsible for setting
these correctly, and remove those lines from ReadPageInternal?

I think there's no rule that ReadPageInternal() must use XLogRead(). If we do
what you suggest, we need make this responsibility documented. I'll consider
that.

Hmm. Thanks.

(BTW "is called by the XLOG reader" is a bit strange in code that appears in
xlogreader.c).

ok, "called by XLogPageReadCB callback" would be more accurate. Not sure if
we'll eventually need this phrase in the comment at all.

I think that would be slightly clearer. But if we can force this code
into actually making sense, that would be much better.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#29

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Alvaro Herrera (#28)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2019-Sep-27, Antonin Houska wrote:

You placed the errinfo in XLogRead's stack rather than its callers' ...
I don't think that works, because as soon as XLogRead returns that
memory is no longer guaranteed to exist.

I was aware of this problem, therefore I defined the field as static:

+XLogReadError *
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
+                WALOpenSegment *seg, WALSegmentContext *segcxt,
+                WALSegmentOpen openSegment)
+{
+       char       *p;
+       XLogRecPtr      recptr;
+       Size            nbytes;
+       static XLogReadError errinfo;

I see.

That seems like an absolutely terrible "fix". We don't really want
XLogRead to be defined in a way that forces it to be non-reentrant do we?

regards, tom lane

#30

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Alvaro Herrera (#28)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

BTW that tli_p business to the openSegment callback is horribly
inconsistent. Some callers accept a NULL tli_p, others will outright
crash, even though the API docs say that the callback must determine the
timeline. This is made more complicated by us having the TLI in "seg"
also. Unless I misread, the problem is again that the walsender code is
doing nasty stuff with globals (endSegNo). As a very minor stylistic
point, we prefer to have out params at the end of the signature.

XLogRead() tests for NULL so it should not crash but I don't insist on doing
it this way. XLogRead() actually does not have to care whether the "open
segment callback" determines the TLI or not, so it (XLogRead) can always
receive a valid pointer to seg.ws_tli. However that in turn implies that
XLogRead() does not need the "tli" argument at all.

Why do we leave this responsibility to ReadPageInternal? Wouldn't it
make more sense to leave XLogRead be always responsible for setting
these correctly, and remove those lines from ReadPageInternal?

I think there's no rule that ReadPageInternal() must use XLogRead(). If we do
what you suggest, we need make this responsibility documented. I'll consider
that.

I think now we should not add any responsibility to XLogPageReadCB or its
subroutines because some extensions might already have their implementation of
XLogPageReadCB w/o XLogRead, and this change would break them.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#31

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Tom Lane (#29)

Re: Attempt to consolidate reading of XLOG page

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2019-Sep-27, Antonin Houska wrote:

You placed the errinfo in XLogRead's stack rather than its callers' ...
I don't think that works, because as soon as XLogRead returns that
memory is no longer guaranteed to exist.
I was aware of this problem, therefore I defined the field as static:
+XLogReadError *
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
+                WALOpenSegment *seg, WALSegmentContext *segcxt,
+                WALSegmentOpen openSegment)
+{
+       char       *p;
+       XLogRecPtr      recptr;
+       Size            nbytes;
+       static XLogReadError errinfo;
I see.

That seems like an absolutely terrible "fix". We don't really want
XLogRead to be defined in a way that forces it to be non-reentrant do we?

Good point. I forgot that the XLOG reader can be used by frontends, so thread
safety is important here.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#32

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Antonin Houska (#30)

Re: Attempt to consolidate reading of XLOG page

Antonin Houska <ah@cybertec.at> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

BTW that tli_p business to the openSegment callback is horribly
inconsistent. Some callers accept a NULL tli_p, others will outright
crash, even though the API docs say that the callback must determine the
timeline. This is made more complicated by us having the TLI in "seg"
also. Unless I misread, the problem is again that the walsender code is
doing nasty stuff with globals (endSegNo). As a very minor stylistic
point, we prefer to have out params at the end of the signature.

XLogRead() tests for NULL so it should not crash but I don't insist on doing
it this way. XLogRead() actually does not have to care whether the "open
segment callback" determines the TLI or not, so it (XLogRead) can always
receive a valid pointer to seg.ws_tli.

This is actually wrong - seg.ws_tli is not always the correct value to
pass. If seg.ws_tli refers to the segment from which data was read last time,
then XLogRead() still needs a separate argument to specify from which TLI the
current call should read. If these two differ, new file needs to be opened.

The problem of walsender.c is that its implementation of XLogRead() does not
care about the TLI of the previous read. If the behavior of the new, generic
implementation should be exactly the same, we need to tell XLogRead() that in
some cases it also should not compare the current TLI to the previous
one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
earlier.

Another approach is to add a boolean argument "check_tli", but that still
forces caller to pass some (random) value of the tli. The concept of
InvalidTimeLineID seems to me less disturbing than this.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#33

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 6 years ago

In reply to: Antonin Houska (#32)

Re: Attempt to consolidate reading of XLOG page

Hello,

At Sat, 28 Sep 2019 15:00:35 +0200, Antonin Houska <ah@cybertec.at> wrote in <9236.1569675635@antos>

Antonin Houska <ah@cybertec.at> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

BTW that tli_p business to the openSegment callback is horribly
inconsistent. Some callers accept a NULL tli_p, others will outright
crash, even though the API docs say that the callback must determine the
timeline. This is made more complicated by us having the TLI in "seg"
also. Unless I misread, the problem is again that the walsender code is
doing nasty stuff with globals (endSegNo). As a very minor stylistic
point, we prefer to have out params at the end of the signature.

XLogRead() tests for NULL so it should not crash but I don't insist on doing
it this way. XLogRead() actually does not have to care whether the "open
segment callback" determines the TLI or not, so it (XLogRead) can always
receive a valid pointer to seg.ws_tli.

This is actually wrong - seg.ws_tli is not always the correct value to
pass. If seg.ws_tli refers to the segment from which data was read last time,
then XLogRead() still needs a separate argument to specify from which TLI the
current call should read. If these two differ, new file needs to be opened.

openSegment represents the file *currently* opened. XLogRead
needs the TLI *to be* opened. If they are different, as far as
wal logical wal sender and pg_waldump is concerned, XLogRead
switches to the new TLI and the new TLI is set to
openSegment.ws_tli. So, it seems to me that the parameter
doesn't need to be inout? It is enough that it is an "in"
parameter.

The problem of walsender.c is that its implementation of XLogRead() does not
care about the TLI of the previous read. If the behavior of the new, generic
implementation should be exactly the same, we need to tell XLogRead() that in
some cases it also should not compare the current TLI to the previous
one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
earlier.

Physical wal sender doesn't switch TLI. So I don't think the
behavior doesn't harm (or doesn't fire). openSegment holds the
TLI set at the first call. (Even if future wal sender switches
TLI, the behavior should be needed.)

Another approach is to add a boolean argument "check_tli", but that still
forces caller to pass some (random) value of the tli. The concept of
InvalidTimeLineID seems to me less disturbing than this.

So I think InvalidTimeLineID is not needed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#34

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Kyotaro Horiguchi (#33)

Re: Attempt to consolidate reading of XLOG page

Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Sat, 28 Sep 2019 15:00:35 +0200, Antonin Houska <ah@cybertec.at> wrote in <9236.1569675635@antos>

Antonin Houska <ah@cybertec.at> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

BTW that tli_p business to the openSegment callback is horribly
inconsistent. Some callers accept a NULL tli_p, others will outright
crash, even though the API docs say that the callback must determine the
timeline. This is made more complicated by us having the TLI in "seg"
also. Unless I misread, the problem is again that the walsender code is
doing nasty stuff with globals (endSegNo). As a very minor stylistic
point, we prefer to have out params at the end of the signature.

XLogRead() tests for NULL so it should not crash but I don't insist on doing
it this way. XLogRead() actually does not have to care whether the "open
segment callback" determines the TLI or not, so it (XLogRead) can always
receive a valid pointer to seg.ws_tli.

This is actually wrong - seg.ws_tli is not always the correct value to
pass. If seg.ws_tli refers to the segment from which data was read last time,
then XLogRead() still needs a separate argument to specify from which TLI the
current call should read. If these two differ, new file needs to be opened.

openSegment represents the file *currently* opened.

I suppose you mean the "seg" argument.

XLogRead needs the TLI *to be* opened. If they are different, as far as wal
logical wal sender and pg_waldump is concerned, XLogRead switches to the new
TLI and the new TLI is set to openSegment.ws_tli.

Yes, it works in these cases.

So, it seems to me that the parameter doesn't need to be inout? It is enough
that it is an "in" parameter.

I did consider "TimeLineID *tli_p" to be "in" parameter in the last patch
version. The reason I used pointer was the special meaning of the NULL value:
if NULL is passed, then the timeline should be ignored (because of the other
cases, see below).

The problem of walsender.c is that its implementation of XLogRead() does not
care about the TLI of the previous read. If the behavior of the new, generic
implementation should be exactly the same, we need to tell XLogRead() that in
some cases it also should not compare the current TLI to the previous
one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
earlier.

Physical wal sender doesn't switch TLI. So I don't think the
behavior doesn't harm (or doesn't fire). openSegment holds the
TLI set at the first call. (Even if future wal sender switches
TLI, the behavior should be needed.)

Note that walsender.c:XLogRead() has no TLI argument, however the XLogRead()
introduced by the patch does have one. What should be passed for TLI to the
new implementation if it's called from walsender.c? If the check for a segment
change looks like this (here "tli" is the argument representing the desired
TLI)

if (seg->ws_file < 0 ||
!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
tli != seg->ws_tli)
{
XLogSegNo nextSegNo;

/* Switch to another logfile segment */
if (seg->ws_file >= 0)
close(seg->ws_file);

then any valid TLI can result in accidental closing of the current segment
file. Since this is only refactoring patch, we should not allow such a change
of behavior even if it seems that the same segment will be reopened
immediately.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#35

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 6 years ago

In reply to: Antonin Houska (#34)

Re: Attempt to consolidate reading of XLOG page

At Tue, 01 Oct 2019 08:28:03 +0200, Antonin Houska <ah@cybertec.at> wrote in <2188.1569911283@antos>

Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

XLogRead() tests for NULL so it should not crash but I don't insist on doing
it this way. XLogRead() actually does not have to care whether the "open
segment callback" determines the TLI or not, so it (XLogRead) can always
receive a valid pointer to seg.ws_tli.

This is actually wrong - seg.ws_tli is not always the correct value to
pass. If seg.ws_tli refers to the segment from which data was read last time,
then XLogRead() still needs a separate argument to specify from which TLI the
current call should read. If these two differ, new file needs to be opened.

openSegment represents the file *currently* opened.

I suppose you mean the "seg" argument.

XLogRead needs the TLI *to be* opened. If they are different, as far as wal
logical wal sender and pg_waldump is concerned, XLogRead switches to the new
TLI and the new TLI is set to openSegment.ws_tli.

Yes, it works in these cases.

So, it seems to me that the parameter doesn't need to be inout? It is enough
that it is an "in" parameter.

I did consider "TimeLineID *tli_p" to be "in" parameter in the last patch
version. The reason I used pointer was the special meaning of the NULL value:
if NULL is passed, then the timeline should be ignored (because of the other
cases, see below).

Understood.

The problem of walsender.c is that its implementation of XLogRead() does not
care about the TLI of the previous read. If the behavior of the new, generic
implementation should be exactly the same, we need to tell XLogRead() that in
some cases it also should not compare the current TLI to the previous
one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
earlier.

Physical wal sender doesn't switch TLI. So I don't think the
behavior doesn't harm (or doesn't fire). openSegment holds the
TLI set at the first call. (Even if future wal sender switches
TLI, the behavior should be needed.)

Note that walsender.c:XLogRead() has no TLI argument, however the XLogRead()
introduced by the patch does have one. What should be passed for TLI to the
new implementation if it's called from walsender.c? If the check for a segment
change looks like this (here "tli" is the argument representing the desired
TLI)

TLI is mandatory to generate a wal file name so it must be passed
to the function anyways. In the current code it is sendTimeLine
for the walsender.c:XLogRead(). logical_read_xlog_page sets the
variable very time immediately before calling
XLogRead(). CreateReplicationSlot and StartReplication set the
variable to desired TLI immediately before calling and once it is
set by StartReplication, it is not changed by XLogSendPhysical
and wal sender ends at the end of the current timeline. In the
XLogRead, the value is copied to sendSeg->ws_tli when the file
for the new timeline is read.

if (seg->ws_file < 0 ||
!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
tli != seg->ws_tli)
{
XLogSegNo nextSegNo;

/* Switch to another logfile segment */
if (seg->ws_file >= 0)
close(seg->ws_file);

then any valid TLI can result in accidental closing of the current segment
file. Since this is only refactoring patch, we should not allow such a change
of behavior even if it seems that the same segment will be reopened
immediately.

Mmm. ws_file must be -1 in the case? tli != seg->ws_tli is true
but seg->ws_file < 0 is also always true at the time. In other
words, the "tli != seg->ws_tli" is not even evaluated.

If wal sender had an open file (ws_file >= 0) and the new TLI is
different from ws_tli, it would be the sign of a serious bug.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#36

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Kyotaro Horiguchi (#35)

Re: Attempt to consolidate reading of XLOG page

Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Tue, 01 Oct 2019 08:28:03 +0200, Antonin Houska <ah@cybertec.at> wrote in <2188.1569911283@antos>

Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

The problem of walsender.c is that its implementation of XLogRead() does not
care about the TLI of the previous read. If the behavior of the new, generic
implementation should be exactly the same, we need to tell XLogRead() that in
some cases it also should not compare the current TLI to the previous
one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
earlier.

Physical wal sender doesn't switch TLI. So I don't think the
behavior doesn't harm (or doesn't fire). openSegment holds the
TLI set at the first call. (Even if future wal sender switches
TLI, the behavior should be needed.)

Note that walsender.c:XLogRead() has no TLI argument, however the XLogRead()
introduced by the patch does have one. What should be passed for TLI to the
new implementation if it's called from walsender.c? I f the check for a segment
change looks like this (here "tli" is the argument representing the desired
TLI)

TLI is mandatory to generate a wal file name so it must be passed
to the function anyways. In the current code it is sendTimeLine
for the walsender.c:XLogRead(). logical_read_xlog_page sets the
variable very time immediately before calling
XLogRead(). CreateReplicationSlot and StartReplication set the
variable to desired TLI immediately before calling and once it is
set by StartReplication, it is not changed by XLogSendPhysical
and wal sender ends at the end of the current timeline. In the
XLogRead, the value is copied to sendSeg->ws_tli when the file
for the new timeline is read.

Are you saying that we should pass sendTimeLine to XLogRead()? I think it's
not always correct because sendSeg->ws_tli is sometimes assigned
sendTimeLineNextTLI, so the test "tli != seg->ws_tli" in

if (seg->ws_file < 0 ||
!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
tli != seg->ws_tli)
{
XLogSegNo nextSegNo;

could pass occasionally.

Mmm. ws_file must be -1 in the case? tli != seg->ws_tli is true
but seg->ws_file < 0 is also always true at the time. In other
words, the "tli != seg->ws_tli" is not even evaluated.

If wal sender had an open file (ws_file >= 0) and the new TLI is
different from ws_tli, it would be the sign of a serious bug.

So we can probably pass ws_tli as the "new TLI" when calling the new
XLogRead() from walsender.c. Is that what you try to say? I need to think
about it more but it sounds like a good idea.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#37

Antonin Houska

ah@cybertec.at

over 6 years ago

In reply to: Alvaro Herrera (#28)

2 attachment(s)

Re: Attempt to consolidate reading of XLOG page

This is the next version.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v08-0005-Use-only-xlogreader.c-XLogRead.patchtext/x-diffDownload

From 01f5cc8b0c1e6133e16242ec99957a78551058a7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 4 Oct 2019 12:07:22 +0200
Subject: [PATCH 1/2] Use only xlogreader.c:XLogRead()

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.
---
 src/backend/access/transam/xlogreader.c | 141 ++++++++++++++++++++++-
 src/backend/access/transam/xlogutils.c  |  61 ++++++++--
 src/backend/replication/walsender.c     | 143 +++++++++++++++++++++++-
 src/bin/pg_waldump/pg_waldump.c         |  60 +++++++++-
 src/include/access/xlogreader.h         |  42 +++++++
 5 files changed, 428 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c8b0d2303d..3e2167ca5a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -27,6 +29,7 @@
 
 #ifndef FRONTEND
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -626,7 +629,13 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))
 		goto err;
 
-	/* update read state information */
+	/*
+	 * Update read state information.
+	 *
+	 * Note that XLogRead(), if used, should have updated the "seg" too for
+	 * its own reasons, however we cannot rely on ->read_page() to call
+	 * XLogRead().
+	 */
 	state->seg.ws_segno = targetSegNo;
 	state->seg.ws_off = targetPageOff;
 	state->readLen = readLen;
@@ -1016,6 +1025,136 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf',
+ * starting at location 'startptr'. 'seg' is the last segment used,
+ * 'openSegment' is a callback to opens the next segment if needed and
+ * 'segcxt' is additional segment info that does not fit into 'seg'.
+ *
+ * 'errinfo' should point to XLogReadError structure which will receive error
+ * details in case the read fails.
+ *
+ * Returns true if succeeded, false if failed.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+bool
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
+		 WALOpenSegment *seg, WALSegmentContext *segcxt,
+		 WALSegmentOpen openSegment, XLogReadError *errinfo)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		int			segbytes;
+		int			readbytes;
+
+		seg->ws_off = XLogSegmentOffset(recptr, segcxt->ws_segsize);
+
+		if (seg->ws_file < 0 ||
+			!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
+			tli != seg->ws_tli)
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (seg->ws_file >= 0)
+				close(seg->ws_file);
+
+			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, segcxt, &tli, &seg->ws_file);
+
+			/* Update the current segment info. */
+			seg->ws_tli = tli;
+			seg->ws_segno = nextSegNo;
+			seg->ws_off = 0;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (segcxt->ws_segsize - seg->ws_off))
+			segbytes = segcxt->ws_segsize - seg->ws_off;
+		else
+			segbytes = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/*
+		 * Failure to read the data does not necessarily imply non-zero errno.
+		 * Set it to zero so that caller can distinguish the failure that does
+		 * not affect errno.
+		 */
+		errno = 0;
+
+		readbytes = pg_pread(seg->ws_file, p, segbytes, seg->ws_off);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+		{
+			errinfo->read_errno = errno;
+			errinfo->readbytes = readbytes;
+			errinfo->reqbytes = segbytes;
+			errinfo->seg = seg;
+			return false;
+		}
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+
+		/* Update the current segment info. */
+		seg->ws_off += readbytes;
+	}
+
+	return true;
+}
+
+#ifndef FRONTEND
+/*
+ * Backend-specific convenience code to handle read errors encountered by
+ * XLogRead().
+ */
+void
+XLogReadProcessError(XLogReadError *errinfo)
+{
+	WALOpenSegment *seg = errinfo->seg;
+
+	if (errinfo->readbytes < 0)
+	{
+		errno = errinfo->read_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						XLogFileNameP(seg->ws_tli, seg->ws_segno),
+						seg->ws_off, (Size) errinfo->reqbytes)));
+	}
+	else
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: length %zu",
+						XLogFileNameP(seg->ws_tli, seg->ws_segno),
+						seg->ws_off,
+						(Size) errinfo->reqbytes)));
+	}
+}
+#endif
+
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba75d..007974ea99 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -896,6 +896,39 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+read_local_xlog_page_segment_open(XLogSegNo nextSegNo,
+								  WALSegmentContext *segcxt,
+								  TimeLineID *tli_p,
+								  int *file_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			file;
+
+	XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+	file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (file < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	*file_p = file;
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -913,7 +946,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 {
 	XLogRecPtr	read_upto,
 				loc;
+	TimeLineID	tli;
 	int			count;
+	XLogReadError errinfo;
 
 	loc = targetPagePtr + reqLen;
 
@@ -932,7 +967,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		state->seg.ws_tli = ThisTimeLineID;
+		tli = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -982,14 +1017,14 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = state->currTLIValidUntil;
 
 			/*
-			 * Setting ws_tli to our wanted record's TLI is slightly wrong;
-			 * the page might begin on an older timeline if it contains a
-			 * timeline switch, since its xlog segment will have been copied
-			 * from the prior timeline. This is pretty harmless though, as
-			 * nothing cares so long as the timeline doesn't go backwards.  We
-			 * should read the page header instead; FIXME someday.
+			 * Setting tli to our wanted record's TLI is slightly wrong; the
+			 * page might begin on an older timeline if it contains a timeline
+			 * switch, since its xlog segment will have been copied from the
+			 * prior timeline. This is pretty harmless though, as nothing
+			 * cares so long as the timeline doesn't go backwards.  We should
+			 * read the page header instead; FIXME someday.
 			 */
-			state->seg.ws_tli = state->currTLI;
+			tli = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1020,8 +1055,10 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
-			 XLOG_BLCKSZ);
+	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
+				  &state->segcxt, read_local_xlog_page_segment_open,
+				  &errinfo))
+		XLogReadProcessError(&errinfo);
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index eb4a98cc91..b21b143f99 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,9 +248,13 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
+static void WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+							  TimeLineID *tli_p, int *file_p);
 
 
+static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
+						XLogRecPtr startptr, Size count);
+
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -766,6 +770,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 {
 	XLogRecPtr	flushptr;
 	int			count;
+	XLogReadError errinfo;
+	XLogSegNo	segno;
 
 	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
 	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
@@ -786,7 +792,27 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(sendCxt, cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if (!XLogRead(cur_page,
+				  targetPagePtr,
+				  XLOG_BLCKSZ,
+				  sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				  sendSeg,
+				  sendCxt,
+				  WalSndSegmentOpen,
+				  &errinfo))
+		XLogReadProcessError(&errinfo);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(targetPagePtr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
 
 	return count;
 }
@@ -2363,7 +2389,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2536,6 +2562,71 @@ retry:
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+void
+WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				  TimeLineID *tli_p, int *file_p)
+{
+	char		path[MAXPGPATH];
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	*tli_p = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
+		if (sendSeg->ws_segno == endSegNo)
+			*tli_p = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, *tli_p, nextSegNo, segcxt->ws_segsize);
+	*file_p = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (*file_p < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(*tli_p, nextSegNo))));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+}
+
 /*
  * Send out the WAL in its normal physical/stored form.
  *
@@ -2553,6 +2644,8 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
+	XLogReadError errinfo;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2768,7 +2861,49 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(sendCxt, &output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if (!XLogRead(&output_message.data[output_message.len],
+				  startptr,
+				  nbytes,
+				  sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				  sendSeg,
+				  sendCxt,
+				  WalSndSegmentOpen,
+				  &errinfo))
+		XLogReadProcessError(&errinfo);
+
+	/* See logical_read_xlog_page(). */
+	XLByteToSeg(startptr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->ws_file >= 0)
+		{
+			close(sendSeg->ws_file);
+			sendSeg->ws_file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index b79208cd73..4c49d1acf4 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -281,6 +281,46 @@ identify_target_directory(char *directory, char *fname)
 	return NULL;				/* not reached */
 }
 
+static void
+WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				   TimeLineID *tli_p, int *file_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		fname[MAXPGPATH];
+	int			tries;
+
+	XLogFileName(fname, tli, nextSegNo, segcxt->ws_segsize);
+
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
+	{
+		*file_p = open_file_in_directory(segcxt->ws_dir, fname);
+		if (*file_p >= 0)
+			break;
+		if (errno == ENOENT)
+		{
+			int			save_errno = errno;
+
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
+			errno = save_errno;
+			continue;
+		}
+		/* Any other error, fall through and fail */
+		break;
+	}
+
+	if (*file_p < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
+}
+
 /*
  * Read count bytes from a segment file in the specified directory, for the
  * given timeline, containing the specified record pointer; store the data in
@@ -412,6 +452,7 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
+	XLogReadError errinfo;
 
 	if (private->endptr != InvalidXLogRecPtr)
 	{
@@ -426,8 +467,23 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(state->segcxt.ws_dir, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if (!XLogRead(readBuff, targetPagePtr, count, private->timeline,
+				  &state->seg, &state->segcxt, WALDumpOpenSegment, &errinfo))
+	{
+		WALOpenSegment *seg = errinfo.seg;
+		char		fname[MAXPGPATH];
+
+		XLogFileName(fname, seg->ws_tli, seg->ws_segno,
+					 state->segcxt.ws_segsize);
+
+		if (errno != 0)
+			fatal_error("could not read from log file %s, offset %u, length %zu: %s",
+						fname, seg->ws_off, (Size) errinfo.reqbytes,
+						strerror(errinfo.read_errno));
+		else
+			fatal_error("could not read from log file %s, offset %u: length: %zu",
+						fname, seg->ws_off, (Size) errinfo.reqbytes);
+	}
 
 	return count;
 }
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1bbee386e8..f066b6255d 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -218,6 +218,26 @@ extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 extern void XLogReaderFree(XLogReaderState *state);
 
 /* Initialize supporting structures */
+/*
+ * Callback to open the specified WAL segment for reading.
+ *
+ * "nextSegNo" is the number of the segment to be opened.
+ *
+ * "segcxt" is additional information about the segment.
+ *
+ * "tli_p" is an input/output argument. XLogRead() uses it to pass the
+ * timeline in which the new segment should be found, but the callback can use
+ * it to return the TLI that it actually opened.
+ *
+ * "file_p" points to an address the segment file descriptor should be stored
+ * at.
+ *
+ * BasicOpenFile() is the preferred way to open the segment file in backend
+ * code, whereas open(2) should be used in frontend.
+ */
+typedef void (*WALSegmentOpen) (XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+								TimeLineID *tli_p, int *file_p);
+
 extern void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 							   int segsize, const char *waldir);
 
@@ -232,6 +252,28 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+/*
+ * Error information that both backend and frontend caller can process.
+ *
+ * XXX Should the name be WALReadError? If so, we probably need to rename
+ * XLogRead() and XLogReadProcessError() too.
+ */
+typedef struct XLogReadError
+{
+	int			read_errno;		/* errno set by the last read(). */
+	int			readbytes;		/* Bytes read by the last read(). */
+	int			reqbytes;		/* Bytes requested to be read. */
+	WALOpenSegment *seg;		/* Segment we tried to read from. */
+} XLogReadError;
+
+extern bool XLogRead(char *buf, XLogRecPtr startptr, Size count,
+					 TimeLineID tli, WALOpenSegment *seg,
+					 WALSegmentContext *segcxt, WALSegmentOpen openSegment,
+					 XLogReadError *errinfo);
+#ifndef FRONTEND
+void		XLogReadProcessError(XLogReadError *errinfo);
+#endif
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
-- 
2.20.1

v08-0006-Remove-the-old-implemenations-of-XLogRead.patchtext/x-diffDownload

From c7d4d390dc12978dd947dfe97691fbfbe1209e12 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 4 Oct 2019 12:07:22 +0200
Subject: [PATCH 2/2] Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the WALSegmentOpen callback) is added
nearby at the same time (the addition and removal of code can get mixed in the
diff).
---
 src/backend/access/transam/xlogutils.c | 122 ----------------
 src/backend/replication/walsender.c    | 188 -------------------------
 src/bin/pg_waldump/pg_waldump.c        | 122 ----------------
 3 files changed, 432 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 007974ea99..0211a68640 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -639,128 +639,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b21b143f99..76a5477389 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -252,9 +252,6 @@ static void WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 							  TimeLineID *tli_p, int *file_p);
 
 
-static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
-						XLogRecPtr startptr, Size count);
-
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -2377,191 +2374,6 @@ WalSndKill(int code, Datum arg)
 	SpinLockRelease(&walsnd->mutex);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
-
-		if (sendSeg->ws_file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->ws_file >= 0)
-				close(sendSeg->ws_file);
-
-			XLByteToSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->ws_tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
-				if (sendSeg->ws_segno == endSegNo)
-					sendSeg->ws_tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->ws_tli, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			sendSeg->ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->ws_file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->ws_off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->ws_off != startoff)
-		{
-			if (lseek(sendSeg->ws_file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-								startoff)));
-			sendSeg->ws_off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segcxt->ws_segsize - startoff))
-			segbytes = segcxt->ws_segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->ws_file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->ws_off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, segcxt->ws_segsize);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->ws_file >= 0)
-		{
-			close(sendSeg->ws_file);
-			sendSeg->ws_file = -1;
-
-			goto retry;
-		}
-	}
-}
-
 /*
  * Callback for XLogRead() to open the next segment.
  */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 4c49d1acf4..39686c235c 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -321,128 +321,6 @@ WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 					fname, strerror(errno));
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * XLogReader read_page callback
  */
-- 
2.20.1

#38

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Antonin Houska (#37)

Re: Attempt to consolidate reading of XLOG page

On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote:

This is the next version.

So... These are the two last bits to look at, reducing a bit the code
size:
5 files changed, 396 insertions(+), 419 deletions(-)

And what this patch set does is to refactor the routines we use now in
xlogreader.c to read a page by having new callbacks to open a segment,
as that's basically the only difference between the context of a WAL
sender, pg_waldump and recovery.

Here are some comments reading through the code.

+    * Note that XLogRead(), if used, should have updated the "seg" too for
+    * its own reasons, however we cannot rely on ->read_page() to call
+    * XLogRead().
Why?

Your patch removes all the three optional lseek() calls which can
happen in a segment. Am I missing something but isn't that plain
wrong? You could reuse the error context for that as well if an error
happens as what's needed is basically the segment name and the LSN
offset.

All the callers of XLogReadProcessError() are in src/backend/, so it
seems to me that there is no point to keep that in xlogreader.c but it
should be instead in xlogutils.c, no? It seems to me that this is
more like XLogGenerateError, or just XLogError(). We have been using
xlog as an acronym in many places of the code, so switching now to wal
just for the past matter of the pg_xlog -> pg_wal switch does not seem
worth bothering.

+read_local_xlog_page_segment_open(XLogSegNo nextSegNo,
+                                 WALSegmentContext *segcxt,
Could you think about a more simple name here?  It is a callback to
open a new segment, so it seems to me that we could call it just
open_segment_callback().  There is also no point in using a pointer to
the TLI, no?

+ * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf',
+ * starting at location 'startptr'. 'seg' is the last segment used,
+ * 'openSegment' is a callback to opens the next segment if needed and
+ * 'segcxt' is additional segment info that does not fit into 'seg'.
A typo and the last part of the last sentence could be better worded.
--
Michael

#39

Antonin Houska

ah@cybertec.at

about 6 years ago

In reply to: Michael Paquier (#38)

2 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote:

This is the next version.

So... These are the two last bits to look at, reducing a bit the code
size:
5 files changed, 396 insertions(+), 419 deletions(-)

And what this patch set does is to refactor the routines we use now in
xlogreader.c to read a page by having new callbacks to open a segment,
as that's basically the only difference between the context of a WAL
sender, pg_waldump and recovery.

Here are some comments reading through the code.
+    * Note that XLogRead(), if used, should have updated the "seg" too for
+    * its own reasons, however we cannot rely on ->read_page() to call
+    * XLogRead().
Why?

I've updated the comment:

+       /*
+        * Update read state information.
+        *
+        * If XLogRead() is was called by ->read_page, it should have updated the
+        * ->seg fields accordingly (since we never request more than a single
+        * page, neither ws_segno nor ws_off should have advanced beyond
+        * targetSegNo and targetPageOff respectively). However it's not mandatory
+        * for ->read_page to call XLogRead().
+        */

Besides what I say here, I'm not sure if we should impose additional
requirement on the existing callbacks (possibly those in extensions) to update
the XLogReaderState.seg structure.

Your patch removes all the three optional lseek() calls which can
happen in a segment. Am I missing something but isn't that plain
wrong? You could reuse the error context for that as well if an error
happens as what's needed is basically the segment name and the LSN
offset.

Explicit call of lseek() is not used because XLogRead() uses pg_pread()
now. Nevertheless I found out that in the the last version of the patch I set
ws_off to 0 for a newly opened segment. This was wrong, fixed now.

All the callers of XLogReadProcessError() are in src/backend/, so it
seems to me that there is no point to keep that in xlogreader.c but it
should be instead in xlogutils.c, no? It seems to me that this is
more like XLogGenerateError, or just XLogError(). We have been using
xlog as an acronym in many places of the code, so switching now to wal
just for the past matter of the pg_xlog -> pg_wal switch does not seem
worth bothering.

ok, moved to xlogutils.c and renamed to XLogReadRaiseError(). I think the
"Read" word should be there because many other error can happen during XLOG
processing.

+read_local_xlog_page_segment_open(XLogSegNo nextSegNo,
+                                 WALSegmentContext *segcxt,
Could you think about a more simple name here?  It is a callback to
open a new segment, so it seems to me that we could call it just
open_segment_callback().

ok, the function is not exported to other modules, so there's no need to care
about uniqueness of the name. I chose wal_segment_open(), according to the
callback type WALSegmentOpen.

There is also no point in using a pointer to the TLI, no?

This particular callback makes no decision about the TLI, so it only uses
tli_p as an input argument.

+ * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf',
+ * starting at location 'startptr'. 'seg' is the last segment used,
+ * 'openSegment' is a callback to opens the next segment if needed and
+ * 'segcxt' is additional segment info that does not fit into 'seg'.
A typo and the last part of the last sentence could be better worded.

ok, adjusted a bit.

Thanks for review.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v09-0005-Use-only-xlogreader.c-XLogRead.patchtext/x-diffDownload

From 01f5cc8b0c1e6133e16242ec99957a78551058a7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 11 Nov 2019 12:07:22 +0200
Subject: [PATCH 1/2] Use only xlogreader.c:XLogRead()

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.
---
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 6d8581919c..bbdb131525 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlogrecord.h"
 #include "access/xlog_internal.h"
@@ -27,6 +29,7 @@
 
 #ifndef FRONTEND
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -626,7 +629,15 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))
 		goto err;
 
-	/* update read state information */
+	/*
+	 * Update read state information.
+	 *
+	 * If XLogRead() is was called by ->read_page, it should have updated the
+	 * ->seg fields accordingly (since we never request more than a single
+	 * page, neither ws_segno nor ws_off should have advanced beyond
+	 * targetSegNo and targetPageOff respectively). However it's not mandatory
+	 * for ->read_page to call XLogRead().
+	 */
 	state->seg.ws_segno = targetSegNo;
 	state->seg.ws_off = targetPageOff;
 	state->readLen = readLen;
@@ -1016,6 +1027,101 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf',
+ * starting at location 'startptr'. 'seg' is the last segment used,
+ * 'openSegment' is a callback to open the next segment and 'segcxt' is
+ * additional segment info that does not fit into 'seg'.
+ *
+ * 'errinfo' should point to XLogReadError structure which will receive error
+ * details in case the read fails.
+ *
+ * Returns true if succeeded, false if failed.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+bool
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
+		 WALOpenSegment *seg, WALSegmentContext *segcxt,
+		 WALSegmentOpen openSegment, XLogReadError *errinfo)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		int			segbytes;
+		int			readbytes;
+
+		seg->ws_off = XLogSegmentOffset(recptr, segcxt->ws_segsize);
+
+		if (seg->ws_file < 0 ||
+			!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
+			tli != seg->ws_tli)
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (seg->ws_file >= 0)
+				close(seg->ws_file);
+
+			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, segcxt, &tli, &seg->ws_file);
+
+			/* Update the current segment info. */
+			seg->ws_tli = tli;
+			seg->ws_segno = nextSegNo;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (segcxt->ws_segsize - seg->ws_off))
+			segbytes = segcxt->ws_segsize - seg->ws_off;
+		else
+			segbytes = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/*
+		 * Failure to read the data does not necessarily imply non-zero errno.
+		 * Set it to zero so that caller can distinguish the failure that does
+		 * not affect errno.
+		 */
+		errno = 0;
+
+		readbytes = pg_pread(seg->ws_file, p, segbytes, seg->ws_off);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+		{
+			errinfo->read_errno = errno;
+			errinfo->readbytes = readbytes;
+			errinfo->reqbytes = segbytes;
+			errinfo->seg = seg;
+			return false;
+		}
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+	}
+
+	return true;
+}
+
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba75d..31c0101d4b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -896,6 +896,37 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				 TimeLineID *tli_p, int *file_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			file;
+
+	XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+	file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (file < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	*file_p = file;
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -913,7 +944,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 {
 	XLogRecPtr	read_upto,
 				loc;
+	TimeLineID	tli;
 	int			count;
+	XLogReadError errinfo;
 
 	loc = targetPagePtr + reqLen;
 
@@ -932,7 +965,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		state->seg.ws_tli = ThisTimeLineID;
+		tli = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -982,14 +1015,14 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = state->currTLIValidUntil;
 
 			/*
-			 * Setting ws_tli to our wanted record's TLI is slightly wrong;
-			 * the page might begin on an older timeline if it contains a
-			 * timeline switch, since its xlog segment will have been copied
-			 * from the prior timeline. This is pretty harmless though, as
-			 * nothing cares so long as the timeline doesn't go backwards.  We
-			 * should read the page header instead; FIXME someday.
+			 * Setting tli to our wanted record's TLI is slightly wrong; the
+			 * page might begin on an older timeline if it contains a timeline
+			 * switch, since its xlog segment will have been copied from the
+			 * prior timeline. This is pretty harmless though, as nothing
+			 * cares so long as the timeline doesn't go backwards.  We should
+			 * read the page header instead; FIXME someday.
 			 */
-			state->seg.ws_tli = state->currTLI;
+			tli = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1020,9 +1053,41 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
-			 XLOG_BLCKSZ);
+	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
+				  &state->segcxt, wal_segment_open, &errinfo))
+		XLogReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
 	return count;
 }
+
+#ifndef FRONTEND
+/*
+ * Backend-specific convenience code to handle read errors encountered by
+ * XLogRead().
+ */
+void
+XLogReadRaiseError(XLogReadError *errinfo)
+{
+	WALOpenSegment *seg = errinfo->seg;
+
+	if (errinfo->readbytes < 0)
+	{
+		errno = errinfo->read_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						XLogFileNameP(seg->ws_tli, seg->ws_segno),
+						seg->ws_off, (Size) errinfo->reqbytes)));
+	}
+	else
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: length %zu",
+						XLogFileNameP(seg->ws_tli, seg->ws_segno),
+						seg->ws_off,
+						(Size) errinfo->reqbytes)));
+	}
+}
+#endif
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7f5671504f..dd39fa9d17 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,9 +248,13 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
+static void WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+							  TimeLineID *tli_p, int *file_p);
 
 
+static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
+						XLogRecPtr startptr, Size count);
+
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -766,6 +770,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 {
 	XLogRecPtr	flushptr;
 	int			count;
+	XLogReadError errinfo;
+	XLogSegNo	segno;
 
 	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
 	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
@@ -786,7 +792,27 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(sendCxt, cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if (!XLogRead(cur_page,
+				  targetPagePtr,
+				  XLOG_BLCKSZ,
+				  sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				  sendSeg,
+				  sendCxt,
+				  WalSndSegmentOpen,
+				  &errinfo))
+		XLogReadRaiseError(&errinfo);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(targetPagePtr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
 
 	return count;
 }
@@ -2362,7 +2388,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2535,6 +2561,71 @@ retry:
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+void
+WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				  TimeLineID *tli_p, int *file_p)
+{
+	char		path[MAXPGPATH];
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	*tli_p = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
+		if (sendSeg->ws_segno == endSegNo)
+			*tli_p = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, *tli_p, nextSegNo, segcxt->ws_segsize);
+	*file_p = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (*file_p < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(*tli_p, nextSegNo))));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+}
+
 /*
  * Send out the WAL in its normal physical/stored form.
  *
@@ -2552,6 +2643,8 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
+	XLogReadError errinfo;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2767,7 +2860,49 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(sendCxt, &output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if (!XLogRead(&output_message.data[output_message.len],
+				  startptr,
+				  nbytes,
+				  sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				  sendSeg,
+				  sendCxt,
+				  WalSndSegmentOpen,
+				  &errinfo))
+		XLogReadRaiseError(&errinfo);
+
+	/* See logical_read_xlog_page(). */
+	XLByteToSeg(startptr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->ws_file >= 0)
+		{
+			close(sendSeg->ws_file);
+			sendSeg->ws_file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 1524e5eb1e..75142811b1 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -280,6 +280,46 @@ identify_target_directory(char *directory, char *fname)
 	return NULL;				/* not reached */
 }
 
+static void
+WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				   TimeLineID *tli_p, int *file_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		fname[MAXPGPATH];
+	int			tries;
+
+	XLogFileName(fname, tli, nextSegNo, segcxt->ws_segsize);
+
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
+	{
+		*file_p = open_file_in_directory(segcxt->ws_dir, fname);
+		if (*file_p >= 0)
+			break;
+		if (errno == ENOENT)
+		{
+			int			save_errno = errno;
+
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
+			errno = save_errno;
+			continue;
+		}
+		/* Any other error, fall through and fail */
+		break;
+	}
+
+	if (*file_p < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
+}
+
 /*
  * Read count bytes from a segment file in the specified directory, for the
  * given timeline, containing the specified record pointer; store the data in
@@ -411,6 +451,7 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
+	XLogReadError errinfo;
 
 	if (private->endptr != InvalidXLogRecPtr)
 	{
@@ -425,8 +466,23 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(state->segcxt.ws_dir, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if (!XLogRead(readBuff, targetPagePtr, count, private->timeline,
+				  &state->seg, &state->segcxt, WALDumpOpenSegment, &errinfo))
+	{
+		WALOpenSegment *seg = errinfo.seg;
+		char		fname[MAXPGPATH];
+
+		XLogFileName(fname, seg->ws_tli, seg->ws_segno,
+					 state->segcxt.ws_segsize);
+
+		if (errno != 0)
+			fatal_error("could not read from log file %s, offset %u, length %zu: %s",
+						fname, seg->ws_off, (Size) errinfo.reqbytes,
+						strerror(errinfo.read_errno));
+		else
+			fatal_error("could not read from log file %s, offset %u: length: %zu",
+						fname, seg->ws_off, (Size) errinfo.reqbytes);
+	}
 
 	return count;
 }
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1bbee386e8..f79f2c6980 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -218,6 +218,26 @@ extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 extern void XLogReaderFree(XLogReaderState *state);
 
 /* Initialize supporting structures */
+/*
+ * Callback to open the specified WAL segment for reading.
+ *
+ * "nextSegNo" is the number of the segment to be opened.
+ *
+ * "segcxt" is additional information about the segment.
+ *
+ * "tli_p" is an input/output argument. XLogRead() uses it to pass the
+ * timeline in which the new segment should be found, but the callback can use
+ * it to return the TLI that it actually opened.
+ *
+ * "file_p" points to an address the segment file descriptor should be stored
+ * at.
+ *
+ * BasicOpenFile() is the preferred way to open the segment file in backend
+ * code, whereas open(2) should be used in frontend.
+ */
+typedef void (*WALSegmentOpen) (XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+								TimeLineID *tli_p, int *file_p);
+
 extern void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 							   int segsize, const char *waldir);
 
@@ -232,6 +252,25 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+/*
+ * Error information that both backend and frontend caller can process.
+ *
+ * XXX Should the name be WALReadError? If so, we probably need to rename
+ * XLogRead() and XLogReadProcessError() too.
+ */
+typedef struct XLogReadError
+{
+	int			read_errno;		/* errno set by the last read(). */
+	int			readbytes;		/* Bytes read by the last read(). */
+	int			reqbytes;		/* Bytes requested to be read. */
+	WALOpenSegment *seg;		/* Segment we tried to read from. */
+} XLogReadError;
+
+extern bool XLogRead(char *buf, XLogRecPtr startptr, Size count,
+					 TimeLineID tli, WALOpenSegment *seg,
+					 WALSegmentContext *segcxt, WALSegmentOpen openSegment,
+					 XLogReadError *errinfo);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 2df98e45b2..5c0cc910d7 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -54,4 +54,8 @@ extern int	read_local_xlog_page(XLogReaderState *state,
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 									  XLogRecPtr wantPage, uint32 wantLength);
 
+
+#ifndef FRONTEND
+void		XLogReadRaiseError(XLogReadError *errinfo);
+#endif
 #endif

v09-0006-Remove-the-old-implemenations-of-XLogRead.patchtext/x-diffDownload

From c7d4d390dc12978dd947dfe97691fbfbe1209e12 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Fri, 11 Oct 2019 12:07:22 +0200
Subject: [PATCH 2/2] Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the WALSegmentOpen callback) is added
nearby at the same time (the addition and removal of code can get mixed in the
diff).
---
 src/backend/access/transam/xlogutils.c | 122 ----------------
 src/backend/replication/walsender.c    | 188 -------------------------
 src/bin/pg_waldump/pg_waldump.c        | 122 ----------------
 3 files changed, 432 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 007974ea99..0211a68640 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -639,128 +639,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b21b143f99..76a5477389 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -252,9 +252,6 @@ static void WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 							  TimeLineID *tli_p, int *file_p);
 
 
-static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
-						XLogRecPtr startptr, Size count);
-
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -2377,191 +2374,6 @@ WalSndKill(int code, Datum arg)
 	SpinLockRelease(&walsnd->mutex);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
-
-		if (sendSeg->ws_file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->ws_file >= 0)
-				close(sendSeg->ws_file);
-
-			XLByteToSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->ws_tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
-				if (sendSeg->ws_segno == endSegNo)
-					sendSeg->ws_tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->ws_tli, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			sendSeg->ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->ws_file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->ws_off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->ws_off != startoff)
-		{
-			if (lseek(sendSeg->ws_file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-								startoff)));
-			sendSeg->ws_off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segcxt->ws_segsize - startoff))
-			segbytes = segcxt->ws_segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->ws_file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->ws_off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, segcxt->ws_segsize);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->ws_file >= 0)
-		{
-			close(sendSeg->ws_file);
-			sendSeg->ws_file = -1;
-
-			goto retry;
-		}
-	}
-}
-
 /*
  * Callback for XLogRead() to open the next segment.
  */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 4c49d1acf4..39686c235c 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -321,128 +321,6 @@ WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 					fname, strerror(errno));
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * XLogReader read_page callback
  */
-- 
2.20.1

#40

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Antonin Houska (#39)

Re: Attempt to consolidate reading of XLOG page

On Mon, Nov 11, 2019 at 04:25:56PM +0100, Antonin Houska wrote:

On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote:
+       /*
+        * Update read state information.
+        *
+        * If XLogRead() is was called by ->read_page, it should have updated the
+        * ->seg fields accordingly (since we never request more than a single
+        * page, neither ws_segno nor ws_off should have advanced beyond
+        * targetSegNo and targetPageOff respectively). However it's not mandatory
+        * for ->read_page to call XLogRead().
+        */
Besides what I say here, I'm not sure if we should impose additional
requirement on the existing callbacks (possibly those in extensions) to update
the XLogReaderState.seg structure.

"is was called" does not make sense in this sentence. Actually, I
would tend to just remove it completely.

Your patch removes all the three optional lseek() calls which can
happen in a segment. Am I missing something but isn't that plain
wrong? You could reuse the error context for that as well if an error
happens as what's needed is basically the segment name and the LSN
offset.

Explicit call of lseek() is not used because XLogRead() uses pg_pread()
now. Nevertheless I found out that in the the last version of the patch I set
ws_off to 0 for a newly opened segment. This was wrong, fixed now.

Missed that part, thanks. This was actually not obvious after an
initial lookup of the patch. Wouldn't it make sense to split that
part in a separate patch that we could review and get committed first
then? It would have the advantage to make the rest easier to review
and follow. And using pread is actually better for performance
compared to read+lseek. Now there is also the argument that we don't
always seek into an opened WAL segment, and that a plain read() is
actually better than pread() in some cases.

ok, moved to xlogutils.c and renamed to XLogReadRaiseError(). I think the
"Read" word should be there because many other error can happen during XLOG
processing.

No issue with this name.

ok, the function is not exported to other modules, so there's no need to care
about uniqueness of the name. I chose wal_segment_open(), according to the
callback type WALSegmentOpen.

Name is fine by me.

There is also no point in using a pointer to the TLI, no?

This particular callback makes no decision about the TLI, so it only uses
tli_p as an input argument.

Missed that walsender.c can enforce the tli to a new value, objection
withdrawn.

+ * BasicOpenFile() is the preferred way to open the segment file in backend
+ * code, whereas open(2) should be used in frontend.
I would remove that sentence.

+#ifndef FRONTEND
+/*
+ * Backend-specific convenience code to handle read errors encountered by
+ * XLogRead().
+ */
+void
+XLogReadRaiseError(XLogReadError *errinfo)
No need for the FRONTEND ifndef's here as xlogutils.c is backend-only.

+#ifndef FRONTEND
+void       XLogReadRaiseError(XLogReadError *errinfo);
+#endif
Same as above, and missing an extern declaration.

+           fatal_error("could not read from log file %s, offset %u, length %zu: %s",
+                       fname, seg->ws_off, (Size) errinfo.reqbytes,
+                       strerror(errinfo.read_errno));
Let's use this occasion to make those error messages more generic to
reduce the pain of translators as the file name lets us know that we
have to deal with a WAL segment.  Here are some suggestions, taking
into account the offset:
- If errno is set: "could not read file \"%s\" at offset %u: %m"
- For partial read: "could not read file \"%s\" at offset %u: read %d of %zu"
--
Michael

#41

Antonin Houska

ah@cybertec.at

about 6 years ago

In reply to: Michael Paquier (#40)

2 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Nov 11, 2019 at 04:25:56PM +0100, Antonin Houska wrote:

On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote:
Your patch removes all the three optional lseek() calls which can
happen in a segment. Am I missing something but isn't that plain
wrong? You could reuse the error context for that as well if an error
happens as what's needed is basically the segment name and the LSN
offset.

Explicit call of lseek() is not used because XLogRead() uses pg_pread()
now. Nevertheless I found out that in the the last version of the patch I set
ws_off to 0 for a newly opened segment. This was wrong, fixed now.

Missed that part, thanks. This was actually not obvious after an
initial lookup of the patch. Wouldn't it make sense to split that
part in a separate patch that we could review and get committed first
then? It would have the advantage to make the rest easier to review
and follow. And using pread is actually better for performance
compared to read+lseek. Now there is also the argument that we don't
always seek into an opened WAL segment, and that a plain read() is
actually better than pread() in some cases.

ok, the next version uses explicit lseek(). Maybe the fact that XLOG is mostly
read sequentially (i.e. without frequent seeks) is the reason pread() has't
been adopted so far.

The new version reflects your other suggestions too, except the one about not
renaming "XLOG" -> "WAL" (actually you mentioned that earlier in the
thread). I recall that when working on the preliminary patch (709d003fbd),
Alvaro suggested "WAL" for some structures because these are new. The rule
seemed to be that "XLOG..." should be left for the existing symbols, while the
new ones should be "WAL...":

/messages/by-id/20190917221521.GA15733@alvherre.pgsql

So I decided to rename the new symbols and to remove the related comment.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v10-0005-Use-only-xlogreader.c-XLogRead.patchtext/x-diffDownload

From 34c0ae891de7dae3b84de33795c5d0521ccc0a88 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 12 Nov 2019 11:51:14 +0100
Subject: [PATCH 1/2] Use only xlogreader.c:XLogRead()

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.
---
 src/backend/access/transam/xlogreader.c | 121 ++++++++++++++++++++
 src/backend/access/transam/xlogutils.c  |  94 ++++++++++++++--
 src/backend/replication/walsender.c     | 143 +++++++++++++++++++++++-
 src/bin/pg_waldump/pg_waldump.c         |  63 ++++++++++-
 src/include/access/xlogreader.h         |  37 ++++++
 src/include/access/xlogutils.h          |   1 +
 6 files changed, 441 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7f24f0cb95..006c6298c9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "access/xlogreader.h"
@@ -27,6 +29,7 @@
 
 #ifndef FRONTEND
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -1015,6 +1018,124 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf',
+ * starting at location 'startptr'. 'seg' is the last segment used,
+ * 'openSegment' is a callback to open the next segment and 'segcxt' is
+ * additional segment info that does not fit into 'seg'.
+ *
+ * 'errinfo' should point to XLogReadError structure which will receive error
+ * details in case the read fails.
+ *
+ * Returns true if succeeded, false if failed.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+bool
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
+		 WALOpenSegment *seg, WALSegmentContext *segcxt,
+		 WALSegmentOpen openSegment, WALReadError *errinfo)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		uint32		startoff;
+		int			segbytes;
+		int			readbytes;
+
+		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
+
+		if (seg->ws_file < 0 ||
+			!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
+			tli != seg->ws_tli)
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (seg->ws_file >= 0)
+				close(seg->ws_file);
+
+			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
+
+			/* Open the next segment in the caller's way. */
+			openSegment(nextSegNo, segcxt, &tli, &seg->ws_file);
+
+			/* Update the current segment info. */
+			seg->ws_tli = tli;
+			seg->ws_segno = nextSegNo;
+			seg->ws_off = 0;
+		}
+
+		/* Need to seek in the file? */
+		if (seg->ws_off != startoff)
+		{
+			/*
+			 * Update ws_off unconditionally, it will be useful for error
+			 * message too.
+			 */
+			seg->ws_off = startoff;
+
+			if (lseek(seg->ws_file, (off_t) startoff, SEEK_SET) < 0)
+			{
+				errinfo->xlr_seek = true;
+				errinfo->xlr_errno = errno;
+				errinfo->xlr_req = 0;
+				errinfo->xlr_read = 0;
+				errinfo->xlr_seg = seg;
+				return false;
+			}
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (segcxt->ws_segsize - seg->ws_off))
+			segbytes = segcxt->ws_segsize - seg->ws_off;
+		else
+			segbytes = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/*
+		 * Failure to read the data does not necessarily imply non-zero errno.
+		 * Set it to zero so that caller can distinguish the failure that does
+		 * not affect errno.
+		 */
+		errno = 0;
+
+		readbytes = read(seg->ws_file, p, segbytes);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+		{
+			errinfo->xlr_seek = false;
+			errinfo->xlr_errno = errno;
+			errinfo->xlr_req = segbytes;
+			errinfo->xlr_read = readbytes;
+			errinfo->xlr_seg = seg;
+			return false;
+		}
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+	}
+
+	return true;
+}
+
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba75d..18e436f4fd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -896,6 +896,37 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+static void
+wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				 TimeLineID *tli_p, int *file_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			file;
+
+	XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+	file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (file < 0)
+	{
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							path)));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+
+	*file_p = file;
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -913,7 +944,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 {
 	XLogRecPtr	read_upto,
 				loc;
+	TimeLineID	tli;
 	int			count;
+	WALReadError errinfo;
 
 	loc = targetPagePtr + reqLen;
 
@@ -932,7 +965,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		state->seg.ws_tli = ThisTimeLineID;
+		tli = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -982,14 +1015,14 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = state->currTLIValidUntil;
 
 			/*
-			 * Setting ws_tli to our wanted record's TLI is slightly wrong;
-			 * the page might begin on an older timeline if it contains a
-			 * timeline switch, since its xlog segment will have been copied
-			 * from the prior timeline. This is pretty harmless though, as
-			 * nothing cares so long as the timeline doesn't go backwards.  We
-			 * should read the page header instead; FIXME someday.
+			 * Setting tli to our wanted record's TLI is slightly wrong; the
+			 * page might begin on an older timeline if it contains a timeline
+			 * switch, since its xlog segment will have been copied from the
+			 * prior timeline. This is pretty harmless though, as nothing
+			 * cares so long as the timeline doesn't go backwards.  We should
+			 * read the page header instead; FIXME someday.
 			 */
-			state->seg.ws_tli = state->currTLI;
+			tli = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1020,9 +1053,46 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
-			 XLOG_BLCKSZ);
+	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
+				  &state->segcxt, wal_segment_open, &errinfo))
+		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
 	return count;
 }
+
+/*
+ * Backend-specific convenience code to handle read errors encountered by
+ * XLogRead().
+ */
+void
+WALReadRaiseError(WALReadError *errinfo)
+{
+	WALOpenSegment *seg = errinfo->xlr_seg;
+	char	   *fname = XLogFileNameP(seg->ws_tli, seg->ws_segno);
+
+	if (errinfo->xlr_seek)
+	{
+		errno = errinfo->xlr_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in log segment %s to offset %u: %m",
+						fname, seg->ws_off)));
+	}
+	else if (errinfo->xlr_read < 0)
+	{
+		errno = errinfo->xlr_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						fname, seg->ws_off, (Size) errinfo->xlr_req)));
+	}
+	else if (errinfo->xlr_read == 0)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+						fname, seg->ws_off, errinfo->xlr_read,
+						(Size) errinfo->xlr_req)));
+	}
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7f5671504f..92a539aa5c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,9 +248,13 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
+static void WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+							  TimeLineID *tli_p, int *file_p);
 
 
+static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
+						XLogRecPtr startptr, Size count);
+
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -766,6 +770,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 {
 	XLogRecPtr	flushptr;
 	int			count;
+	WALReadError errinfo;
+	XLogSegNo	segno;
 
 	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
 	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
@@ -786,7 +792,27 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(sendCxt, cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if (!XLogRead(cur_page,
+				  targetPagePtr,
+				  XLOG_BLCKSZ,
+				  sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				  sendSeg,
+				  sendCxt,
+				  WalSndSegmentOpen,
+				  &errinfo))
+		WALReadRaiseError(&errinfo);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(targetPagePtr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
 
 	return count;
 }
@@ -2362,7 +2388,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2535,6 +2561,71 @@ retry:
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+void
+WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				  TimeLineID *tli_p, int *file_p)
+{
+	char		path[MAXPGPATH];
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	*tli_p = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
+		if (sendSeg->ws_segno == endSegNo)
+			*tli_p = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, *tli_p, nextSegNo, segcxt->ws_segsize);
+	*file_p = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (*file_p < 0)
+	{
+		/*
+		 * If the file is not found, assume it's because the standby asked for
+		 * a too old WAL segment that has already been removed or recycled.
+		 */
+		if (errno == ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("requested WAL segment %s has already been removed",
+							XLogFileNameP(*tli_p, nextSegNo))));
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m",
+							path)));
+	}
+}
+
 /*
  * Send out the WAL in its normal physical/stored form.
  *
@@ -2552,6 +2643,8 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
+	WALReadError errinfo;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2767,7 +2860,49 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(sendCxt, &output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if (!XLogRead(&output_message.data[output_message.len],
+				  startptr,
+				  nbytes,
+				  sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				  sendSeg,
+				  sendCxt,
+				  WalSndSegmentOpen,
+				  &errinfo))
+		WALReadRaiseError(&errinfo);
+
+	/* See logical_read_xlog_page(). */
+	XLByteToSeg(startptr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->ws_file >= 0)
+		{
+			close(sendSeg->ws_file);
+			sendSeg->ws_file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 1524e5eb1e..c0c2590d56 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -280,6 +280,46 @@ identify_target_directory(char *directory, char *fname)
 	return NULL;				/* not reached */
 }
 
+static void
+WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				   TimeLineID *tli_p, int *file_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		fname[MAXPGPATH];
+	int			tries;
+
+	XLogFileName(fname, tli, nextSegNo, segcxt->ws_segsize);
+
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
+	{
+		*file_p = open_file_in_directory(segcxt->ws_dir, fname);
+		if (*file_p >= 0)
+			break;
+		if (errno == ENOENT)
+		{
+			int			save_errno = errno;
+
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
+			errno = save_errno;
+			continue;
+		}
+		/* Any other error, fall through and fail */
+		break;
+	}
+
+	if (*file_p < 0)
+		fatal_error("could not find file \"%s\": %s",
+					fname, strerror(errno));
+}
+
 /*
  * Read count bytes from a segment file in the specified directory, for the
  * given timeline, containing the specified record pointer; store the data in
@@ -411,6 +451,7 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
+	WALReadError errinfo;
 
 	if (private->endptr != InvalidXLogRecPtr)
 	{
@@ -425,8 +466,26 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(state->segcxt.ws_dir, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if (!XLogRead(readBuff, targetPagePtr, count, private->timeline,
+				  &state->seg, &state->segcxt, WALDumpOpenSegment, &errinfo))
+	{
+		WALOpenSegment *seg = errinfo.xlr_seg;
+		char		fname[MAXPGPATH];
+
+		XLogFileName(fname, seg->ws_tli, seg->ws_segno,
+					 state->segcxt.ws_segsize);
+
+		if (errinfo.xlr_seek)
+			fatal_error("could not seek in file %s to offset %u: %s",
+						fname, seg->ws_off, strerror(errinfo.xlr_errno));
+		else if (errinfo.xlr_errno != 0)
+			fatal_error("could not read in file %s, offset %u, length %zu: %s",
+						fname, seg->ws_off, (Size) errinfo.xlr_req,
+						strerror(errinfo.xlr_errno));
+		else
+			fatal_error("could not read in file %s, offset %u: length: %zu",
+						fname, seg->ws_off, (Size) errinfo.xlr_req);
+	}
 
 	return count;
 }
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1bbee386e8..02d1514429 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -218,6 +218,26 @@ extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 extern void XLogReaderFree(XLogReaderState *state);
 
 /* Initialize supporting structures */
+/*
+ * Callback to open the specified WAL segment for reading.
+ *
+ * "nextSegNo" is the number of the segment to be opened.
+ *
+ * "segcxt" is additional information about the segment.
+ *
+ * "tli_p" is an input/output argument. XLogRead() uses it to pass the
+ * timeline in which the new segment should be found, but the callback can use
+ * it to return the TLI that it actually opened.
+ *
+ * "file_p" points to an address the segment file descriptor should be stored
+ * at.
+ *
+ * BasicOpenFile() is the preferred way to open the segment file in backend
+ * code, whereas open(2) should be used in frontend.
+ */
+typedef void (*WALSegmentOpen) (XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+								TimeLineID *tli_p, int *file_p);
+
 extern void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 							   int segsize, const char *waldir);
 
@@ -232,6 +252,23 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+/*
+ * Error information that both backend and frontend caller can process.
+ */
+typedef struct WALReadError
+{
+	bool		xlr_seek;		/* Error during lseek() ? */
+	int			xlr_errno;		/* errno set by the last read() / lseek() */
+	int			xlr_read;		/* Bytes read by the last read(). */
+	int			xlr_req;		/* Bytes requested to be read. */
+	WALOpenSegment *xlr_seg;	/* Segment we tried to read from. */
+} WALReadError;
+
+extern bool XLogRead(char *buf, XLogRecPtr startptr, Size count,
+					 TimeLineID tli, WALOpenSegment *seg,
+					 WALSegmentContext *segcxt, WALSegmentOpen openSegment,
+					 WALReadError *errinfo);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 2df98e45b2..1ffc765c6a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -54,4 +54,5 @@ extern int	read_local_xlog_page(XLogReaderState *state,
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 									  XLogRecPtr wantPage, uint32 wantLength);
 
+void		WALReadRaiseError(WALReadError *errinfo);
 #endif
-- 
2.20.1

v10-0006-Remove-the-old-implemenations-of-XLogRead.patchtext/x-diffDownload

From 323f34e6193b8682c3ff5ffa5a1f1bf95cb36aa0 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 12 Nov 2019 11:51:14 +0100
Subject: [PATCH 2/2] Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the WALSegmentOpen callback) is added
nearby at the same time (the addition and removal of code can get mixed in the
diff).
---
 src/backend/access/transam/xlogutils.c | 122 ----------------
 src/backend/replication/walsender.c    | 188 -------------------------
 src/bin/pg_waldump/pg_waldump.c        | 122 ----------------
 3 files changed, 432 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 18e436f4fd..98118f484e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -639,128 +639,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 92a539aa5c..80ef5eb909 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -252,9 +252,6 @@ static void WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 							  TimeLineID *tli_p, int *file_p);
 
 
-static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
-						XLogRecPtr startptr, Size count);
-
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -2376,191 +2373,6 @@ WalSndKill(int code, Datum arg)
 	SpinLockRelease(&walsnd->mutex);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
-
-		if (sendSeg->ws_file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->ws_file >= 0)
-				close(sendSeg->ws_file);
-
-			XLByteToSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->ws_tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
-				if (sendSeg->ws_segno == endSegNo)
-					sendSeg->ws_tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->ws_tli, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			sendSeg->ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->ws_file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->ws_off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->ws_off != startoff)
-		{
-			if (lseek(sendSeg->ws_file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-								startoff)));
-			sendSeg->ws_off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segcxt->ws_segsize - startoff))
-			segbytes = segcxt->ws_segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->ws_file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->ws_off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, segcxt->ws_segsize);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->ws_file >= 0)
-		{
-			close(sendSeg->ws_file);
-			sendSeg->ws_file = -1;
-
-			goto retry;
-		}
-	}
-}
-
 /*
  * Callback for XLogRead() to open the next segment.
  */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index c0c2590d56..f1908a8868 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -320,128 +320,6 @@ WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 					fname, strerror(errno));
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * XLogReader read_page callback
  */
-- 
2.20.1

#42

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Antonin Houska (#41)

Re: Attempt to consolidate reading of XLOG page

On 2019-Nov-12, Antonin Houska wrote:

ok, the next version uses explicit lseek(). Maybe the fact that XLOG is mostly
read sequentially (i.e. without frequent seeks) is the reason pread() has't
been adopted so far.

I don't quite understand why you backed off from switching to pread. It
seemed a good change to me.

Here's a few edits on top of your latest.

The new routine WALRead() is not at all the same as the previous
XLogRead, so I don't see why we would keep the name. Hence renamed.

I see no reason for the openSegment callback to return the FD in an out
param instead of straight return value. Changed that way.

Having seek/open be a boolean "xlr_seek" seems a bit weird. Changed to
an "operation" enum. (Maybe if we go back to pg_pread we can get rid of
this.) Accordingly, change WALReadRaiseError and WALDumpReadPage.

Change xlr_seg to be a struct rather than pointer to struct. It seems a
bit dangerous to me to return a pointer that we don't know is going to
be valid at raise-error time. Struct assignment works fine for the
purpose.

Renamed XLogDumpReadPage to WALDumpReadPage, because what the heck is
XLogDump anyway? That doesn't exist.

I would only like to switch this back to pg_pread() (from seek/read) and
I'd be happy to commit this.

What is logical_read_local_xlog_page all about? Seems useless. Let's
get rid of it.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#43

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Alvaro Herrera (#42)

1 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Attachment.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

xlogread-extra.patchtext/x-diff; charset=us-asciiDownload

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 006c6298c9..e7a504c304 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -298,8 +298,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	 * byte to cover the whole record header, or at least the part of it that
 	 * fits on the same page.
 	 */
-	readOff = ReadPageInternal(state,
-							   targetPagePtr,
+	readOff = ReadPageInternal(state, targetPagePtr,
 							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
 	if (readOff < 0)
 		goto err;
@@ -1019,23 +1018,22 @@ out:
 #endif							/* FRONTEND */
 
 /*
- * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf',
- * starting at location 'startptr'. 'seg' is the last segment used,
- * 'openSegment' is a callback to open the next segment and 'segcxt' is
- * additional segment info that does not fit into 'seg'.
+ * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
+ * fetched from timeline 'tli'.
  *
- * 'errinfo' should point to XLogReadError structure which will receive error
- * details in case the read fails.
+ * 'seg/segcxt' identify the last segment used.  'openSegment' is a callback
+ * to open the next segment, if necessary.
  *
- * Returns true if succeeded, false if failed.
+ * Returns true if succeeded, false if an error occurs, in which case
+ * 'errinfo' receives error details.
  *
  * XXX probably this should be improved to suck data directly from the
  * WAL buffers when possible.
  */
 bool
-XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
-		 WALOpenSegment *seg, WALSegmentContext *segcxt,
-		 WALSegmentOpen openSegment, WALReadError *errinfo)
+WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
+		WALOpenSegment *seg, WALSegmentContext *segcxt,
+		WALSegmentOpen openSegment, WALReadError *errinfo)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -1066,7 +1064,7 @@ XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
 			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
 
 			/* Open the next segment in the caller's way. */
-			openSegment(nextSegNo, segcxt, &tli, &seg->ws_file);
+			seg->ws_file = openSegment(nextSegNo, segcxt, &tli);
 
 			/* Update the current segment info. */
 			seg->ws_tli = tli;
@@ -1085,11 +1083,11 @@ XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
 
 			if (lseek(seg->ws_file, (off_t) startoff, SEEK_SET) < 0)
 			{
-				errinfo->xlr_seek = true;
+				errinfo->xlr_oper = WRE_OPER_SEEK;
 				errinfo->xlr_errno = errno;
 				errinfo->xlr_req = 0;
 				errinfo->xlr_read = 0;
-				errinfo->xlr_seg = seg;
+				errinfo->xlr_seg = *seg;
 				return false;
 			}
 		}
@@ -1119,11 +1117,11 @@ XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
 
 		if (readbytes <= 0)
 		{
-			errinfo->xlr_seek = false;
+			errinfo->xlr_oper = WRE_OPER_OPEN;
 			errinfo->xlr_errno = errno;
 			errinfo->xlr_req = segbytes;
 			errinfo->xlr_read = readbytes;
-			errinfo->xlr_seg = seg;
+			errinfo->xlr_seg = *seg;
 			return false;
 		}
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 98118f484e..4047be5a0b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -931,8 +931,8 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	if (!XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
-				  &state->segcxt, wal_segment_open, &errinfo))
+	if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
+				 &state->segcxt, wal_segment_open, &errinfo))
 		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
@@ -941,36 +941,41 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 /*
  * Backend-specific convenience code to handle read errors encountered by
- * XLogRead().
+ * WALRead().
  */
 void
 WALReadRaiseError(WALReadError *errinfo)
 {
-	WALOpenSegment *seg = errinfo->xlr_seg;
+	WALOpenSegment *seg = &errinfo->xlr_seg;
 	char	   *fname = XLogFileNameP(seg->ws_tli, seg->ws_segno);
 
-	if (errinfo->xlr_seek)
+	switch (errinfo->xlr_oper)
 	{
-		errno = errinfo->xlr_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not seek in log segment %s to offset %u: %m",
-						fname, seg->ws_off)));
-	}
-	else if (errinfo->xlr_read < 0)
-	{
-		errno = errinfo->xlr_errno;
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-						fname, seg->ws_off, (Size) errinfo->xlr_req)));
-	}
-	else if (errinfo->xlr_read == 0)
-	{
-		ereport(ERROR,
-				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-						fname, seg->ws_off, errinfo->xlr_read,
-						(Size) errinfo->xlr_req)));
+		case WRE_OPER_SEEK:
+			errno = errinfo->xlr_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek in log segment %s to offset %u: %m",
+							fname, seg->ws_off)));
+			break;
+
+		case WRE_OPER_OPEN:
+			if (errinfo->xlr_read < 0)
+			{
+				errno = errinfo->xlr_errno;
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+								fname, seg->ws_off, (Size) errinfo->xlr_req)));
+			}
+			else if (errinfo->xlr_read == 0)
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, seg->ws_off, errinfo->xlr_read,
+								(Size) errinfo->xlr_req)));
+			}
+			break;
 	}
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 80ef5eb909..668537a6fb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,8 +248,8 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
-							  TimeLineID *tli_p, int *file_p);
+static int WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+							  TimeLineID *tli_p);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -789,16 +789,16 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	if (!XLogRead(cur_page,
-				  targetPagePtr,
-				  XLOG_BLCKSZ,
-				  sendSeg->ws_tli,	/* Pass the current TLI because only
+	if (!WALRead(cur_page,
+				 targetPagePtr,
+				 XLOG_BLCKSZ,
+				 sendSeg->ws_tli,	/* Pass the current TLI because only
 									 * WalSndSegmentOpen controls whether new
 									 * TLI is needed. */
-				  sendSeg,
-				  sendCxt,
-				  WalSndSegmentOpen,
-				  &errinfo))
+				 sendSeg,
+				 sendCxt,
+				 WalSndSegmentOpen,
+				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
 	/*
@@ -2376,11 +2376,12 @@ WalSndKill(int code, Datum arg)
 /*
  * Callback for XLogRead() to open the next segment.
  */
-void
+int
 WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
-				  TimeLineID *tli_p, int *file_p)
+				  TimeLineID *tli_p)
 {
 	char		path[MAXPGPATH];
+	int			fd;
 
 	/*-------
 	 * When reading from a historic timeline, and there is a timeline switch
@@ -2417,25 +2418,25 @@ WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 	}
 
 	XLogFilePath(path, *tli_p, nextSegNo, segcxt->ws_segsize);
-	*file_p = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (fd >= 0)
+		return fd;
 
-	if (*file_p < 0)
-	{
-		/*
-		 * If the file is not found, assume it's because the standby asked for
-		 * a too old WAL segment that has already been removed or recycled.
-		 */
-		if (errno == ENOENT)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("requested WAL segment %s has already been removed",
-							XLogFileNameP(*tli_p, nextSegNo))));
-		else
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open file \"%s\": %m",
-							path)));
-	}
+	/*
+	 * If the file is not found, assume it's because the standby asked for
+	 * a too old WAL segment that has already been removed or recycled.
+	 */
+	if (errno == ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("requested WAL segment %s has already been removed",
+						XLogFileNameP(*tli_p, nextSegNo))));
+	else
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	return -1;					/* keep compiler quiet */
 }
 
 /*
@@ -2674,16 +2675,16 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
-	if (!XLogRead(&output_message.data[output_message.len],
-				  startptr,
-				  nbytes,
-				  sendSeg->ws_tli,	/* Pass the current TLI because only
+	if (!WALRead(&output_message.data[output_message.len],
+				 startptr,
+				 nbytes,
+				 sendSeg->ws_tli,	/* Pass the current TLI because only
 									 * WalSndSegmentOpen controls whether new
 									 * TLI is needed. */
-				  sendSeg,
-				  sendCxt,
-				  WalSndSegmentOpen,
-				  &errinfo))
+				 sendSeg,
+				 sendCxt,
+				 WalSndSegmentOpen,
+				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
 	/* See logical_read_xlog_page(). */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f1908a8868..cb71d33f48 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -280,12 +280,13 @@ identify_target_directory(char *directory, char *fname)
 	return NULL;				/* not reached */
 }
 
-static void
+static int
 WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
-				   TimeLineID *tli_p, int *file_p)
+				   TimeLineID *tli_p)
 {
 	TimeLineID	tli = *tli_p;
 	char		fname[MAXPGPATH];
+	int			fd;
 	int			tries;
 
 	XLogFileName(fname, tli, nextSegNo, segcxt->ws_segsize);
@@ -298,9 +299,9 @@ WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 	 */
 	for (tries = 0; tries < 10; tries++)
 	{
-		*file_p = open_file_in_directory(segcxt->ws_dir, fname);
-		if (*file_p >= 0)
-			break;
+		fd = open_file_in_directory(segcxt->ws_dir, fname);
+		if (fd >= 0)
+			return fd;
 		if (errno == ENOENT)
 		{
 			int			save_errno = errno;
@@ -315,17 +316,16 @@ WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 		break;
 	}
 
-	if (*file_p < 0)
-		fatal_error("could not find file \"%s\": %s",
-					fname, strerror(errno));
+	fatal_error("could not find file \"%s\": %s", fname, strerror(errno));
+	return -1;					/* keep compiler quiet */
 }
 
 /*
  * XLogReader read_page callback
  */
 static int
-XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				 XLogRecPtr targetPtr, char *readBuff)
+WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
+				XLogRecPtr targetPtr, char *readBuff)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
@@ -344,25 +344,31 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	if (!XLogRead(readBuff, targetPagePtr, count, private->timeline,
-				  &state->seg, &state->segcxt, WALDumpOpenSegment, &errinfo))
+	if (!WALRead(readBuff, targetPagePtr, count, private->timeline,
+				 &state->seg, &state->segcxt, WALDumpOpenSegment, &errinfo))
 	{
-		WALOpenSegment *seg = errinfo.xlr_seg;
+		WALOpenSegment *seg = &errinfo.xlr_seg;
 		char		fname[MAXPGPATH];
 
 		XLogFileName(fname, seg->ws_tli, seg->ws_segno,
 					 state->segcxt.ws_segsize);
 
-		if (errinfo.xlr_seek)
-			fatal_error("could not seek in file %s to offset %u: %s",
-						fname, seg->ws_off, strerror(errinfo.xlr_errno));
-		else if (errinfo.xlr_errno != 0)
-			fatal_error("could not read in file %s, offset %u, length %zu: %s",
-						fname, seg->ws_off, (Size) errinfo.xlr_req,
-						strerror(errinfo.xlr_errno));
-		else
-			fatal_error("could not read in file %s, offset %u: length: %zu",
-						fname, seg->ws_off, (Size) errinfo.xlr_req);
+		switch (errinfo.xlr_oper)
+		{
+			case WRE_OPER_SEEK:
+				fatal_error("could not seek in file %s to offset %u: %s",
+							fname, seg->ws_off, strerror(errinfo.xlr_errno));
+				break;
+			case WRE_OPER_OPEN:
+				if (errinfo.xlr_errno != 0)
+					fatal_error("could not read in file %s, offset %u, length %zu: %s",
+								fname, seg->ws_off, (Size) errinfo.xlr_req,
+								strerror(errinfo.xlr_errno));
+				else
+					fatal_error("could not read in file %s, offset %u: length: %zu",
+								fname, seg->ws_off, (Size) errinfo.xlr_req);
+				break;
+		}
 	}
 
 	return count;
@@ -1026,7 +1032,7 @@ main(int argc, char **argv)
 	/* done with argument parsing, do the actual work */
 
 	/* we have everything we need, start reading */
-	xlogreader_state = XLogReaderAllocate(WalSegSz, waldir, XLogDumpReadPage,
+	xlogreader_state = XLogReaderAllocate(WalSegSz, waldir, WALDumpReadPage,
 										  &private);
 	if (!xlogreader_state)
 		fatal_error("out of memory");
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 02d1514429..6b51538a62 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -217,9 +217,9 @@ extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
-/* Initialize supporting structures */
 /*
- * Callback to open the specified WAL segment for reading.
+ * Callback to open the specified WAL segment for reading.  Returns a valid
+ * file descriptor when the file was opened successfully.
  *
  * "nextSegNo" is the number of the segment to be opened.
  *
@@ -229,15 +229,13 @@ extern void XLogReaderFree(XLogReaderState *state);
  * timeline in which the new segment should be found, but the callback can use
  * it to return the TLI that it actually opened.
  *
- * "file_p" points to an address the segment file descriptor should be stored
- * at.
- *
  * BasicOpenFile() is the preferred way to open the segment file in backend
  * code, whereas open(2) should be used in frontend.
  */
-typedef void (*WALSegmentOpen) (XLogSegNo nextSegNo, WALSegmentContext *segcxt,
-								TimeLineID *tli_p, int *file_p);
+typedef int (*WALSegmentOpen) (XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+							   TimeLineID *tli_p);
 
+/* Initialize supporting structures */
 extern void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 							   int segsize, const char *waldir);
 
@@ -252,22 +250,29 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+
 /*
- * Error information that both backend and frontend caller can process.
+ * Error information from WALRead that both backend and frontend caller can
+ * process.
  */
+typedef enum
+{
+	WRE_OPER_OPEN = (1 << 0),
+	WRE_OPER_SEEK = (1 << 1)
+} WALReadErrorOper;
 typedef struct WALReadError
 {
-	bool		xlr_seek;		/* Error during lseek() ? */
+	WALReadErrorOper xlr_oper;	/* operation that caused the error */
 	int			xlr_errno;		/* errno set by the last read() / lseek() */
 	int			xlr_read;		/* Bytes read by the last read(). */
 	int			xlr_req;		/* Bytes requested to be read. */
-	WALOpenSegment *xlr_seg;	/* Segment we tried to read from. */
+	WALOpenSegment xlr_seg;		/* Segment we tried to read from. */
 } WALReadError;
 
-extern bool XLogRead(char *buf, XLogRecPtr startptr, Size count,
-					 TimeLineID tli, WALOpenSegment *seg,
-					 WALSegmentContext *segcxt, WALSegmentOpen openSegment,
-					 WALReadError *errinfo);
+extern bool WALRead(char *buf, XLogRecPtr startptr, Size count,
+					TimeLineID tli, WALOpenSegment *seg,
+					WALSegmentContext *segcxt, WALSegmentOpen openSegment,
+					WALReadError *errinfo);
 
 /* Functions for decoding an XLogRecord */
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 1ffc765c6a..3fe5b36748 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -55,4 +55,5 @@ extern void XLogReadDetermineTimeline(XLogReaderState *state,
 									  XLogRecPtr wantPage, uint32 wantLength);
 
 void		WALReadRaiseError(WALReadError *errinfo);
+
 #endif

#44

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Antonin Houska (#41)

Re: Attempt to consolidate reading of XLOG page

BTW ... contrib/test_decoding fails with this patch.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#45

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Alvaro Herrera (#42)

Re: Attempt to consolidate reading of XLOG page

On Fri, Nov 15, 2019 at 06:41:02PM -0300, Alvaro Herrera wrote:

I don't quite understand why you backed off from switching to pread. It
seemed a good change to me.

[...]

Having seek/open be a boolean "xlr_seek" seems a bit weird. Changed to
an "operation" enum. (Maybe if we go back to pg_pread we can get rid of
this.) Accordingly, change WALReadRaiseError and WALDumpReadPage.

This has been quickly mentioned on the thread which has introduced
pread():
/messages/by-id/c2f56d0a-cadd-3df1-ae48-b84dc8128c37@redhat.com

Now, read() > pread() > read()+lseek(), and we don't actually need to
seek into the file for all the cases where we read a WAL page. And on
a platform which uses the fallback implementation, this increases the
number of lseek() calls. I can see as you say that using it directly
in the refactoring can simplify the code.
--
Michael

#46

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Michael Paquier (#45)

Re: Attempt to consolidate reading of XLOG page

On Mon, Nov 18, 2019 at 09:29:03PM +0900, Michael Paquier wrote:

Now, read() > pread() > read()+lseek(), and we don't actually need to
seek into the file for all the cases where we read a WAL page. And on
a platform which uses the fallback implementation, this increases the
number of lseek() calls. I can see as you say that using it directly
in the refactoring can simplify the code.

Putting this point aside, here is the error coming from
contrib/test_decoding/, and this is independent of Alvaro's changes:
+ERROR: invalid magic number 0000 in log segment
000000010000000000000001, offset 6905856

I don't think that this is just xlp_magic messed up, the full page
read is full of zeros. But that's just a guess.

Looking at the code, I am spotting one inconsistency in the way
seg->ws_off is compiled after doing the read on the new version
compared to the three others. read() would move the offset of the
file, but the code is forgetting to increment it by a amount of
readbytes. Isn't that incorrect?

A second thing is that wal_segment_open() definition is incorrect in
xlogutils.c, generating a warning. The opened fd is the returned
result, and not an argument of the routine.

I am switching the patch as waiting on author. Antonin, could you
look at those problems?
--
Michael

#47

Antonin Houska

ah@cybertec.at

about 6 years ago

In reply to: Alvaro Herrera (#42)

2 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Nov-12, Antonin Houska wrote:

ok, the next version uses explicit lseek(). Maybe the fact that XLOG is mostly
read sequentially (i.e. without frequent seeks) is the reason pread() has't
been adopted so far.

I don't quite understand why you backed off from switching to pread. It
seemed a good change to me.

I agreed with Michael that it makes comparison of the old and new code more
difficult, and I also thought that his arguments about performance might be
worthwhile because WAL reading is mostly sequential and does not require many
seeks. However things appear to be more complex, see below.

Here's a few edits on top of your latest.

...

I agree with your renamings.

Change xlr_seg to be a struct rather than pointer to struct. It seems a
bit dangerous to me to return a pointer that we don't know is going to
be valid at raise-error time. Struct assignment works fine for the
purpose.

I would only like to switch this back to pg_pread() (from seek/read) and
I'd be happy to commit this.

I realized that, starting from commit 709d003fbd98b975a4fbcb4c5750fa6efaf9ad87
we use the WALOpenSegment.ws_off field incorrectly in
walsender.c:XLogRead(). In that commit we used this field to replace
XLogReaderState.readOff:

@@ -156,10 +165,9 @@ struct XLogReaderState
char *readBuf;
uint32 readLen;

-       /* last read segment, segment offset, TLI for data currently in readBuf */
-       XLogSegNo       readSegNo;
-       uint32          readOff;
-       TimeLineID      readPageTLI;
+       /* last read XLOG position for data currently in readBuf */
+       WALSegmentContext segcxt;
+       WALOpenSegment seg;

/*
* beginning of prior page read, and its TLI. Doesn't necessarily

Thus we cannot use it in XLogRead() to track the current position in the
segment file. Although walsender.c:XLogRead() misses this point, it's not
broken because walsender.c does not use XLogReaderState at all.

So if explicit lseek() should be used, another field should be added to
WALOpenSegment. I failed to do so when removing the pg_pread() call from the
patch, and that was the reason for the problem reported here:

/messages/by-id/20191117042221.GA16537@alvherre.pgsql
/messages/by-id/20191120083802.GB47145@paquier.xyz

Thus the use of pg_pread() makes the code quite a bit simpler, so I
re-introduced it. If you decide that an explicit lseek() should be used yet,
just let me know.

What is logical_read_local_xlog_page all about? Seems useless. Let's
get rid of it.

It seems so. Should I post a patch for that?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

v11-0005-Use-only-xlogreader.c-XLogRead.patchtext/x-diffDownload

From 52c044cb0f2f27c8714e5a8e41eb0e6da4714fd5 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 20 Nov 2019 15:11:48 +0100
Subject: [PATCH 1/2] Use only xlogreader.c:XLogRead()

The implementations in xlogutils.c and walsender.c are just renamed now, to be
removed by the following diff.
---
 src/backend/access/transam/xlogreader.c | 101 ++++++++++++++-
 src/backend/access/transam/xlogutils.c  |  85 ++++++++++--
 src/backend/replication/walsender.c     | 144 ++++++++++++++++++++-
 src/bin/pg_waldump/pg_waldump.c         | 164 +++++++-----------------
 src/include/access/xlogreader.h         |  36 ++++++
 src/include/access/xlogutils.h          |   2 +
 6 files changed, 399 insertions(+), 133 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7f24f0cb95..0742fcad7f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "access/xlogreader.h"
@@ -27,6 +29,7 @@
 
 #ifndef FRONTEND
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -295,8 +298,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	 * byte to cover the whole record header, or at least the part of it that
 	 * fits on the same page.
 	 */
-	readOff = ReadPageInternal(state,
-							   targetPagePtr,
+	readOff = ReadPageInternal(state, targetPagePtr,
 							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
 	if (readOff < 0)
 		goto err;
@@ -1015,6 +1017,101 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
+ * fetched from timeline 'tli'.
+ *
+ * 'seg/segcxt' identify the last segment used.  'openSegment' is a callback
+ * to open the next segment, if necessary.
+ *
+ * Returns true if succeeded, false if an error occurs, in which case
+ * 'errinfo' receives error details.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+bool
+WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
+		WALOpenSegment *seg, WALSegmentContext *segcxt,
+		WALSegmentOpen openSegment, WALReadError *errinfo)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		uint32		startoff;
+		int			segbytes;
+		int			readbytes;
+
+		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
+
+		if (seg->ws_file < 0 ||
+			!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
+			tli != seg->ws_tli)
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (seg->ws_file >= 0)
+				close(seg->ws_file);
+
+			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
+
+			/* Open the next segment in the caller's way. */
+			seg->ws_file = openSegment(nextSegNo, segcxt, &tli);
+
+			/* Update the current segment info. */
+			seg->ws_tli = tli;
+			seg->ws_segno = nextSegNo;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (segcxt->ws_segsize - startoff))
+			segbytes = segcxt->ws_segsize - startoff;
+		else
+			segbytes = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/*
+		 * Failure to read the data does not necessarily imply non-zero errno.
+		 * Set it to zero so that caller can distinguish the failure that does
+		 * not affect errno.
+		 */
+		errno = 0;
+
+		readbytes = pg_pread(seg->ws_file, p, segbytes, startoff);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+		{
+			errinfo->wre_errno = errno;
+			errinfo->wre_req = segbytes;
+			errinfo->wre_read = readbytes;
+			errinfo->wre_seg = *seg;
+			return false;
+		}
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+	}
+
+	return true;
+}
+
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba75d..baca17260c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -653,8 +653,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * frontend).  Probably these should be merged at some point.
  */
 static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
+XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+			Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -896,6 +896,36 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for WALRead() to open the next segment.
+ */
+static int
+wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				 TimeLineID *tli_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			fd;
+
+	XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (fd >= 0)
+		return fd;
+
+	if (errno == ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("requested WAL segment %s has already been removed",
+						path)));
+	else
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+
+	return -1;					/* keep compiler quiet */
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -913,7 +943,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 {
 	XLogRecPtr	read_upto,
 				loc;
+	TimeLineID	tli;
 	int			count;
+	WALReadError errinfo;
 
 	loc = targetPagePtr + reqLen;
 
@@ -932,7 +964,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		state->seg.ws_tli = ThisTimeLineID;
+		tli = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -982,14 +1014,14 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = state->currTLIValidUntil;
 
 			/*
-			 * Setting ws_tli to our wanted record's TLI is slightly wrong;
-			 * the page might begin on an older timeline if it contains a
-			 * timeline switch, since its xlog segment will have been copied
-			 * from the prior timeline. This is pretty harmless though, as
-			 * nothing cares so long as the timeline doesn't go backwards.  We
-			 * should read the page header instead; FIXME someday.
+			 * Setting tli to our wanted record's TLI is slightly wrong; the
+			 * page might begin on an older timeline if it contains a timeline
+			 * switch, since its xlog segment will have been copied from the
+			 * prior timeline. This is pretty harmless though, as nothing
+			 * cares so long as the timeline doesn't go backwards.  We should
+			 * read the page header instead; FIXME someday.
 			 */
-			state->seg.ws_tli = state->currTLI;
+			tli = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1020,9 +1052,38 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
-			 XLOG_BLCKSZ);
+	if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
+				 &state->segcxt, wal_segment_open, &errinfo))
+		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
 	return count;
 }
+
+/*
+ * Backend-specific convenience code to handle read errors encountered by
+ * WALRead().
+ */
+void
+WALReadRaiseError(WALReadError *errinfo)
+{
+	WALOpenSegment *seg = &errinfo->wre_seg;
+	char	   *fname = XLogFileNameP(seg->ws_tli, seg->ws_segno);
+
+	if (errinfo->wre_read < 0)
+	{
+		errno = errinfo->wre_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						fname, seg->ws_off, (Size) errinfo->wre_req)));
+	}
+	else if (errinfo->wre_read == 0)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+						fname, seg->ws_off, errinfo->wre_read,
+						(Size) errinfo->wre_req)));
+	}
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7f5671504f..11611072b0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,9 +248,13 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
-static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
+static int	WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+							  TimeLineID *tli_p);
 
 
+static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
+						XLogRecPtr startptr, Size count);
+
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -766,6 +770,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 {
 	XLogRecPtr	flushptr;
 	int			count;
+	WALReadError errinfo;
+	XLogSegNo	segno;
 
 	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
 	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
@@ -786,7 +792,27 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(sendCxt, cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if (!WALRead(cur_page,
+				 targetPagePtr,
+				 XLOG_BLCKSZ,
+				 sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				 sendSeg,
+				 sendCxt,
+				 WalSndSegmentOpen,
+				 &errinfo))
+		WALReadRaiseError(&errinfo);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(targetPagePtr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
 
 	return count;
 }
@@ -2362,7 +2388,7 @@ WalSndKill(int code, Datum arg)
  * more than one.
  */
 static void
-XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
+XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -2535,6 +2561,72 @@ retry:
 	}
 }
 
+/*
+ * Callback for XLogRead() to open the next segment.
+ */
+int
+WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				  TimeLineID *tli_p)
+{
+	char		path[MAXPGPATH];
+	int			fd;
+
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	*tli_p = sendTimeLine;
+	if (sendTimeLineIsHistoric)
+	{
+		XLogSegNo	endSegNo;
+
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
+		if (sendSeg->ws_segno == endSegNo)
+			*tli_p = sendTimeLineNextTLI;
+	}
+
+	XLogFilePath(path, *tli_p, nextSegNo, segcxt->ws_segsize);
+	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (fd >= 0)
+		return fd;
+
+	/*
+	 * If the file is not found, assume it's because the standby asked for a
+	 * too old WAL segment that has already been removed or recycled.
+	 */
+	if (errno == ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("requested WAL segment %s has already been removed",
+						XLogFileNameP(*tli_p, nextSegNo))));
+	else
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	return -1;					/* keep compiler quiet */
+}
+
 /*
  * Send out the WAL in its normal physical/stored form.
  *
@@ -2552,6 +2644,8 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
+	WALReadError errinfo;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2767,7 +2861,49 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(sendCxt, &output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if (!WALRead(&output_message.data[output_message.len],
+				 startptr,
+				 nbytes,
+				 sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				 sendSeg,
+				 sendCxt,
+				 WalSndSegmentOpen,
+				 &errinfo))
+		WALReadRaiseError(&errinfo);
+
+	/* See logical_read_xlog_page(). */
+	XLByteToSeg(startptr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->ws_file >= 0)
+		{
+			close(sendSeg->ws_file);
+			sendSeg->ws_file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 1524e5eb1e..d389df00b9 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -280,137 +280,56 @@ identify_target_directory(char *directory, char *fname)
 	return NULL;				/* not reached */
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
+static int
+WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				   TimeLineID *tli_p)
 {
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
+	TimeLineID	tli = *tli_p;
+	char		fname[MAXPGPATH];
+	int			fd;
+	int			tries;
 
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
+	XLogFileName(fname, tli, nextSegNo, segcxt->ws_segsize);
 
-	while (nbytes > 0)
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
 	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
-		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
+		fd = open_file_in_directory(segcxt->ws_dir, fname);
+		if (fd >= 0)
+			return fd;
+		if (errno == ENOENT)
 		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
 			int			save_errno = errno;
 
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-			errno = save_errno;
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
 
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
+			errno = save_errno;
+			continue;
 		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
+		/* Any other error, fall through and fail */
+		break;
 	}
+
+	fatal_error("could not find file \"%s\": %s", fname, strerror(errno));
+	return -1;					/* keep compiler quiet */
 }
 
 /*
  * XLogReader read_page callback
  */
 static int
-XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				 XLogRecPtr targetPtr, char *readBuff)
+WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
+				XLogRecPtr targetPtr, char *readBuff)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
+	WALReadError errinfo;
 
 	if (private->endptr != InvalidXLogRecPtr)
 	{
@@ -425,8 +344,23 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(state->segcxt.ws_dir, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if (!WALRead(readBuff, targetPagePtr, count, private->timeline,
+				 &state->seg, &state->segcxt, WALDumpOpenSegment, &errinfo))
+	{
+		WALOpenSegment *seg = &errinfo.wre_seg;
+		char		fname[MAXPGPATH];
+
+		XLogFileName(fname, seg->ws_tli, seg->ws_segno,
+					 state->segcxt.ws_segsize);
+
+		if (errinfo.wre_errno != 0)
+			fatal_error("could not read in file %s, offset %u, length %zu: %s",
+						fname, seg->ws_off, (Size) errinfo.wre_req,
+						strerror(errinfo.wre_errno));
+		else
+			fatal_error("could not read in file %s, offset %u: length: %zu",
+						fname, seg->ws_off, (Size) errinfo.wre_req);
+	}
 
 	return count;
 }
@@ -1089,7 +1023,7 @@ main(int argc, char **argv)
 	/* done with argument parsing, do the actual work */
 
 	/* we have everything we need, start reading */
-	xlogreader_state = XLogReaderAllocate(WalSegSz, waldir, XLogDumpReadPage,
+	xlogreader_state = XLogReaderAllocate(WalSegSz, waldir, WALDumpReadPage,
 										  &private);
 	if (!xlogreader_state)
 		fatal_error("out of memory");
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1bbee386e8..33273e7327 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -217,6 +217,24 @@ extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/*
+ * Callback to open the specified WAL segment for reading.  Returns a valid
+ * file descriptor when the file was opened successfully.
+ *
+ * "nextSegNo" is the number of the segment to be opened.
+ *
+ * "segcxt" is additional information about the segment.
+ *
+ * "tli_p" is an input/output argument. XLogRead() uses it to pass the
+ * timeline in which the new segment should be found, but the callback can use
+ * it to return the TLI that it actually opened.
+ *
+ * BasicOpenFile() is the preferred way to open the segment file in backend
+ * code, whereas open(2) should be used in frontend.
+ */
+typedef int (*WALSegmentOpen) (XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+							   TimeLineID *tli_p);
+
 /* Initialize supporting structures */
 extern void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 							   int segsize, const char *waldir);
@@ -232,6 +250,24 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+
+/*
+ * Error information from WALRead that both backend and frontend caller can
+ * process.
+ */
+typedef struct WALReadError
+{
+	int			wre_errno;		/* errno set by the last read() / lseek() */
+	int			wre_read;		/* Bytes read by the last read(). */
+	int			wre_req;		/* Bytes requested to be read. */
+	WALOpenSegment wre_seg;		/* Segment we tried to read from. */
+} WALReadError;
+
+extern bool WALRead(char *buf, XLogRecPtr startptr, Size count,
+					TimeLineID tli, WALOpenSegment *seg,
+					WALSegmentContext *segcxt, WALSegmentOpen openSegment,
+					WALReadError *errinfo);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 2df98e45b2..3fe5b36748 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -54,4 +54,6 @@ extern int	read_local_xlog_page(XLogReaderState *state,
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 									  XLogRecPtr wantPage, uint32 wantLength);
 
+void		WALReadRaiseError(WALReadError *errinfo);
+
 #endif
-- 
2.20.1

v11-0006-Remove-the-old-implemenations-of-XLogRead.patchtext/x-diffDownload

From ab0bc03480c0b299e1feca38ccfa4c1ec44c9c7c Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Wed, 20 Nov 2019 15:11:48 +0100
Subject: [PATCH 2/2] Remove the old implemenations of XLogRead().

Done in a separate patch because the diff looks harder to read if one function
(XLogRead) is removed and another one (the WALSegmentOpen callback) is added
nearby at the same time (the addition and removal of code can get mixed in the
diff).
---
 src/backend/access/transam/xlogutils.c | 122 ----------------
 src/backend/replication/walsender.c    | 188 -------------------------
 2 files changed, 310 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index baca17260c..a16d47f156 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -639,128 +639,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogReadOld(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-			Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 11611072b0..d2426b0960 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -252,9 +252,6 @@ static int	WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
 							  TimeLineID *tli_p);
 
 
-static void XLogReadOld(WALSegmentContext *segcxt, char *buf,
-						XLogRecPtr startptr, Size count);
-
 /* Initialize walsender process before entering the main command loop */
 void
 InitWalSender(void)
@@ -2376,191 +2373,6 @@ WalSndKill(int code, Datum arg)
 	SpinLockRelease(&walsnd->mutex);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- */
-static void
-XLogReadOld(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
-
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
-
-		if (sendSeg->ws_file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->ws_file >= 0)
-				close(sendSeg->ws_file);
-
-			XLByteToSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->ws_tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
-				if (sendSeg->ws_segno == endSegNo)
-					sendSeg->ws_tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->ws_tli, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			sendSeg->ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->ws_file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->ws_off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->ws_off != startoff)
-		{
-			if (lseek(sendSeg->ws_file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-								startoff)));
-			sendSeg->ws_off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segcxt->ws_segsize - startoff))
-			segbytes = segcxt->ws_segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->ws_file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->ws_off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, segcxt->ws_segsize);
-	CheckXLogRemoved(segno, ThisTimeLineID);
-
-	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
-	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->ws_file >= 0)
-		{
-			close(sendSeg->ws_file);
-			sendSeg->ws_file = -1;
-
-			goto retry;
-		}
-	}
-}
-
 /*
  * Callback for XLogRead() to open the next segment.
  */
-- 
2.20.1

#48

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Antonin Houska (#47)

Re: Attempt to consolidate reading of XLOG page

On Wed, Nov 20, 2019 at 03:50:29PM +0100, Antonin Houska wrote:

Thus the use of pg_pread() makes the code quite a bit simpler, so I
re-introduced it. If you decide that an explicit lseek() should be used yet,
just let me know.

Skimming through the code, it looks like in a good state. The
failures of test_deconding are fixed, and all the changes from Alvaro
have been added.

+           fatal_error("could not read in file %s, offset %u, length %zu: %s",
+                       fname, seg->ws_off, (Size) errinfo.wre_req,
+                       strerror(errinfo.wre_errno));
You should be able to use %m here instead of strerror().

It seems to me that it is always important to not do changes
completely blindly either so as this does not become an issue for
recovery later on. FWIW, I ran a small set of tests with a WAL
segment sizes of 1MB and 1GB (fsync = off, max_wal_size/min_wal_size
set very high, 1 billion rows in single-column table followed by a
series of updates):
- Created a primary and a standby which archive_mode set.
- Stopped the standby.
- Produced close to 12GB worth of WAL.
- Restarted the standby with restore_command and compared the time it
takes for recovery to complete all the segments with HEAD and your
refactoring:
1GB + HEAD: 7min52s
1GB + patch: 8min10s
1MB + HEAD: 10min17s
1MB + patch: 12min1s

And with WAL segments at 1MB, I was seeing quite a slowdown with the
patch... Then I have done an extra test with pg_waldump with the
segments generated previously with the output redirected to /dev/null.
Going through 512 segments takes 15.730s with HEAD (average of 3 runs)
and 15.851s with the patch.
--
Michael

#49

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Michael Paquier (#48)

Re: Attempt to consolidate reading of XLOG page

On Thu, Nov 21, 2019 at 05:05:50PM +0900, Michael Paquier wrote:

And with WAL segments at 1MB, I was seeing quite a slowdown with the
patch... Then I have done an extra test with pg_waldump with the
segments generated previously with the output redirected to /dev/null.
Going through 512 segments takes 15.730s with HEAD (average of 3 runs)
and 15.851s with the patch.

Here are more tests with pg_waldump and 1MB/1GB segment sizes with
records generated from pgbench, (7 runs, eliminated the two highest
and two lowest, these are the remaining 3 runs as real time):
1) 1MB segment size, 512 segments:
time pg_waldump 000000010000000100000C00 000000010000000100000F00 > /dev/null
- HEAD: 0m4.512s, 0m4.446s, 0m4.501s
- Patch + system's pg_read: 0m4.495s, 0m4.502s, 0m4.486s
- Patch + fallback pg_read: 0m4.505s, 0m4.527s, 0m4.495s
2) 1GB segment size, 3 segments:
time pg_waldump 000000010000000200000001 000000010000000200000003 > /dev/null
- HEAD: 0m11.802s, 0m11.834s, 0m11.846s
- Patch + system's pg_read: 0m11.939s, 0m11.991s, 0m11.966s
- Patch + fallback pg_read: 0m12.054s, 0m12.066s, 0m12.159s
So there is a tendency for a small slowdown here. Still it is not
that much, so I withdraw my concerns.

Another thing:
+void WALReadRaiseError(WALReadError *errinfo);
This is missing an "extern" declaration.

Alvaro, you are marked as a committer of this CF entry. Are you
planning to look at it again? Sorry for the delay from my side.
--
Michael

#50

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Michael Paquier (#49)

Re: Attempt to consolidate reading of XLOG page

On 2019-Nov-22, Michael Paquier wrote:

Alvaro, you are marked as a committer of this CF entry. Are you
planning to look at it again? Sorry for the delay from my side.

Yes :-) hopefully next week. Thanks for reviewing.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#51

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Alvaro Herrera (#50)

Re: Attempt to consolidate reading of XLOG page

On Fri, Nov 22, 2019 at 12:49:33AM -0300, Alvaro Herrera wrote:

Yes :-) hopefully next week. Thanks for reviewing.

Thanks, I am switching the entry as ready for committer then. Please
note that the latest patch series have a conflict at the top of
walsender.c easy enough to resolve, and that the function declaration
in xlogutils.h misses an "extern". I personally find unnecessary the
last sentence in the new comment block of xlogreader.h to describe the
new callback to open a segment about BasicOpenFile() and open()
because one could also use a transient file opened in the backend, but
I'll be fine with anything you think is most fit. That's a minor
point.

Thanks Antonin for doing the refactoring effort.
--
Michael

#52

Antonin Houska

ah@cybertec.at

about 6 years ago

In reply to: Michael Paquier (#49)

Re: Attempt to consolidate reading of XLOG page

Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Nov 21, 2019 at 05:05:50PM +0900, Michael Paquier wrote:

And with WAL segments at 1MB, I was seeing quite a slowdown with the
patch... Then I have done an extra test with pg_waldump with the
segments generated previously with the output redirected to /dev/null.
Going through 512 segments takes 15.730s with HEAD (average of 3 runs)
and 15.851s with the patch.

Here are more tests with pg_waldump and 1MB/1GB segment sizes with
records generated from pgbench, (7 runs, eliminated the two highest
and two lowest, these are the remaining 3 runs as real time):
1) 1MB segment size, 512 segments:
time pg_waldump 000000010000000100000C00 000000010000000100000F00 > /dev/null
- HEAD: 0m4.512s, 0m4.446s, 0m4.501s
- Patch + system's pg_read: 0m4.495s, 0m4.502s, 0m4.486s
- Patch + fallback pg_read: 0m4.505s, 0m4.527s, 0m4.495s
2) 1GB segment size, 3 segments:
time pg_waldump 000000010000000200000001 000000010000000200000003 > /dev/null
- HEAD: 0m11.802s, 0m11.834s, 0m11.846s
- Patch + system's pg_read: 0m11.939s, 0m11.991s, 0m11.966s
- Patch + fallback pg_read: 0m12.054s, 0m12.066s, 0m12.159s
So there is a tendency for a small slowdown here. Still it is not
that much, so I withdraw my concerns.

Thanks for the testing!

I thought that in [1]/messages/by-id/20191121080550.GG153437@paquier.xyz you try discourage me from using pg_pread(), but now it
seems to be the opposite. Ideally I'd like to see no overhead added by my
patch at all, but the code simplicity should matter too.

As a clue, we can perhaps consider the fact that commit c24dcd0c removed
explicit lseek() also from XLogWrite(), but I'm not sure how much we can
compare XLOG writing and reading (I'd expect writing to be a bit less
sequential than reading because XLogWrite() may need to write the last page
more than once.)

Let's wait for Alvaro's judgement.

Another thing:
+void WALReadRaiseError(WALReadError *errinfo);
This is missing an "extern" declaration.

I'll fix this as well as the other problem reported in [1]/messages/by-id/20191121080550.GG153437@paquier.xyz as soon as I know
whether pg_pread() should be used or not.

[1]: /messages/by-id/20191121080550.GG153437@paquier.xyz

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#53

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Antonin Houska (#52)

Re: Attempt to consolidate reading of XLOG page

On 2019-Nov-22, Antonin Houska wrote:

I thought that in [1] you try discourage me from using pg_pread(), but now it
seems to be the opposite. Ideally I'd like to see no overhead added by my
patch at all, but the code simplicity should matter too.

FWIW I think the new code is buggy because it doesn't seem to be setting
ws_off, so I suppose the optimization in ReadPageInternal to skip
reading the page when it's already the page we have is not hit, except
for the first page in the segment. I didn't verify this, just my
impression while reading the code.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#54

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Alvaro Herrera (#53)

Re: Attempt to consolidate reading of XLOG page

On Fri, Nov 22, 2019 at 10:35:51AM -0300, Alvaro Herrera wrote:

FWIW I think the new code is buggy because it doesn't seem to be setting
ws_off, so I suppose the optimization in ReadPageInternal to skip
reading the page when it's already the page we have is not hit, except
for the first page in the segment. I didn't verify this, just my
impression while reading the code.

FWIW, this matches with my impression here, third paragraph:
/messages/by-id/20191120083802.GB47145@paquier.xyz
--
Michael

#55

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Michael Paquier (#54)

Re: Attempt to consolidate reading of XLOG page

On 2019-Nov-22, Michael Paquier wrote:

On Fri, Nov 22, 2019 at 10:35:51AM -0300, Alvaro Herrera wrote:

FWIW I think the new code is buggy because it doesn't seem to be setting
ws_off, so I suppose the optimization in ReadPageInternal to skip
reading the page when it's already the page we have is not hit, except
for the first page in the segment. I didn't verify this, just my
impression while reading the code.

FWIW, this matches with my impression here, third paragraph:
/messages/by-id/20191120083802.GB47145@paquier.xyz

Ah, right.

I was wondering if we shouldn't do away with the concept of "offset" as
such, since the offset there is always forcibly set to the start of a
page. Why don't we count page numbers instead? It seems like the
interface is confusingly generic (measure in bytes) yet not offer any
extra functionality that could not be obtained with a simpler struct
repr (measure in pages).

But then that's not something that we need to change in this patch.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#56

Antonin Houska

ah@cybertec.at

about 6 years ago

In reply to: Alvaro Herrera (#55)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Nov-22, Michael Paquier wrote:

On Fri, Nov 22, 2019 at 10:35:51AM -0300, Alvaro Herrera wrote:

FWIW I think the new code is buggy because it doesn't seem to be setting
ws_off, so I suppose the optimization in ReadPageInternal to skip
reading the page when it's already the page we have is not hit, except
for the first page in the segment. I didn't verify this, just my
impression while reading the code.

FWIW, this matches with my impression here, third paragraph:
/messages/by-id/20191120083802.GB47145@paquier.xyz

Ah, right.

As I pointed out in

/messages/by-id/88183.1574261429@antos

seg.ws_off only replaced readOff in XLogReaderState. So we should only update
ws_off where readOff was updated before commit 709d003. This does happen in
ReadPageInternal (see HEAD) and I see no reason for the final patch to update
ws_off anywhere else.

I was wondering if we shouldn't do away with the concept of "offset" as
such, since the offset there is always forcibly set to the start of a
page. Why don't we count page numbers instead? It seems like the
interface is confusingly generic (measure in bytes) yet not offer any
extra functionality that could not be obtained with a simpler struct
repr (measure in pages).

Yes, I agree that page numbers would be sufficient.

But then that's not something that we need to change in this patch.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#57

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Antonin Houska (#56)

1 attachment(s)

Re: Attempt to consolidate reading of XLOG page

On 2019-Nov-22, Antonin Houska wrote:

As I pointed out in

/messages/by-id/88183.1574261429@antos

seg.ws_off only replaced readOff in XLogReaderState. So we should only update
ws_off where readOff was updated before commit 709d003. This does happen in
ReadPageInternal (see HEAD) and I see no reason for the final patch to update
ws_off anywhere else.

Oh you're right.

I see no reason to leave ws_off. We can move that to XLogReaderState; I
did that here. We also need the offset in WALReadError, though, so I
added it there too. Conceptually it seems clearer to me this way.

What do you think of the attached?

BTW I'm not clear what errors can pread()/pg_pread() report that do not
set errno. I think lines 1083/1084 of WALRead are spurious now.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v12-0001-Use-only-xlogreader.c-WALRead.patchtext/x-diff; charset=us-asciiDownload

From 41d9a060db397aacb2b3f122204c0b4fbb51a109 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Tue, 12 Nov 2019 11:51:14 +0100
Subject: [PATCH v12] Use only xlogreader.c:WALRead()

---
 src/backend/access/transam/xlogreader.c | 104 ++++++++-
 src/backend/access/transam/xlogutils.c  | 207 ++++++----------
 src/backend/replication/walsender.c     | 298 ++++++++++--------------
 src/bin/pg_waldump/pg_waldump.c         | 164 ++++---------
 src/include/access/xlogreader.h         |  39 +++-
 src/include/access/xlogutils.h          |   2 +
 6 files changed, 383 insertions(+), 431 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7f24f0cb95..3692c46f43 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include <unistd.h>
+
 #include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "access/xlogreader.h"
@@ -27,6 +29,7 @@
 
 #ifndef FRONTEND
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "utils/memutils.h"
 #endif
 
@@ -208,7 +211,6 @@ WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 {
 	seg->ws_file = -1;
 	seg->ws_segno = 0;
-	seg->ws_off = 0;
 	seg->ws_tli = 0;
 
 	segcxt->ws_segsize = segsize;
@@ -295,8 +297,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	 * byte to cover the whole record header, or at least the part of it that
 	 * fits on the same page.
 	 */
-	readOff = ReadPageInternal(state,
-							   targetPagePtr,
+	readOff = ReadPageInternal(state, targetPagePtr,
 							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
 	if (readOff < 0)
 		goto err;
@@ -556,7 +557,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 	/* check whether we have all the requested data already */
 	if (targetSegNo == state->seg.ws_segno &&
-		targetPageOff == state->seg.ws_off && reqLen <= state->readLen)
+		targetPageOff == state->segoff && reqLen <= state->readLen)
 		return state->readLen;
 
 	/*
@@ -627,7 +628,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 	/* update read state information */
 	state->seg.ws_segno = targetSegNo;
-	state->seg.ws_off = targetPageOff;
+	state->segoff = targetPageOff;
 	state->readLen = readLen;
 
 	return readLen;
@@ -644,7 +645,7 @@ static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
 	state->seg.ws_segno = 0;
-	state->seg.ws_off = 0;
+	state->segoff = 0;
 	state->readLen = 0;
 }
 
@@ -1015,6 +1016,97 @@ out:
 
 #endif							/* FRONTEND */
 
+/*
+ * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
+ * fetched from timeline 'tli'.
+ *
+ * 'seg/segcxt' identify the last segment used.  'openSegment' is a callback
+ * to open the next segment, if necessary.
+ *
+ * Returns true if succeeded, false if an error occurs, in which case
+ * 'errinfo' receives error details.
+ *
+ * XXX probably this should be improved to suck data directly from the
+ * WAL buffers when possible.
+ */
+bool
+WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
+		WALOpenSegment *seg, WALSegmentContext *segcxt,
+		WALSegmentOpen openSegment, WALReadError *errinfo)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	while (nbytes > 0)
+	{
+		uint32		startoff;
+		int			segbytes;
+		int			readbytes;
+
+		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
+
+		if (seg->ws_file < 0 ||
+			!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
+			tli != seg->ws_tli)
+		{
+			XLogSegNo	nextSegNo;
+
+			/* Switch to another logfile segment */
+			if (seg->ws_file >= 0)
+				close(seg->ws_file);
+
+			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
+
+			/* Open the next segment in the caller's way. */
+			seg->ws_file = openSegment(nextSegNo, segcxt, &tli);
+
+			/* Update the current segment info. */
+			seg->ws_tli = tli;
+			seg->ws_segno = nextSegNo;
+		}
+
+		/* How many bytes are within this segment? */
+		if (nbytes > (segcxt->ws_segsize - startoff))
+			segbytes = segcxt->ws_segsize - startoff;
+		else
+			segbytes = nbytes;
+
+#ifndef FRONTEND
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+#endif
+
+		/* Reset errno first; eases reporting non-errno-affecting errors */
+		errno = 0;
+		readbytes = pg_pread(seg->ws_file, p, segbytes, (off_t) startoff);
+
+#ifndef FRONTEND
+		pgstat_report_wait_end();
+#endif
+
+		if (readbytes <= 0)
+		{
+			errinfo->wre_errno = errno;
+			errinfo->wre_req = segbytes;
+			errinfo->wre_read = readbytes;
+			errinfo->wre_off = startoff;
+			errinfo->wre_seg = *seg;
+			return false;
+		}
+
+		/* Update state for read */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+	}
+
+	return true;
+}
+
 /* ----------------------------------------
  * Functions for decoding the data and block references in a record.
  * ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba75d..9836b6446a 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -639,128 +639,6 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 	forget_invalid_pages(rnode, forkNum, nblocks);
 }
 
-/*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- * in timeline 'tli'.
- *
- * Will open, and keep open, one WAL segment stored in the static file
- * descriptor 'sendFile'. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
- *
- * XXX This is very similar to pg_waldump's XLogDumpXLogRead and to XLogRead
- * in walsender.c but for small differences (such as lack of elog() in
- * frontend).  Probably these should be merged at some point.
- */
-static void
-XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
-		 Size count)
-{
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-
-	/* state maintained across calls */
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static TimeLineID sendTLI = 0;
-	static uint32 sendOff = 0;
-
-	Assert(segsize == wal_segment_size);
-
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
-	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, segsize);
-
-		/* Do we need to switch to a different xlog segment? */
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, segsize) ||
-			sendTLI != tli)
-		{
-			char		path[MAXPGPATH];
-
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, segsize);
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-
-			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-
-			if (sendFile < 0)
-			{
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									path)));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendOff = 0;
-			sendTLI = tli;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				char		path[MAXPGPATH];
-				int			save_errno = errno;
-
-				XLogFilePath(path, tli, sendSegNo, segsize);
-				errno = save_errno;
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								path, startoff)));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segsize - startoff))
-			segbytes = segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendFile, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes <= 0)
-		{
-			char		path[MAXPGPATH];
-			int			save_errno = errno;
-
-			XLogFilePath(path, tli, sendSegNo, segsize);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %lu: %m",
-							path, sendOff, (unsigned long) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
-	}
-}
-
 /*
  * Determine which timeline to read an xlog page from and set the
  * XLogReaderState's currTLI to that timeline ID.
@@ -802,8 +680,8 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
 void
 XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 {
-	const XLogRecPtr lastReadPage = state->seg.ws_segno *
-	state->segcxt.ws_segsize + state->seg.ws_off;
+	const XLogRecPtr lastReadPage = (state->seg.ws_segno *
+									 state->segcxt.ws_segsize + state->segoff);
 
 	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
 	Assert(wantLength <= XLOG_BLCKSZ);
@@ -896,6 +774,36 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/*
+ * Callback for WALRead() to open the next segment.
+ */
+static int
+wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				 TimeLineID *tli_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			fd;
+
+	XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (fd >= 0)
+		return fd;
+
+	if (errno == ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("requested WAL segment %s has already been removed",
+						path)));
+	else
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+
+	return -1;					/* keep compiler quiet */
+}
+
 /*
  * read_page callback for reading local xlog files
  *
@@ -913,7 +821,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 {
 	XLogRecPtr	read_upto,
 				loc;
+	TimeLineID	tli;
 	int			count;
+	WALReadError errinfo;
 
 	loc = targetPagePtr + reqLen;
 
@@ -932,7 +842,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		state->seg.ws_tli = ThisTimeLineID;
+		tli = ThisTimeLineID;
 
 		/*
 		 * Check which timeline to get the record from.
@@ -982,14 +892,14 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			read_upto = state->currTLIValidUntil;
 
 			/*
-			 * Setting ws_tli to our wanted record's TLI is slightly wrong;
-			 * the page might begin on an older timeline if it contains a
-			 * timeline switch, since its xlog segment will have been copied
-			 * from the prior timeline. This is pretty harmless though, as
-			 * nothing cares so long as the timeline doesn't go backwards.  We
-			 * should read the page header instead; FIXME someday.
+			 * Setting tli to our wanted record's TLI is slightly wrong; the
+			 * page might begin on an older timeline if it contains a timeline
+			 * switch, since its xlog segment will have been copied from the
+			 * prior timeline. This is pretty harmless though, as nothing
+			 * cares so long as the timeline doesn't go backwards.  We should
+			 * read the page header instead; FIXME someday.
 			 */
-			state->seg.ws_tli = state->currTLI;
+			tli = state->currTLI;
 
 			/* No need to wait on a historical timeline */
 			break;
@@ -1020,9 +930,38 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
-			 XLOG_BLCKSZ);
+	if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
+				 &state->segcxt, wal_segment_open, &errinfo))
+		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
 	return count;
 }
+
+/*
+ * Backend-specific convenience code to handle read errors encountered by
+ * WALRead().
+ */
+void
+WALReadRaiseError(WALReadError *errinfo)
+{
+	WALOpenSegment *seg = &errinfo->wre_seg;
+	char	   *fname = XLogFileNameP(seg->ws_tli, seg->ws_segno);
+
+	if (errinfo->wre_read < 0)
+	{
+		errno = errinfo->wre_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
+						fname, errinfo->wre_off, (Size) errinfo->wre_req)));
+	}
+	else if (errinfo->wre_read == 0)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+						fname, errinfo->wre_off, errinfo->wre_read,
+						(Size) errinfo->wre_req)));
+	}
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fa75872877..c124f59435 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -248,8 +248,9 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
 static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
+static int	WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+							  TimeLineID *tli_p);
 static void UpdateSpillStats(LogicalDecodingContext *ctx);
-static void XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count);
 
 
 /* Initialize walsender process before entering the main command loop */
@@ -767,6 +768,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 {
 	XLogRecPtr	flushptr;
 	int			count;
+	WALReadError errinfo;
+	XLogSegNo	segno;
 
 	XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
 	sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
@@ -787,7 +790,27 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	XLogRead(sendCxt, cur_page, targetPagePtr, XLOG_BLCKSZ);
+	if (!WALRead(cur_page,
+				 targetPagePtr,
+				 XLOG_BLCKSZ,
+				 sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				 sendSeg,
+				 sendCxt,
+				 WalSndSegmentOpen,
+				 &errinfo))
+		WALReadRaiseError(&errinfo);
+
+	/*
+	 * After reading into the buffer, check that what we read was valid. We do
+	 * this after reading, because even though the segment was present when we
+	 * opened it, it might get recycled or removed while we read it. The
+	 * read() succeeds in that case, but the data we tried to read might
+	 * already have been overwritten with new WAL records.
+	 */
+	XLByteToSeg(targetPagePtr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
 
 	return count;
 }
@@ -2361,188 +2384,69 @@ WalSndKill(int code, Datum arg)
 }
 
 /*
- * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
- *
- * Will open, and keep open, one WAL segment stored in the global file
- * descriptor sendFile. This means if XLogRead is used once, there will
- * always be one descriptor left open until the process ends, but never
- * more than one.
+ * Callback for XLogRead() to open the next segment.
  */
-static void
-XLogRead(WALSegmentContext *segcxt, char *buf, XLogRecPtr startptr, Size count)
+int
+WalSndSegmentOpen(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				  TimeLineID *tli_p)
 {
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
-	XLogSegNo	segno;
+	char		path[MAXPGPATH];
+	int			fd;
 
-retry:
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
+	/*-------
+	 * When reading from a historic timeline, and there is a timeline switch
+	 * within this segment, read from the WAL segment belonging to the new
+	 * timeline.
+	 *
+	 * For example, imagine that this server is currently on timeline 5, and
+	 * we're streaming timeline 4. The switch from timeline 4 to 5 happened at
+	 * 0/13002088. In pg_wal, we have these files:
+	 *
+	 * ...
+	 * 000000040000000000000012
+	 * 000000040000000000000013
+	 * 000000050000000000000013
+	 * 000000050000000000000014
+	 * ...
+	 *
+	 * In this situation, when requested to send the WAL from segment 0x13, on
+	 * timeline 4, we read the WAL from file 000000050000000000000013. Archive
+	 * recovery prefers files from newer timelines, so if the segment was
+	 * restored from the archive on this server, the file belonging to the old
+	 * timeline, 000000040000000000000013, might not exist. Their contents are
+	 * equal up to the switchpoint, because at a timeline switch, the used
+	 * portion of the old segment is copied to the new file.  -------
+	 */
+	*tli_p = sendTimeLine;
+	if (sendTimeLineIsHistoric)
 	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
+		XLogSegNo	endSegNo;
 
-		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);
-
-		if (sendSeg->ws_file < 0 ||
-			!XLByteInSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize))
-		{
-			char		path[MAXPGPATH];
-
-			/* Switch to another logfile segment */
-			if (sendSeg->ws_file >= 0)
-				close(sendSeg->ws_file);
-
-			XLByteToSeg(recptr, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			/*-------
-			 * When reading from a historic timeline, and there is a timeline
-			 * switch within this segment, read from the WAL segment belonging
-			 * to the new timeline.
-			 *
-			 * For example, imagine that this server is currently on timeline
-			 * 5, and we're streaming timeline 4. The switch from timeline 4
-			 * to 5 happened at 0/13002088. In pg_wal, we have these files:
-			 *
-			 * ...
-			 * 000000040000000000000012
-			 * 000000040000000000000013
-			 * 000000050000000000000013
-			 * 000000050000000000000014
-			 * ...
-			 *
-			 * In this situation, when requested to send the WAL from
-			 * segment 0x13, on timeline 4, we read the WAL from file
-			 * 000000050000000000000013. Archive recovery prefers files from
-			 * newer timelines, so if the segment was restored from the
-			 * archive on this server, the file belonging to the old timeline,
-			 * 000000040000000000000013, might not exist. Their contents are
-			 * equal up to the switchpoint, because at a timeline switch, the
-			 * used portion of the old segment is copied to the new file.
-			 *-------
-			 */
-			sendSeg->ws_tli = sendTimeLine;
-			if (sendTimeLineIsHistoric)
-			{
-				XLogSegNo	endSegNo;
-
-				XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
-				if (sendSeg->ws_segno == endSegNo)
-					sendSeg->ws_tli = sendTimeLineNextTLI;
-			}
-
-			XLogFilePath(path, sendSeg->ws_tli, sendSeg->ws_segno, segcxt->ws_segsize);
-
-			sendSeg->ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-			if (sendSeg->ws_file < 0)
-			{
-				/*
-				 * If the file is not found, assume it's because the standby
-				 * asked for a too old WAL segment that has already been
-				 * removed or recycled.
-				 */
-				if (errno == ENOENT)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("requested WAL segment %s has already been removed",
-									XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno))));
-				else
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\": %m",
-									path)));
-			}
-			sendSeg->ws_off = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendSeg->ws_off != startoff)
-		{
-			if (lseek(sendSeg->ws_file, (off_t) startoff, SEEK_SET) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-								startoff)));
-			sendSeg->ws_off = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (segcxt->ws_segsize - startoff))
-			segbytes = segcxt->ws_segsize - startoff;
-		else
-			segbytes = nbytes;
-
-		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-		readbytes = read(sendSeg->ws_file, p, segbytes);
-		pgstat_report_wait_end();
-		if (readbytes < 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u, length %zu: %m",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, (Size) segbytes)));
-		}
-		else if (readbytes == 0)
-		{
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							XLogFileNameP(sendSeg->ws_tli, sendSeg->ws_segno),
-							sendSeg->ws_off, readbytes, (Size) segbytes)));
-		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendSeg->ws_off += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
+		XLByteToSeg(sendTimeLineValidUpto, endSegNo, segcxt->ws_segsize);
+		if (sendSeg->ws_segno == endSegNo)
+			*tli_p = sendTimeLineNextTLI;
 	}
 
-	/*
-	 * After reading into the buffer, check that what we read was valid. We do
-	 * this after reading, because even though the segment was present when we
-	 * opened it, it might get recycled or removed while we read it. The
-	 * read() succeeds in that case, but the data we tried to read might
-	 * already have been overwritten with new WAL records.
-	 */
-	XLByteToSeg(startptr, segno, segcxt->ws_segsize);
-	CheckXLogRemoved(segno, ThisTimeLineID);
+	XLogFilePath(path, *tli_p, nextSegNo, segcxt->ws_segsize);
+	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (fd >= 0)
+		return fd;
 
 	/*
-	 * During recovery, the currently-open WAL file might be replaced with the
-	 * file of the same name retrieved from archive. So we always need to
-	 * check what we read was valid after reading into the buffer. If it's
-	 * invalid, we try to open and read the file again.
+	 * If the file is not found, assume it's because the standby asked for a
+	 * too old WAL segment that has already been removed or recycled.
 	 */
-	if (am_cascading_walsender)
-	{
-		WalSnd	   *walsnd = MyWalSnd;
-		bool		reload;
-
-		SpinLockAcquire(&walsnd->mutex);
-		reload = walsnd->needreload;
-		walsnd->needreload = false;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (reload && sendSeg->ws_file >= 0)
-		{
-			close(sendSeg->ws_file);
-			sendSeg->ws_file = -1;
-
-			goto retry;
-		}
-	}
+	if (errno == ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("requested WAL segment %s has already been removed",
+						XLogFileNameP(*tli_p, nextSegNo))));
+	else
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+	return -1;					/* keep compiler quiet */
 }
 
 /*
@@ -2562,6 +2466,8 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	XLogSegNo	segno;
+	WALReadError errinfo;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2777,7 +2683,49 @@ XLogSendPhysical(void)
 	 * calls.
 	 */
 	enlargeStringInfo(&output_message, nbytes);
-	XLogRead(sendCxt, &output_message.data[output_message.len], startptr, nbytes);
+
+retry:
+	if (!WALRead(&output_message.data[output_message.len],
+				 startptr,
+				 nbytes,
+				 sendSeg->ws_tli,	/* Pass the current TLI because only
+									 * WalSndSegmentOpen controls whether new
+									 * TLI is needed. */
+				 sendSeg,
+				 sendCxt,
+				 WalSndSegmentOpen,
+				 &errinfo))
+		WALReadRaiseError(&errinfo);
+
+	/* See logical_read_xlog_page(). */
+	XLByteToSeg(startptr, segno, sendCxt->ws_segsize);
+	CheckXLogRemoved(segno, sendSeg->ws_tli);
+
+	/*
+	 * During recovery, the currently-open WAL file might be replaced with the
+	 * file of the same name retrieved from archive. So we always need to
+	 * check what we read was valid after reading into the buffer. If it's
+	 * invalid, we try to open and read the file again.
+	 */
+	if (am_cascading_walsender)
+	{
+		WalSnd	   *walsnd = MyWalSnd;
+		bool		reload;
+
+		SpinLockAcquire(&walsnd->mutex);
+		reload = walsnd->needreload;
+		walsnd->needreload = false;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (reload && sendSeg->ws_file >= 0)
+		{
+			close(sendSeg->ws_file);
+			sendSeg->ws_file = -1;
+
+			goto retry;
+		}
+	}
+
 	output_message.len += nbytes;
 	output_message.data[output_message.len] = '\0';
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 1524e5eb1e..d5eb2f42f5 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -280,137 +280,56 @@ identify_target_directory(char *directory, char *fname)
 	return NULL;				/* not reached */
 }
 
-/*
- * Read count bytes from a segment file in the specified directory, for the
- * given timeline, containing the specified record pointer; store the data in
- * the passed buffer.
- */
-static void
-XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
-				 XLogRecPtr startptr, char *buf, Size count)
+static int
+WALDumpOpenSegment(XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+				   TimeLineID *tli_p)
 {
-	char	   *p;
-	XLogRecPtr	recptr;
-	Size		nbytes;
+	TimeLineID	tli = *tli_p;
+	char		fname[MAXPGPATH];
+	int			fd;
+	int			tries;
 
-	static int	sendFile = -1;
-	static XLogSegNo sendSegNo = 0;
-	static uint32 sendOff = 0;
+	XLogFileName(fname, tli, nextSegNo, segcxt->ws_segsize);
 
-	p = buf;
-	recptr = startptr;
-	nbytes = count;
-
-	while (nbytes > 0)
+	/*
+	 * In follow mode there is a short period of time after the server has
+	 * written the end of the previous file before the new file is available.
+	 * So we loop for 5 seconds looking for the file to appear before giving
+	 * up.
+	 */
+	for (tries = 0; tries < 10; tries++)
 	{
-		uint32		startoff;
-		int			segbytes;
-		int			readbytes;
-
-		startoff = XLogSegmentOffset(recptr, WalSegSz);
-
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo, WalSegSz))
+		fd = open_file_in_directory(segcxt->ws_dir, fname);
+		if (fd >= 0)
+			return fd;
+		if (errno == ENOENT)
 		{
-			char		fname[MAXFNAMELEN];
-			int			tries;
-
-			/* Switch to another logfile segment */
-			if (sendFile >= 0)
-				close(sendFile);
-
-			XLByteToSeg(recptr, sendSegNo, WalSegSz);
-
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-			/*
-			 * In follow mode there is a short period of time after the server
-			 * has written the end of the previous file before the new file is
-			 * available. So we loop for 5 seconds looking for the file to
-			 * appear before giving up.
-			 */
-			for (tries = 0; tries < 10; tries++)
-			{
-				sendFile = open_file_in_directory(directory, fname);
-				if (sendFile >= 0)
-					break;
-				if (errno == ENOENT)
-				{
-					int			save_errno = errno;
-
-					/* File not there yet, try again */
-					pg_usleep(500 * 1000);
-
-					errno = save_errno;
-					continue;
-				}
-				/* Any other error, fall through and fail */
-				break;
-			}
-
-			if (sendFile < 0)
-				fatal_error("could not find file \"%s\": %s",
-							fname, strerror(errno));
-			sendOff = 0;
-		}
-
-		/* Need to seek in the file? */
-		if (sendOff != startoff)
-		{
-			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
-			{
-				int			err = errno;
-				char		fname[MAXPGPATH];
-
-				XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
-
-				fatal_error("could not seek in log file %s to offset %u: %s",
-							fname, startoff, strerror(err));
-			}
-			sendOff = startoff;
-		}
-
-		/* How many bytes are within this segment? */
-		if (nbytes > (WalSegSz - startoff))
-			segbytes = WalSegSz - startoff;
-		else
-			segbytes = nbytes;
-
-		readbytes = read(sendFile, p, segbytes);
-		if (readbytes <= 0)
-		{
-			int			err = errno;
-			char		fname[MAXPGPATH];
 			int			save_errno = errno;
 
-			XLogFileName(fname, timeline_id, sendSegNo, WalSegSz);
+			/* File not there yet, try again */
+			pg_usleep(500 * 1000);
+
 			errno = save_errno;
-
-			if (readbytes < 0)
-				fatal_error("could not read from log file %s, offset %u, length %d: %s",
-							fname, sendOff, segbytes, strerror(err));
-			else if (readbytes == 0)
-				fatal_error("could not read from log file %s, offset %u: read %d of %zu",
-							fname, sendOff, readbytes, (Size) segbytes);
+			continue;
 		}
-
-		/* Update state for read */
-		recptr += readbytes;
-
-		sendOff += readbytes;
-		nbytes -= readbytes;
-		p += readbytes;
+		/* Any other error, fall through and fail */
+		break;
 	}
+
+	fatal_error("could not find file \"%s\": %s", fname, strerror(errno));
+	return -1;					/* keep compiler quiet */
 }
 
 /*
  * XLogReader read_page callback
  */
 static int
-XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				 XLogRecPtr targetPtr, char *readBuff)
+WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
+				XLogRecPtr targetPtr, char *readBuff)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
+	WALReadError errinfo;
 
 	if (private->endptr != InvalidXLogRecPtr)
 	{
@@ -425,8 +344,23 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	XLogDumpXLogRead(state->segcxt.ws_dir, private->timeline, targetPagePtr,
-					 readBuff, count);
+	if (!WALRead(readBuff, targetPagePtr, count, private->timeline,
+				 &state->seg, &state->segcxt, WALDumpOpenSegment, &errinfo))
+	{
+		WALOpenSegment *seg = &errinfo.wre_seg;
+		char		fname[MAXPGPATH];
+
+		XLogFileName(fname, seg->ws_tli, seg->ws_segno,
+					 state->segcxt.ws_segsize);
+
+		if (errinfo.wre_errno != 0)
+			fatal_error("could not read in file %s, offset %u, length %zu: %s",
+						fname, errinfo.wre_off, (Size) errinfo.wre_req,
+						strerror(errinfo.wre_errno));
+		else
+			fatal_error("could not read in file %s, offset %u: length: %zu",
+						fname, errinfo.wre_off, (Size) errinfo.wre_req);
+	}
 
 	return count;
 }
@@ -1089,7 +1023,7 @@ main(int argc, char **argv)
 	/* done with argument parsing, do the actual work */
 
 	/* we have everything we need, start reading */
-	xlogreader_state = XLogReaderAllocate(WalSegSz, waldir, XLogDumpReadPage,
+	xlogreader_state = XLogReaderAllocate(WalSegSz, waldir, WALDumpReadPage,
 										  &private);
 	if (!xlogreader_state)
 		fatal_error("out of memory");
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1bbee386e8..0193611b7f 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -36,7 +36,6 @@ typedef struct WALOpenSegment
 {
 	int			ws_file;		/* segment file descriptor */
 	XLogSegNo	ws_segno;		/* segment number */
-	uint32		ws_off;			/* offset in the segment */
 	TimeLineID	ws_tli;			/* timeline ID of the currently open file */
 } WALOpenSegment;
 
@@ -168,6 +167,7 @@ struct XLogReaderState
 	/* last read XLOG position for data currently in readBuf */
 	WALSegmentContext segcxt;
 	WALOpenSegment seg;
+	uint32		segoff;
 
 	/*
 	 * beginning of prior page read, and its TLI.  Doesn't necessarily
@@ -217,6 +217,24 @@ extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/*
+ * Callback to open the specified WAL segment for reading.  Returns a valid
+ * file descriptor when the file was opened successfully.
+ *
+ * "nextSegNo" is the number of the segment to be opened.
+ *
+ * "segcxt" is additional information about the segment.
+ *
+ * "tli_p" is an input/output argument. XLogRead() uses it to pass the
+ * timeline in which the new segment should be found, but the callback can use
+ * it to return the TLI that it actually opened.
+ *
+ * BasicOpenFile() is the preferred way to open the segment file in backend
+ * code, whereas open(2) should be used in frontend.
+ */
+typedef int (*WALSegmentOpen) (XLogSegNo nextSegNo, WALSegmentContext *segcxt,
+							   TimeLineID *tli_p);
+
 /* Initialize supporting structures */
 extern void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 							   int segsize, const char *waldir);
@@ -232,6 +250,25 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
+
+/*
+ * Error information from WALRead that both backend and frontend caller can
+ * process.  Currently only errors from pg_pread can be reported.
+ */
+typedef struct WALReadError
+{
+	int			wre_errno;		/* errno set by the last pg_pread() */
+	int			wre_off;		/* Offset we tried to read from. */
+	int			wre_req;		/* Bytes requested to be read. */
+	int			wre_read;		/* Bytes read by the last read(). */
+	WALOpenSegment wre_seg;		/* Segment we tried to read from. */
+} WALReadError;
+
+extern bool WALRead(char *buf, XLogRecPtr startptr, Size count,
+					TimeLineID tli, WALOpenSegment *seg,
+					WALSegmentContext *segcxt, WALSegmentOpen openSegment,
+					WALReadError *errinfo);
+
 /* Functions for decoding an XLogRecord */
 
 extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 2df98e45b2..0572b24192 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -54,4 +54,6 @@ extern int	read_local_xlog_page(XLogReaderState *state,
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
 									  XLogRecPtr wantPage, uint32 wantLength);
 
+extern void WALReadRaiseError(WALReadError *errinfo);
+
 #endif
-- 
2.20.1

#58

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Alvaro Herrera (#57)

Re: Attempt to consolidate reading of XLOG page

On Fri, Nov 22, 2019 at 07:56:32PM -0300, Alvaro Herrera wrote:

I see no reason to leave ws_off. We can move that to XLogReaderState; I
did that here. We also need the offset in WALReadError, though, so I
added it there too. Conceptually it seems clearer to me this way.

Yeah, that seems cleaner.

What do you think of the attached?

Looks rather fine to me.

BTW I'm not clear what errors can pread()/pg_pread() report that do not
set errno. I think lines 1083/1084 of WALRead are spurious now.

Because we have no guarantee that errno will be cleared if you do a
partial read where errno is not set, so you may finish by reporting
the state of a previous failed read instead of the partially-failed
one depending on how WALReadError is treated? In short, I don't see
any actual reason why it would be good to remove the reset of errno
either before the calls to pread and pwrite().
--
Michael

#59

Antonin Houska

ah@cybertec.at

about 6 years ago

In reply to: Alvaro Herrera (#57)

1 attachment(s)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Nov-22, Antonin Houska wrote:

As I pointed out in

/messages/by-id/88183.1574261429@antos

seg.ws_off only replaced readOff in XLogReaderState. So we should only update
ws_off where readOff was updated before commit 709d003. This does happen in
ReadPageInternal (see HEAD) and I see no reason for the final patch to update
ws_off anywhere else.

Oh you're right.

I see no reason to leave ws_off. We can move that to XLogReaderState; I
did that here. We also need the offset in WALReadError, though, so I
added it there too. Conceptually it seems clearer to me this way.

What do you think of the attached?

It looks good to me. Attached is just a fix of a minor problem in error
reporting that Michael pointed out earlier.

BTW I'm not clear what errors can pread()/pg_pread() report that do not
set errno. I think lines 1083/1084 of WALRead are spurious now.

All I can say is that the existing calls of pg_pread() do not clear errno, so
you may be right. I'd appreciate more background about the "partial read" that
Michael mentions here:

/messages/by-id/20191125033048.GG37821@paquier.xyz

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

wal_read_errno.patchtext/x-diffDownload

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 04124bc254..eda81c1df1 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -354,9 +354,11 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 					 state->segcxt.ws_segsize);
 
 		if (errinfo.wre_errno != 0)
-			fatal_error("could not read in file %s, offset %u, length %zu: %s",
-						fname, errinfo.wre_off, (Size) errinfo.wre_req,
-						strerror(errinfo.wre_errno));
+		{
+			errno = errinfo.wre_errno;
+			fatal_error("could not read in file %s, offset %u, length %zu: %m",
+						fname, errinfo.wre_off, (Size) errinfo.wre_req);
+		}
 		else
 			fatal_error("could not read in file %s, offset %u: length: %zu",
 						fname, errinfo.wre_off, (Size) errinfo.wre_req);

#60

Alvaro Herrera

alvherre@2ndquadrant.com

about 6 years ago

In reply to: Antonin Houska (#59)

Re: Attempt to consolidate reading of XLOG page

On 2019-Nov-25, Antonin Houska wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I see no reason to leave ws_off. We can move that to XLogReaderState; I
did that here. We also need the offset in WALReadError, though, so I
added it there too. Conceptually it seems clearer to me this way.

What do you think of the attached?

It looks good to me. Attached is just a fix of a minor problem in error
reporting that Michael pointed out earlier.

Excellent, I pushed it with this change included and some other cosmetic
changes.

Now there's only XLogPageRead() ...

BTW I'm not clear what errors can pread()/pg_pread() report that do not
set errno. I think lines 1083/1084 of WALRead are spurious now.

All I can say is that the existing calls of pg_pread() do not clear errno, so
you may be right.

Right ... in this interface, we only report an error if pg_pread()
returns negative, which is documented to always set errno.

I'd appreciate more background about the "partial read" that
Michael mentions here:

/messages/by-id/20191125033048.GG37821@paquier.xyz

In the current implementation, if pg_pread() does a partial read, we
just loop one more time.

I considered changing the "if (readbytes <= 0)" with "if (readbytes <
segbytes)", but that seemed pointless.

However, writing this now makes me think that we should add a
CHECK_FOR_INTERRUPTS in this loop. (I also wonder if we shouldn't limit
the number of times we retry if pg_pread returns zero (i.e. no error,
but no bytes read either). I don't know if this is a real-world
consideration.)

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#61

Antonin Houska

ah@cybertec.at

about 6 years ago

In reply to: Alvaro Herrera (#60)

Re: Attempt to consolidate reading of XLOG page

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Nov-25, Antonin Houska wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I see no reason to leave ws_off. We can move that to XLogReaderState; I
did that here. We also need the offset in WALReadError, though, so I
added it there too. Conceptually it seems clearer to me this way.

What do you think of the attached?

It looks good to me. Attached is just a fix of a minor problem in error
reporting that Michael pointed out earlier.

Excellent, I pushed it with this change included and some other cosmetic
changes.

Thanks!

Now there's only XLogPageRead() ...

Hm, this seems rather specific, not sure it's worth trying to use WALRead()
here. Anyway, I notice that it uses pg_read() too.

I'd appreciate more background about the "partial read" that
Michael mentions here:

/messages/by-id/20191125033048.GG37821@paquier.xyz

In the current implementation, if pg_pread() does a partial read, we
just loop one more time.

I considered changing the "if (readbytes <= 0)" with "if (readbytes <
segbytes)", but that seemed pointless.

In the pread() documentation I see "Upon reading end-of-file, zero is
returned." but that does not tell whether zero can be returned without
reaching EOF. However XLogPageRead() handles zero as an error, so WALRead() is
consistent with that.

However, writing this now makes me think that we should add a
CHECK_FOR_INTERRUPTS in this loop. (I also wonder if we shouldn't limit
the number of times we retry if pg_pread returns zero (i.e. no error,
but no bytes read either). I don't know if this is a real-world
consideration.)

If statement above is correct, then we shouldn't need this.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#62

Alvaro Herrera

alvherre@2ndquadrant.com

almost 6 years ago

In reply to: Antonin Houska (#47)

Re: Attempt to consolidate reading of XLOG page

On 2019-Nov-20, Antonin Houska wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

What is logical_read_local_xlog_page all about? Seems useless. Let's
get rid of it.

It seems so. Should I post a patch for that?

No need .. it was simple enough. Just pushed it.

Thanks

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services