WALWriter active during recovery

Started by Simon Riggs about 11 years ago, 16 messages

simon@2ndQuadrant.com

about 11 years ago

1 attachment(s)

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

The following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to get on with other
useful work.

At present this is a WIP patch, for code comments only. Don't bother
with anything other than code questions at this stage.

Implementation questions are

* How should we wake WALReceiver, since it waits on a poll()? Should
we use SIGUSR1, which is already used for latch waits, or another
signal?

* Should we introduce some pacing delays if the WALReceiver gets too
far ahead of apply?

* Other questions you may have?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

walwriter_active_in_recovery.v1.patch (application/octet-stream)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0f09add..0f18931 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -170,9 +170,42 @@ HotStandbyState standbyState = STANDBY_DISABLED;
 
 static XLogRecPtr LastRec;
 
-/* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
-static TimeLineID receiveTLI = 0;
+/* Local copy of WalRcv->flushedUpto */
+static XLogRecPtr flushedUpto = 0;
+static TimeLineID flushedTLI = 0;
+
+/* NOTES FOR REVIEWERS:
+ *
+ * Make WALWriter responsible for flushing received WAL data during recovery.
+ * At shutdown, WALReceiver flushes WAL just as it did before. WALReceiver
+ * now spends less time waiting and more time communicating, which should
+ * improve replication performance overall.
+ *
+ * New possibility of being swamped by incoming data that cannot be flushed
+ * fast enough; that is not addressed in current patch, but some form of
+ * safety valve would be required.
+ *
+ * WALWriter and WALReceiver now need to cooperate via shmem. Startup proc
+ * is unaffected, it just reads the current position, so it doesn't matter
+ * who updates it.
+ *
+ * Previously the WALReceiver was responsible for both writing and flushing
+ * WAL files during recovery. As a result some of the variable names were
+ * called "received" when in fact we meant "flushed", because it was at
+ * that time the same thing.
+ *
+ * WALReceiver now sends status messages earlier than it did before, making
+ * synchronous_commit = remote_write a more useful option. Some pacing is
+ * also required, since status messages could be sent more frequently than
+ * before.
+ *
+ * Attention needs to be paid to points where no further data is sent, since
+ * the WALReceiver must write, then wait for flush and apply. Both
+ * WalWriter and Startup process must wake the WALReceiver; not checked yet.
+ *
+ * WALWriter needs to be started during recovery, but no other behaviour
+ * changes at the postmaster level, since we do not depend upon it.
+ */
 
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
@@ -8278,7 +8311,7 @@ CreateRestartPoint(int flags)
 		 * Get the current end of xlog replayed or received, whichever is
 		 * later.
 		 */
-		receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+		receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
 		replayPtr = GetXLogReplayRecPtr(&replayTLI);
 		endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
 
@@ -10173,7 +10206,7 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 receivedUpto < targetPagePtr + reqLen))
+		 flushedUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
@@ -10204,10 +10237,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = receivedUpto % XLogSegSize - targetPageOff;
+			readLen = flushedUpto % XLogSegSize - targetPageOff;
 	}
 	else
 		readLen = XLOG_BLCKSZ;
@@ -10387,7 +10420,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						curFileTLI = tli;
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName);
-						receivedUpto = 0;
+						flushedUpto = 0;
 					}
 
 					/*
@@ -10535,14 +10568,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
 					 */
-					if (RecPtr < receivedUpto)
+					if (RecPtr < flushedUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
+						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &flushedTLI);
+						if (RecPtr < flushedUpto && flushedTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
@@ -10568,9 +10601,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						if (readFile < 0)
 						{
 							if (!expectedTLEs)
-								expectedTLEs = readTimeLineHistory(receiveTLI);
+								expectedTLEs = readTimeLineHistory(flushedTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
+													flushedTLI,
 													XLOG_FROM_STREAM, false);
 							Assert(readFile >= 0);
 						}
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 133143d..216da59 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -216,7 +216,7 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
 {
 	XLogRecPtr	recptr;
 
-	recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+	recptr = GetWalRcvFlushRecPtr(NULL, NULL);
 
 	if (recptr == 0)
 		PG_RETURN_NULL();
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5106f52..a21a4f2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1590,7 +1590,8 @@ ServerLoop(void)
 		/*
 		 * If no background writer process is running, and we are not in a
 		 * state that prevents it, start one.  It doesn't matter if this
-		 * fails, we'll just try again later.  Likewise for the checkpointer.
+		 * fails, we'll just try again later.  Likewise for the checkpointer
+		 * and walwriter.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 			pmState == PM_HOT_STANDBY)
@@ -1599,17 +1600,11 @@ ServerLoop(void)
 				CheckpointerPID = StartCheckpointer();
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (WalWriterPID == 0)
+				WalWriterPID = StartWalWriter();
 		}
 
 		/*
-		 * Likewise, if we have lost the walwriter process, try to start a new
-		 * one.  But this is needed only in normal operation (else we cannot
-		 * be writing any new WAL).
-		 */
-		if (WalWriterPID == 0 && pmState == PM_RUN)
-			WalWriterPID = StartWalWriter();
-
-		/*
 		 * If we have lost the autovacuum launcher, try to start a new one. We
 		 * don't want autovacuum to run in binary upgrade mode because
 		 * autovacuum might update relfrozenxid for empty tables before the
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 0826f88..6c75c94 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -50,6 +50,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "postmaster/walwriter.h"
+#include "replication/walreceiver.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -248,6 +249,7 @@ WalWriterMain(void)
 	{
 		long		cur_timeout;
 		int			rc;
+		bool		work_done;
 
 		/*
 		 * Advertise whether we might hibernate in this cycle.  We do this
@@ -285,7 +287,12 @@ WalWriterMain(void)
 		 * Do what we're here for; then, if XLogBackgroundFlush() found useful
 		 * work to do, reset hibernation counter.
 		 */
-		if (XLogBackgroundFlush())
+		if (RecoveryInProgress())
+			work_done = XLogFlushReceived();
+		else
+			work_done = XLogBackgroundFlush();
+
+		if (work_done)
 			left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
 		else if (left_till_hibernate > 0)
 			left_till_hibernate--;
@@ -313,7 +320,6 @@ WalWriterMain(void)
 	}
 }
 
-
 /* --------------------------------
  *		signal handler routines
  * --------------------------------
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c2d4ed3..a668c89 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -12,7 +12,7 @@
  * in the primary server), and then keeps receiving XLOG records and
  * writing them to the disk as long as the connection is alive. As XLOG
  * records are received and flushed to disk, it updates the
- * WalRcv->receivedUpto variable in shared memory, to inform the startup
+ * WalRcv->flushedUpto variable in shared memory, to inform the startup
  * process of how far it can proceed with XLOG replay.
  *
  * If the primary server ends streaming, but doesn't disconnect, walreceiver
@@ -56,6 +56,7 @@
 #include "replication/walsender.h"
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
@@ -136,6 +137,7 @@ static void WalRcvFetchTimeLineHistoryFiles(TimeLineID first, TimeLineID last);
 static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI);
 static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
+static void WalRcvUpdateProcessTitle(void);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
@@ -546,6 +548,8 @@ WalReceiverMain(void)
 						 errmsg("could not close log segment %s: %m",
 								XLogFileNameP(recvFileTLI, recvSegNo))));
 
+			WalRcvUpdateProcessTitle();
+
 			/*
 			 * Create .done file forcibly to prevent the streamed segment from
 			 * being archived later.
@@ -903,6 +907,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 							 errmsg("could not close log segment %s: %m",
 									XLogFileNameP(recvFileTLI, recvSegNo))));
 
+				WalRcvUpdateProcessTitle();
+
 				/*
 				 * Create .done file forcibly to prevent the streamed segment
 				 * from being archived later.
@@ -977,48 +983,131 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 static void
 XLogWalRcvFlush(bool dying)
 {
-	if (LogstreamResult.Flush < LogstreamResult.Write)
+	/* use volatile pointer to prevent code rearrangement */
+	volatile WalRcvData *walrcv = WalRcv;
+
+	if (!dying)
+	{
+		/* Update shared-memory status */
+		SpinLockAcquire(&walrcv->mutex);
+		if (walrcv->receivedUpto < LogstreamResult.Write)
+		{
+			walrcv->latestChunkStart = walrcv->receivedUpto;
+			walrcv->receivedUpto = LogstreamResult.Write;
+			walrcv->receivedTLI = ThisTimeLineID;
+		}
+		/*
+		 * Track the progress of the walwriter. We do this
+		 * every time we receive new data, to avoid an extra
+		 * spinlock cycle for each status message.
+		 */
+		LogstreamResult.Flush = walrcv->flushedUpto;
+		SpinLockRelease(&walrcv->mutex);
+
+		if (ProcGlobal->walwriterLatch)
+			SetLatch(ProcGlobal->walwriterLatch);
+
+		/* Also let the master know that we made some progress */
+		XLogWalRcvSendReply(false, false);
+		XLogWalRcvSendHSFeedback(false);
+	}
+	else if (LogstreamResult.Flush < LogstreamResult.Write)
 	{
-		/* use volatile pointer to prevent code rearrangement */
-		volatile WalRcvData *walrcv = WalRcv;
+		WalRcvUpdateProcessTitle();
 
+		/* Ignore the walwriter, just do it */
 		issue_xlog_fsync(recvFile, recvSegNo);
 
 		LogstreamResult.Flush = LogstreamResult.Write;
 
 		/* Update shared-memory status */
 		SpinLockAcquire(&walrcv->mutex);
-		if (walrcv->receivedUpto < LogstreamResult.Flush)
+		if (walrcv->flushedUpto < LogstreamResult.Flush)
 		{
-			walrcv->latestChunkStart = walrcv->receivedUpto;
-			walrcv->receivedUpto = LogstreamResult.Flush;
-			walrcv->receivedTLI = ThisTimeLineID;
+			walrcv->flushedUpto = LogstreamResult.Flush;
+			walrcv->flushedTLI = walrcv->receivedTLI;
 		}
 		SpinLockRelease(&walrcv->mutex);
 
+		/* Signal the walwriter process that new WAL has arrived */
 		/* Signal the startup process and walsender that new WAL has arrived */
 		WakeupRecovery();
 		if (AllowCascadeReplication())
 			WalSndWakeup();
+	}
+}
 
-		/* Report XLOG streaming progress in PS display */
-		if (update_process_title)
-		{
-			char		activitymsg[50];
+static void
+WalRcvUpdateProcessTitle(void)
+{
+	/* Report XLOG streaming progress in PS display */
+	if (update_process_title)
+	{
+		char		activitymsg[50];
 
-			snprintf(activitymsg, sizeof(activitymsg), "streaming %X/%X",
-					 (uint32) (LogstreamResult.Write >> 32),
-					 (uint32) LogstreamResult.Write);
-			set_ps_display(activitymsg, false);
-		}
+		snprintf(activitymsg, sizeof(activitymsg), "streaming %X/%X",
+				 (uint32) (LogstreamResult.Write >> 32),
+				 (uint32) LogstreamResult.Write);
+		set_ps_display(activitymsg, false);
+	}
+}
 
-		/* Also let the master know that we made some progress */
-		if (!dying)
-		{
-			XLogWalRcvSendReply(false, false);
-			XLogWalRcvSendHSFeedback(false);
-		}
+/*
+ * XLogFlushReceived() is executed only within walwriter
+ *
+ * Returns true if WAL file was flushed
+ */
+bool
+XLogFlushReceived(void)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile WalRcvData *walrcv = WalRcv;
+	bool	flush = false;
+
+	/* Update shared-memory status */
+	SpinLockAcquire(&walrcv->mutex);
+	if (walrcv->flushedUpto < walrcv->receivedUpto)
+	{
+		LogstreamResult.Write = walrcv->receivedUpto;
+		if (ThisTimeLineID != walrcv->receivedTLI)
+			ThisTimeLineID = recvFileTLI = walrcv->receivedTLI;
+		flush = true;
 	}
+	SpinLockRelease(&walrcv->mutex);
+
+	if (!flush)
+		return false;
+
+	/*
+	 * Open file, fsync it and then close it again.
+	 */
+	XLByteToPrevSeg(LogstreamResult.Write, recvSegNo);
+	recvFile = XLogFileOpen(recvSegNo);
+	issue_xlog_fsync(recvFile, recvSegNo);
+	if (close(recvFile) != 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not close log segment %s: %m",
+					XLogFileNameP(recvFileTLI, recvSegNo))));
+
+	LogstreamResult.Flush = LogstreamResult.Write;
+
+	/* Update shared-memory status */
+	SpinLockAcquire(&walrcv->mutex);
+	if (walrcv->flushedUpto < LogstreamResult.Flush)
+	{
+		walrcv->flushedUpto = LogstreamResult.Flush;
+		walrcv->flushedTLI = walrcv->receivedTLI;
+	}
+	SpinLockRelease(&walrcv->mutex);
+
+	/* Signal the walwriter process that new WAL has arrived */
+	/* Signal the startup process and walsender that new WAL has arrived */
+	WakeupRecovery();
+	if (AllowCascadeReplication())
+		WalSndWakeup();
+
+	return true;
 }
 
 /*
@@ -1040,8 +1129,8 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
 	XLogRecPtr	applyPtr;
-	static TimestampTz sendTime = 0;
-	TimestampTz now;
+	static TimestampTz idleTime = 0;
+	static bool idle = false;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1050,29 +1139,39 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	if (!force && wal_receiver_status_interval <= 0)
 		return;
 
-	/* Get current timestamp. */
-	now = GetCurrentTimestamp();
-
 	/*
-	 * We can compare the write and flush positions to the last message we
-	 * sent without taking any lock, but the apply position requires a spin
-	 * lock, so we don't check that unless something else has changed or 10
+	 * We need to grab a spinlock for flush position, but we can avoid
+	 * getting the current timestamp if we are going to send a message
+	 * anyway.  Getting the apply position requires a spin lock, so we
+	 * don't check that unless something else has changed or 10
 	 * seconds have passed.  This means that the apply log position will
 	 * appear, from the master's point of view, to lag slightly, but since
 	 * this is only for reporting purposes and only on idle systems, that's
 	 * probably OK.
 	 */
+	LogstreamResult.Flush = GetWalRcvFlushRecPtr(NULL, NULL);
 	if (!force
 		&& writePtr == LogstreamResult.Write
 		&& flushPtr == LogstreamResult.Flush
-		&& !TimestampDifferenceExceeds(sendTime, now,
-									   wal_receiver_status_interval * 1000))
+		&& !idle)
+	{
+		idleTime = GetCurrentTimestamp();
+		idle = true;
 		return;
-	sendTime = now;
+	}
+
+	if (idle)
+	{
+		TimestampTz now = GetCurrentTimestamp();
+		if (!TimestampDifferenceExceeds(idleTime, now,
+									   wal_receiver_status_interval * 1000))
+			return;
+		idle = false;
+	}
 
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
-	flushPtr = LogstreamResult.Flush;
+	flushPtr = LogstreamResult.Flush; /* updated above */
 	applyPtr = GetXLogReplayRecPtr(NULL);
 
 	resetStringInfo(&reply_message);
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 579216a..ed2eab6 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -268,12 +268,14 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	/*
 	 * If this is the first startup of walreceiver (on this timeline),
-	 * initialize receivedUpto and latestChunkStart to the starting point.
+	 * initialize all XLogRecPtrs to the starting point.
 	 */
 	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
 	{
 		walrcv->receivedUpto = recptr;
 		walrcv->receivedTLI = tli;
+		walrcv->flushedUpto = recptr;
+		walrcv->flushedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
 	walrcv->receiveStart = recptr;
@@ -296,14 +298,14 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * receiveTLI.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 {
 	/* use volatile pointer to prevent code rearrangement */
 	volatile WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
 
 	SpinLockAcquire(&walrcv->mutex);
-	recptr = walrcv->receivedUpto;
+	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
 	if (receiveTLI)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5937cbb..8684098 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2520,8 +2520,8 @@ GetStandbyFlushRecPtr(void)
 {
 	XLogRecPtr	replayPtr;
 	TimeLineID	replayTLI;
-	XLogRecPtr	receivePtr;
-	TimeLineID	receiveTLI;
+	XLogRecPtr	flushPtr;
+	TimeLineID	flushTLI;
 	XLogRecPtr	result;
 
 	/*
@@ -2530,14 +2530,14 @@ GetStandbyFlushRecPtr(void)
 	 * has streamed, but hasn't been replayed yet.
 	 */
 
-	receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+	flushPtr = GetWalRcvFlushRecPtr(NULL, &flushTLI);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 
 	ThisTimeLineID = replayTLI;
 
 	result = replayPtr;
-	if (receiveTLI == ThisTimeLineID && receivePtr > replayPtr)
-		result = receivePtr;
+	if (flushTLI == ThisTimeLineID && flushPtr > replayPtr)
+		result = flushPtr;
 
 	return result;
 }
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 7a249f1..ff3312d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -79,6 +79,16 @@ typedef struct
 	TimeLineID	receivedTLI;
 
 	/*
+	 * flushedUpto-1 is the last byte position that has already been
+	 * flushed, and flushedTLI is the timeline it came from.  At the first
+	 * startup of walreceiver, these are set to receiveStart and
+	 * receiveStartTLI. After that, these are updated whenever a process
+	 * flushes the received WAL to disk.
+	 */
+	XLogRecPtr	flushedUpto;
+	TimeLineID	flushedTLI;
+
+	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
 	 * receivedUpto before the last flush to disk.  Startup process can use
@@ -148,6 +158,7 @@ extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
 
 /* prototypes for functions in walreceiver.c */
 extern void WalReceiverMain(void) __attribute__((noreturn));
+extern bool XLogFlushReceived(void);
 
 /* prototypes for functions in walreceiverfuncs.c */
 extern Size WalRcvShmemSize(void);
@@ -157,7 +168,7 @@ extern bool WalRcvStreaming(void);
 extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 					 const char *conninfo, const char *slotname);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4ad4164..b5fba3a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -215,11 +215,10 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, checkpointer and WAL writer run always.
+ * Startup process and WAL receiver also consume 2 slots during recovery.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */

Andres Freund

andres@2ndquadrant.com

about 11 years ago

In reply to: Simon Riggs (#1)

Re: WALWriter active during recovery

Hi,

On 2014-12-15 18:51:44 +0000, Simon Riggs wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Well, it can still buffer data on the network level, but there are
definitely limits to that. So I can see this as being useful.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

At present this is a WIP patch, for code comments only. Don't bother
with anything other than code questions at this stage.

Implementation questions are

* How should we wake WALReceiver, since it waits on a poll(). Should
we use SIGUSR1, which is already used for latch waits, or another
signal?

It's not entirely trivial, but also not hard, to make it use the latch
code for waiting. It'd probably end up requiring less code because then
we could just scratch libpqwalreceiver.c:libpq_select().
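
As a rough illustration of latch-style waiting, here is a minimal sketch of the self-pipe technique: the waiter polls both its network socket and the read end of a pipe, and any other process wakes it by writing one byte. All names (DemoLatch, demo_wait and so on) are hypothetical and this is not PostgreSQL's latch API; it only shows why a single poll() can cover both "data arrived" and "someone wants me awake".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical names throughout; this is not PostgreSQL's latch API. */
#define DEMO_WAKE_LATCH		1	/* another process asked us to wake up */
#define DEMO_WAKE_SOCKET	2	/* the network socket is readable */

typedef struct DemoLatch
{
	int			read_fd;		/* waiter polls this end */
	int			write_fd;		/* wakers write one byte here */
} DemoLatch;

/* Create the self-pipe; both ends non-blocking so set/reset never stall. */
int
demo_latch_init(DemoLatch *latch)
{
	int			fds[2];

	assert(latch != NULL);
	if (pipe(fds) != 0)
		return -1;
	if (fcntl(fds[0], F_SETFL, O_NONBLOCK) != 0 ||
		fcntl(fds[1], F_SETFL, O_NONBLOCK) != 0)
	{
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	latch->read_fd = fds[0];
	latch->write_fd = fds[1];
	return 0;
}

/* Wake the waiter.  A full pipe (EAGAIN) means a wakeup is already pending. */
void
demo_latch_set(DemoLatch *latch)
{
	char		byte = 0;
	ssize_t		rc;

	do
	{
		rc = write(latch->write_fd, &byte, 1);
	} while (rc < 0 && errno == EINTR);
}

/* Drain pending wakeups; call this before re-checking shared state. */
void
demo_latch_reset(DemoLatch *latch)
{
	char		buf[64];

	while (read(latch->read_fd, buf, sizeof(buf)) > 0)
		;
}

/* Wait for the latch, the socket (if sock >= 0), or the timeout. */
int
demo_wait(DemoLatch *latch, int sock, int timeout_ms)
{
	struct pollfd fds[2];
	nfds_t		nfds = 1;
	int			rc;
	int			result = 0;

	fds[0].fd = latch->read_fd;
	fds[0].events = POLLIN;
	fds[0].revents = 0;
	if (sock >= 0)
	{
		fds[1].fd = sock;
		fds[1].events = POLLIN;
		fds[1].revents = 0;
		nfds = 2;
	}

	do
	{
		rc = poll(fds, nfds, timeout_ms);
	} while (rc < 0 && errno == EINTR);
	if (rc < 0)
		return -1;

	if (fds[0].revents & POLLIN)
		result |= DEMO_WAKE_LATCH;
	if (nfds == 2 && (fds[1].revents & (POLLIN | POLLHUP | POLLERR)))
		result |= DEMO_WAKE_SOCKET;
	return result;
}
```

With something along these lines, the walwriter or startup process would call the equivalent of demo_latch_set() after making progress, and the receiver would no longer need a dedicated signal to leave its poll().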

* Should we introduce some pacing delays if the WALreceiver gets too
far ahead of apply?

Hm. Why don't we simply start fsyncing in the receiver itself at regular
intervals? If already synced that's cheap, if not, it'll pace us.
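
A minimal sketch of that pacing idea, under the assumption that "regular intervals" means a byte threshold (a time-based interval would work the same way). PacedWriter and paced_write() are hypothetical names, not anything in the patch or in PostgreSQL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper state; none of these names exist in PostgreSQL. */
typedef struct PacedWriter
{
	int			fd;				/* file being appended to */
	size_t		sync_every;		/* fsync once this many bytes are unsynced */
	size_t		unsynced;		/* bytes written since the last fsync */
	unsigned	nsyncs;			/* number of fsyncs issued so far */
} PacedWriter;

/* Write the whole buffer, then fsync if enough unsynced data has built up. */
int
paced_write(PacedWriter *pw, const char *buf, size_t len)
{
	assert(pw != NULL && pw->sync_every > 0);

	while (len > 0)
	{
		ssize_t		n = write(pw->fd, buf, len);

		if (n < 0)
		{
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t) n;
		pw->unsynced += (size_t) n;
	}

	if (pw->unsynced >= pw->sync_every)
	{
		/* Cheap if the data is already on disk; otherwise it slows us down. */
		if (fsync(pw->fd) != 0)
			return -1;
		pw->unsynced = 0;
		pw->nsyncs++;
	}
	return 0;
}

/* Demo: five 8192-byte writes with a 16384-byte threshold; returns the fsync count. */
int
paced_write_demo(void)
{
	char		path[] = "/tmp/paced_write_XXXXXX";
	char		block[8192];
	PacedWriter pw = {-1, 16384, 0, 0};
	int			i;

	memset(block, 'a', sizeof(block));
	pw.fd = mkstemp(path);
	if (pw.fd < 0)
		return -1;
	unlink(path);				/* file vanishes once the descriptor closes */

	for (i = 0; i < 5; i++)
	{
		if (paced_write(&pw, block, sizeof(block)) != 0)
		{
			close(pw.fd);
			return -1;
		}
	}
	close(pw.fd);
	return (int) pw.nsyncs;
}
```

The point of the design is that the receiver's own fsync acts as the safety valve: if another process has already flushed the data the call returns quickly, and if not, the receiver is held back until the disk catches up.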

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

hlinnakangas@vmware.com

about 11 years ago

In reply to: Simon Riggs (#1)

Re: WALWriter active during recovery

On 12/15/2014 08:51 PM, Simon Riggs wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

What other useful actions can WAL receiver do while it's waiting? It
doesn't do much else than receive WAL, and fsync it to disk.

- Heikki

Andres Freund

andres@2ndquadrant.com

about 11 years ago

In reply to: Heikki Linnakangas (#3)

Re: WALWriter active during recovery

On 2014-12-16 16:12:40 +0200, Heikki Linnakangas wrote:

On 12/15/2014 08:51 PM, Simon Riggs wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

What other useful actions can WAL receiver do while it's waiting? It doesn't
do much else than receive WAL, and fsync it to disk.

It can actually receive further data from the network and write it to
disk? On a relatively low latency network the buffers aren't that
large. Right now we generate quite a bursty IO pattern with the disks
alternating between idle and fully busy.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Simon Riggs

simon@2ndQuadrant.com

about 11 years ago

In reply to: Heikki Linnakangas (#3)

Re: WALWriter active during recovery

On 16 December 2014 at 14:12, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 12/15/2014 08:51 PM, Simon Riggs wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

What other useful actions can WAL receiver do while it's waiting? It doesn't
do much else than receive WAL, and fsync it to disk.

So now it will only need to do one of those two things.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

didier

did447@gmail.com

about 11 years ago

In reply to: Simon Riggs (#5)

Re: WALWriter active during recovery

Hi,

On Tue, Dec 16, 2014 at 6:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 16 December 2014 at 14:12, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 12/15/2014 08:51 PM, Simon Riggs wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

On many Linux systems it may not do that much (kernels 2.6.32 and 3.2 are
bad; 3.13 is better, but it still slows the fsync).

If there's a fsync in progress WALReceiver will:
1- slow the fsync because its writes to the same file are grabbed by the fsync
2- stall until the end of fsync.

From stracing a test program that simulates this pattern (two processes:
one writes to a file, the other fsyncs it):

20279 11:51:24.037108 fsync(5 <unfinished ...>
20278 11:51:24.053524 <... nanosleep resumed> NULL) = 0 <0.020281>
20278 11:51:24.053691 lseek(3, 1383612416, SEEK_SET) = 1383612416 <0.000119>
20278 11:51:24.053965 write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 8192) = 8192 <0.000111>
20278 11:51:24.054190 nanosleep({0, 20000000}, NULL) = 0 <0.020243>
....
20278 11:51:24.404386 lseek(3, 194772992, SEEK_SET <unfinished ...>
20279 11:51:24.754123 <... fsync resumed> ) = 0 <0.716971>
20279 11:51:24.754202 close(5 <unfinished ...>
20278 11:51:24.754232 <... lseek resumed> ) = 194772992 <0.349825>

Yes, that's a 350ms lseek...

What other useful actions can WAL receiver do while it's waiting? It doesn't
do much else than receive WAL, and fsync it to disk.

So now it will only need to do one of those two things.

Regards
Didier

Simon Riggs

simon@2ndQuadrant.com

about 11 years ago

In reply to: didier (#6)

Re: WALWriter active during recovery

On 17 December 2014 at 11:27, didier <did447@gmail.com> wrote:

If there's a fsync in progress WALReceiver will:
1- slow the fsync because its writes to the same file are grabbed by the fsync
2- stall until the end of fsync.

PostgreSQL already fsyncs files while they are being written to. Are
you saying we should stop doing that?

It would be possible to synchronize processes so that we don't write
to a file while it is being fsynced.

fsyncs are also made once the whole 16MB has been written, so in those
cases there is no simultaneous action.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Alvaro Herrera

alvherre@2ndquadrant.com

about 11 years ago

In reply to: didier (#6)

Re: WALWriter active during recovery

didier wrote:

On many Linux systems it may not do that much (2.6.32 and 3.2 are bad,
3.13 is better but still it slows the fsync).

If there's a fsync in progress WALReceiver will:
1- slow the fsync because its writes to the same file are grabbed by the fsync
2- stall until the end of fsync.

Is this behavior filesystem-dependent?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

didier

did447@gmail.com

about 11 years ago

In reply to: Alvaro Herrera (#8)

1 attachment(s)

Re: WALWriter active during recovery

On Wed, Dec 17, 2014 at 2:39 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

didier wrote:

On many Linux systems it may not do that much (2.6.32 and 3.2 are bad,
3.13 is better but still it slows the fsync).

If there's a fsync in progress WALReceiver will:
1- slow the fsync because its writes to the same file are grabbed by the fsync
2- stall until the end of fsync.

Is this behavior filesystem-dependent?

I don't know; I only tested ext4.

Attached is the trivial code I used; there's a lot of junk in it.

Didier

#10

Fujii Masao

masao.fujii@gmail.com

about 11 years ago

In reply to: Simon Riggs (#1)

Re: WALWriter active during recovery

On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

At present this is a WIP patch, for code comments only. Don't bother
with anything other than code questions at this stage.

Implementation questions are

* How should we wake WALReceiver, since it waits on a poll(). Should
we use SIGUSR1, which is already used for latch waits, or another
signal?

Probably we need to change libpqwalreceiver so that it uses the latch.
This is useful even for the startup process to report the replay location to
the walreceiver in real time.
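
To illustrate the idea of waking a process that sits in poll(), here is a
minimal sketch of the self-pipe technique, assuming a POSIX system. It is not
PostgreSQL's latch API and not libpqwalreceiver code; wake_pipe,
wake_pipe_set and wait_socket_or_wake are hypothetical names. The waiter
polls both its socket and the read end of a pipe, and any other process
holding the write end can wake it by writing one byte.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

typedef enum { WOKE_TIMEOUT, WOKE_SOCKET, WOKE_LATCH, WOKE_ERROR } wake_reason;

/* Hypothetical stand-in for a latch: a non-blocking pipe. */
typedef struct {
    int rfd;  /* read end, polled by the waiter */
    int wfd;  /* write end, used by whoever wants to wake the waiter */
} wake_pipe;

int wake_pipe_init(wake_pipe *w)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    fcntl(fds[0], F_SETFL, O_NONBLOCK);
    fcntl(fds[1], F_SETFL, O_NONBLOCK);
    w->rfd = fds[0];
    w->wfd = fds[1];
    return 0;
}

/* Wake the waiter. A full pipe means a wakeup is already pending. */
void wake_pipe_set(wake_pipe *w)
{
    char c = 0;
    ssize_t rc = write(w->wfd, &c, 1);
    (void) rc;
}

/* Wait until the socket is readable, a wakeup arrives, or the timeout ends. */
wake_reason wait_socket_or_wake(int sock, wake_pipe *w, int timeout_ms)
{
    struct pollfd pfd[2] = {
        { .fd = sock,   .events = POLLIN },
        { .fd = w->rfd, .events = POLLIN },
    };

    int rc = poll(pfd, 2, timeout_ms);
    if (rc < 0)
        return WOKE_ERROR;
    if (rc == 0)
        return WOKE_TIMEOUT;

    if (pfd[1].revents & POLLIN) {
        /* Drain all pending wakeup bytes so the next poll() blocks again. */
        char buf[64];
        while (read(w->rfd, buf, sizeof buf) > 0)
            ;
        return WOKE_LATCH;
    }
    return WOKE_SOCKET;
}
```

With something of this shape, no extra signal is needed to interrupt the
wait: the wakeup is just another readable descriptor in the poll() set.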

* Should we introduce some pacing delays if the WALreceiver gets too
far ahead of apply?

I don't think so for now. Instead, we can support synchronous_commit = replay,
and the users can use that new mode if they are worried about the delay of
WAL replay.

* Other questions you may have?

Who should wake the startup process so that it reads and replays the WAL data?
Currently walreceiver does. But if walwriter becomes responsible for fsyncing
WAL data, walwriter should probably do that, because the startup process must
not replay WAL data that has not been fsync'd yet.
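
The constraint can be sketched as a toy model, assuming three cooperating
roles that share write, flush and replay positions. This is not code from the
patch; standby_state and the three functions are hypothetical, and wal_pos is
only a stand-in for XLogRecPtr. The point it shows is that whoever advances
the flush position is the natural one to wake the replayer, and that replay
is capped at the flush position.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t wal_pos;  /* stand-in for XLogRecPtr */

/* Hypothetical shared state; the field names are illustrative only. */
typedef struct {
    wal_pos written_upto;    /* received and written, maybe not yet on disk */
    wal_pos flushed_upto;    /* known to be fsync'd */
    wal_pos replayed_upto;   /* applied by the startup process */
    int     startup_wakeups; /* how often the replayer was woken */
} standby_state;

/* Receiver role: writes data but does not fsync it. */
void receiver_write(standby_state *s, wal_pos upto)
{
    if (upto > s->written_upto)
        s->written_upto = upto;
}

/* Flusher role: makes written data durable, then wakes the replayer. */
void flusher_flush(standby_state *s)
{
    if (s->flushed_upto < s->written_upto) {
        s->flushed_upto = s->written_upto;
        s->startup_wakeups++;
    }
}

/* Replayer role: applies WAL up to target, but never past the flush point. */
wal_pos startup_replay(standby_state *s, wal_pos target)
{
    wal_pos limit = target < s->flushed_upto ? target : s->flushed_upto;
    if (limit > s->replayed_upto)
        s->replayed_upto = limit;
    return s->replayed_upto;
}
```

In this model a wakeup issued by the receiver right after a write would be
wasted, since the replayer could not advance until the flusher had run.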

Regards,

--
Fujii Masao


#11

Michael Paquier

michael.paquier@gmail.com

almost 11 years ago

In reply to: Fujii Masao (#10)

Re: WALWriter active during recovery

On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

+1

At present this is a WIP patch, for code comments only. Don't bother
with anything other than code questions at this stage.

Implementation questions are

* How should we wake WALReceiver, since it waits on a poll(). Should
we use SIGUSR1, which is already used for latch waits, or another
signal?

Probably we need to change libpqwalreceiver so that it uses the latch.
This is useful even for the startup process to report the replay location to
the walreceiver in real time.

* Should we introduce some pacing delays if the WALreceiver gets too
far ahead of apply?

I don't think so for now. Instead, we can support synchronous_commit = replay,
and the users can use that new mode if they are worried about the delay of
WAL replay.

* Other questions you may have?

Who should wake the startup process so that it reads and replays the WAL data?
Currently walreceiver does. But if walwriter becomes responsible for fsyncing
WAL data, walwriter should probably do that, because the startup process must
not replay WAL data that has not been fsync'd yet.

Moved this patch to CF 2015-02 to not lose track of it and because it did
not get any reviews.
--
Michael

#12

Fujii Masao

masao.fujii@gmail.com

almost 11 years ago

In reply to: Fujii Masao (#10)

Re: WALWriter active during recovery

On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

With the patch, replication didn't work properly on my machine. I started
the standby server after removing all the WAL files from the standby.
ISTM that the patch doesn't handle that case. That is, in that case,
the standby tries to start up walreceiver and replication to retrieve
the REDO-starting checkpoint record *before* starting up walwriter
(IOW, before reaching the consistent point). Then, since walreceiver works
without walwriter, none of the received WAL data can be fsync'd on the standby,
so replication cannot advance any further. I think that walwriter needs
to start before walreceiver starts.
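
The ordering problem can be shown with a toy simulation. This is only a
sketch under stated assumptions, not postmaster or patch code: the standby
struct, standby_tick and reaches_consistency are hypothetical, and the model
assumes that only walwriter flushes, that walwriter is launched once the
consistent point is reached, and that reaching it requires flushed WAL.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a standby; the names are illustrative only. */
typedef struct {
    bool walwriter_running;
    bool consistent;
    long written;   /* WAL received and written by walreceiver */
    long flushed;   /* WAL fsync'd, which only walwriter does here */
} standby;

/* One iteration of the toy main loop; returns true once consistent. */
bool standby_tick(standby *s, long consistency_point)
{
    s->written += 10;                 /* walreceiver writes, never fsyncs */
    if (s->walwriter_running)
        s->flushed = s->written;      /* walwriter flushes what was written */
    if (s->flushed >= consistency_point)
        s->consistent = true;         /* needs flushed WAL to get here */
    if (s->consistent)
        s->walwriter_running = true;  /* modeled launch rule: at consistency */
    return s->consistent;
}

/* Does the standby ever become consistent within max_ticks iterations? */
bool reaches_consistency(bool start_writer_first, int max_ticks)
{
    standby s = { .walwriter_running = start_writer_first };
    for (int i = 0; i < max_ticks; i++)
        if (standby_tick(&s, 50))
            return true;
    return false;
}
```

With start_writer_first set to false the model never makes progress, which
matches the stuck replication reported above. Starting the flusher first
breaks the cycle.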

I just marked this patch as Waiting on Author.

Regards,

--
Fujii Masao


#13

Fujii Masao

masao.fujii@gmail.com

over 10 years ago

In reply to: Fujii Masao (#12)

Re: WALWriter active during recovery

On Thu, Mar 5, 2015 at 5:22 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

With the patch, replication didn't work properly on my machine. I started
the standby server after removing all the WAL files from the standby.
ISTM that the patch doesn't handle that case. That is, in that case,
the standby tries to start up walreceiver and replication to retrieve
the REDO-starting checkpoint record *before* starting up walwriter
(IOW, before reaching the consistent point). Then, since walreceiver works
without walwriter, none of the received WAL data can be fsync'd on the standby,
so replication cannot advance any further. I think that walwriter needs
to start before walreceiver starts.

I just marked this patch as Waiting on Author.

This patch was moved to current CF with the status "Needs review".
But there are already some review comments which have not been addressed yet,
so I marked the patch as "Waiting on Author" again.

Regards,

--
Fujii Masao


#14

Simon Riggs

simon@2ndQuadrant.com

over 10 years ago

In reply to: Fujii Masao (#13)

Re: WALWriter active during recovery

On 2 July 2015 at 14:31, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Mar 5, 2015 at 5:22 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Currently, WALReceiver writes and fsyncs data it receives. Clearly,
while we are waiting for an fsync we aren't doing any other useful
work.

Following patch starts WALWriter during recovery and makes it
responsible for fsyncing data, allowing WALReceiver to progress other
useful actions.

With the patch, replication didn't work properly on my machine. I started
the standby server after removing all the WAL files from the standby.
ISTM that the patch doesn't handle that case. That is, in that case,
the standby tries to start up walreceiver and replication to retrieve
the REDO-starting checkpoint record *before* starting up walwriter
(IOW, before reaching the consistent point). Then, since walreceiver works
without walwriter, none of the received WAL data can be fsync'd on the standby,
so replication cannot advance any further. I think that walwriter needs
to start before walreceiver starts.

I just marked this patch as Waiting on Author.

This patch was moved to current CF with the status "Needs review".
But there are already some review comments which have not been addressed yet,
so I marked the patch as "Waiting on Author" again.

This was pushed back from last CF and I haven't worked on it at all, nor
will I.

Pushing back again.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15

Andres Freund

andres@anarazel.de

over 10 years ago

In reply to: Simon Riggs (#14)

Re: WALWriter active during recovery

On 2015-07-02 14:34:48 +0100, Simon Riggs wrote:

This was pushed back from last CF and I haven't worked on it at all, nor
will I.

Pushing back again.

Let's "return with feedback", not "move", it then. Moving entries
along that aren't expected to receive updates anytime soon isn't a good
idea; there are more than enough entries in each CF.


#16

Simon Riggs

simon@2ndQuadrant.com

over 10 years ago

In reply to: Andres Freund (#15)

Re: WALWriter active during recovery

On 2 July 2015 at 14:38, Andres Freund <andres@anarazel.de> wrote:

On 2015-07-02 14:34:48 +0100, Simon Riggs wrote:

This was pushed back from last CF and I haven't worked on it at all, nor
will I.

Pushing back again.

Let's "return with feedback", not "move", it then. Moving entries
along that aren't expected to receive updates anytime soon isn't a good
idea; there are more than enough entries in each CF.

Although I agree, the interface won't let me do that, so will leave as-is.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services