Measuring replay lag

Started by Thomas Munro · about 9 years ago · 55 messages
#1 Thomas Munro
thomas.munro@enterprisedb.com
1 attachment(s)

Hi hackers,

Here is a new version of my patch to add a replay_lag column to the
pg_stat_replication view (originally proposed as part of a larger
patch set for 9.6[1]), like this:

postgres=# select application_name, replay_lag from pg_stat_replication;
┌──────────────────┬─────────────────┐
│ application_name │   replay_lag    │
├──────────────────┼─────────────────┤
│ replica1         │ 00:00:00.595382 │
│ replica2         │ 00:00:00.598448 │
│ replica3         │ 00:00:00.541597 │
│ replica4         │ 00:00:00.551227 │
└──────────────────┴─────────────────┘
(4 rows)

It works by taking advantage of the { time, end-of-WAL } samples that
sending servers already include in message headers to standbys. That
seems to provide a pretty good proxy for when the WAL was written, if
you ignore messages where the LSN hasn't advanced. The patch
introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
control how often the standby should record these timestamped LSNs
into a small circular buffer. When its recovery process eventually
replays a timestamped LSN, the timestamp is sent back to the upstream
server in a new reply message field. The value visible in
pg_stat_replication.replay_lag can then be updated by comparing with
the current time.
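
To make that concrete, here's a self-contained toy model of the
bookkeeping, with plain integers standing in for TimestampTz and
XLogRecPtr (illustrative only; the real code is in the attached patch):

#include <stdint.h>
#include <stdio.h>

#define BUFFER_SIZE 8                 /* the patch uses 8192 */

typedef struct { int64_t timestamp; uint64_t lsn; } Sample;

static Sample samples[BUFFER_SIZE];
static unsigned read_head, write_head;  /* buffer is empty when equal */

/* walreceiver: buffer a { send-time, end-of-WAL } pair from a header */
static void remember(int64_t timestamp, uint64_t lsn)
{
    samples[write_head].timestamp = timestamp;
    samples[write_head].lsn = lsn;
    write_head = (write_head + 1) % BUFFER_SIZE; /* full case omitted */
}

/* startup process: after replaying up to 'lsn', consume replayed samples */
static int find_replayed(uint64_t lsn, int64_t *timestamp)
{
    int found = 0;
    while (read_head != write_head && samples[read_head].lsn <= lsn)
    {
        *timestamp = samples[read_head].timestamp;
        read_head = (read_head + 1) % BUFFER_SIZE;
        found = 1;
    }
    return found;
}

int main(void)
{
    int64_t ts;

    remember(1000000, 100);  /* sender's clock read 1.0s at WAL end 100 */
    remember(2000000, 200);  /* ... and 2.0s at WAL end 200 */

    /* Standby replays up to LSN 150 and echoes the matching timestamp;
     * the sender subtracts it from its own clock (say, now = 2.5s). */
    if (find_replayed(150, &ts))
        printf("replay_lag = %lld us\n", (long long) (2500000 - ts));
    return 0;
}

The sender's clock is the only one ever read, which is the key to the
skew-free property listed below.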

Compared to the usual techniques people use to estimate replay lag,
this approach has the following advantages:

1. The lag is measured in time, not LSN difference.
2. The lag time is computed using two observations of a single
server's clock, so there is no clock skew.
3. The lag is updated even between commits (during large data loads etc).

In the previous version, whenever the standby was fully caught up and
nothing was happening, I was effectively showing the ping time between
the servers. I decided that was not useful information, and that it's
more newsworthy and interesting to see the estimated replay lag for
the most recent real replayed activity, so I changed that.

In the last thread[1], Robert Haas wrote:

Well, one problem with this is that you can't put a loop inside of a
spinlock-protected critical section.

Fixed.
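
To illustrate: the loop over the sample buffer now runs outside the
critical section, and only its results are published under the
spinlock, like this (condensed from the StartupXLOG changes in the
attached patch):

    /* Loop outside the lock: scan for samples whose LSN is now replayed. */
    timestamp_read_head =
        CheckForReplayedTimestamps(EndRecPtr, &replayed_timestamp);

    /* Loop-free critical section: just publish the results. */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->timestamps.read_head = timestamp_read_head;
    if (replayed_timestamp != 0)
        XLogCtl->lastReplayedTimestamp = replayed_timestamp;
    SpinLockRelease(&XLogCtl->info_lck);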

In general, I think this is a pretty reasonable way of attacking this
problem, but I'd say it's significantly under-commented. Where should
someone go to get a general overview of this mechanism? The answer is
not "at place XXX within the patch". (I think it might merit some
more extensive documentation, too, although I'm not exactly sure what
that should look like.)

I have added lots of comments.

When you overflow the buffer, you could thin it out in a smarter way,
like by throwing away every other entry instead of the oldest one. I
guess you'd need to be careful how you coded that, though, because
replaying an entry with a timestamp invalidates some of the saved
entries without formally throwing them out.

Done, by overwriting the newest sample rather than the oldest if the
buffer is full. That seems to give pretty reasonable degradation,
effectively lowering the sampling rate, without any complicated buffer
or rate management code.
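
In code form (condensed from SetXLogReplayTimestampAtLsn in the
attached patch), the full-buffer case steps back one slot so that the
newest sample is clobbered and the write head stays put:

    new_write_head = (write_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
    if (new_write_head == read_head)
    {
        /* Buffer full: don't advance; overwrite the newest sample. */
        new_write_head = write_head;
        write_head = (write_head - 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
    }

(The unsigned wrap-around in the decrement is safe because the buffer
size is a power of two.)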

Conceivably, 0002 could be split into two patches, one of which
computes "stupid replay lag" considering only records that naturally
carry timestamps, and a second adding the circular buffer to handle
the case where much time passes without finding such a record.

I contemplated this but decided that it'd be best to use ONLY samples
from walsender headers, and never use the time stamps from commit
records for this. If we use times from commit records, then a
cascading sending server will not be able to compute the difference in
time without introducing clock skew (not to mention the difficulty of
combining timestamps from two sources if we try to do both). I
figured that it's better to have a value that shows a cascading
sender->standby->cascading sender round trip time that is free of
clock skew, than a master->cascading sender->standby->cascading sender
incomplete round trip that includes clock skew.

By the same reasoning I decided against introducing a new periodic WAL
record or field from the master to hold extra time stamps in between
commits to do this, in favour of the buffered transient timestamp
approach I took in this patch. That said, I can see there are
arguments for doing it with extra periodic WAL timestamps, if people
don't think it'd be too invasive to mess with the WAL for this, and
don't care about cascading standbys giving skewed readings. One
advantage would be that persistent WAL timestamps could still provide
lag estimates while a standby that had been down for a while was
catching up, whereas this approach can't until the standby has caught
up, for lack of buffered transient timestamps. Thoughts?

I plan to post a new "causal reads" patch at some point which will
depend on this, but in any case I think this is a useful feature on
its own. I'd be grateful for any feedback, flames, better ideas etc.
Thanks for reading.

[1]: /messages/by-id/CAEepm=31yndQ7S5RdGofoGz1yQ-cteMrePR2JLf9gWGzxKcV7w@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replay-lag-v12.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..fb39f4c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3256,6 +3256,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-replay-lag-sample-interval" xreflabel="replay_lag_sample_interval">
+      <term><varname>replay_lag_sample_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>replay_lag_sample_interval</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls how often a standby should sample replay lag information to
+        send back to the primary or upstream standby while replaying WAL.  The
+        default is 1 second.  Units are milliseconds if not specified.  A
+        value of -1 disables the reporting of replay lag.  Estimated replay lag
+        can be seen in the <link linkend="monitoring-stats-views-table">
+        <literal>pg_stat_replication</></link> view of the upstream server.
+        This parameter can only be set
+        in the <filename>postgresql.conf</> file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby_feedback">
       <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3de489e..a4cb0e4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1381,6 +1381,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be replayed on this
+      standby server</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d6c057a..7091fde 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -82,6 +82,8 @@ extern uint32 bootstrap_data_checksum_version;
 #define PROMOTE_SIGNAL_FILE		"promote"
 #define FALLBACK_PROMOTE_SIGNAL_FILE "fallback_promote"
 
+/* Size of the circular buffer of timestamped LSNs. */
+#define XLOG_TIMESTAMP_BUFFER_SIZE 8192
 
 /* User-settable parameters */
 int			max_wal_size = 64;	/* 1 GB */
@@ -521,6 +523,26 @@ typedef struct XLogCtlInsert
 } XLogCtlInsert;
 
 /*
+ * A sample associating a timestamp with a given xlog position.
+ */
+typedef struct XLogTimestamp
+{
+	TimestampTz	timestamp;
+	XLogRecPtr	lsn;
+} XLogTimestamp;
+
+/*
+ * A circular buffer of LSNs and associated timestamps.  The buffer is empty
+ * when read_head == write_head.
+ */
+typedef struct XLogTimestampBuffer
+{
+	uint32			read_head;
+	uint32			write_head;
+	XLogTimestamp	buffer[XLOG_TIMESTAMP_BUFFER_SIZE];
+} XLogTimestampBuffer;
+
+/*
  * Total shared-memory state for XLOG.
  */
 typedef struct XLogCtlData
@@ -635,6 +657,12 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/* timestamp from the most recently applied WAL position that had an associated timestamp. */
+	TimestampTz lastReplayedTimestamp;
+
+	/* a buffer of upstream timestamps for WAL that is not yet applied. */
+	XLogTimestampBuffer timestamps;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -5977,6 +6005,44 @@ CheckRequiredParameterValues(void)
 }
 
 /*
+ * Called by the startup process after it has replayed up to 'lsn'.  Checks
+ * for timestamps associated with WAL positions that have now been replayed.
+ * If any are found, the latest such timestamp found is written to
+ * '*timestamp'.  Returns the new buffer read head position, which the caller
+ * should write into XLogCtl->timestamps.read_head while holding info_lck.
+ */
+static uint32
+CheckForReplayedTimestamps(XLogRecPtr lsn, TimestampTz *timestamp)
+{
+	uint32 read_head;
+
+	/*
+	 * It's OK to access timestamps.read_head without any kind of synchronization
+	 * because this process is the only one to write to it.
+	 */
+	Assert(AmStartupProcess());
+	read_head = XLogCtl->timestamps.read_head;
+
+	/*
+	 * It's OK to access write_head without interlocking because it's an
+	 * aligned 32 bit value which we can read atomically on all supported
+	 * platforms to get some recent value, not a torn/garbage value.
+	 * Furthermore we must see a value that is at least as recent as any WAL
+	 * that we have replayed, because walreceiver calls
+	 * SetXLogReplayTimestampAtLsn before passing the corresponding WAL data
+	 * to the recovery process.
+	 */
+	while (read_head != XLogCtl->timestamps.write_head &&
+		   XLogCtl->timestamps.buffer[read_head].lsn <= lsn)
+	{
+		*timestamp = XLogCtl->timestamps.buffer[read_head].timestamp;
+		read_head = (read_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+	}
+
+	return read_head;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
@@ -6795,6 +6861,8 @@ StartupXLOG(void)
 			do
 			{
 				bool		switchedTLI = false;
+				TimestampTz	replayed_timestamp = 0;
+				uint32		timestamp_read_head;
 
 #ifdef WAL_DEBUG
 				if (XLOG_DEBUG ||
@@ -6948,24 +7016,34 @@ StartupXLOG(void)
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
+				/* Check if we have replayed a timestamped WAL position */
+				timestamp_read_head =
+					CheckForReplayedTimestamps(EndRecPtr, &replayed_timestamp);
+
 				/*
-				 * Update lastReplayedEndRecPtr after this record has been
-				 * successfully replayed.
+				 * Update lastReplayedEndRecPtr and lastReplayedTimestamp
+				 * after this record has been successfully replayed.
 				 */
 				SpinLockAcquire(&XLogCtl->info_lck);
 				XLogCtl->lastReplayedEndRecPtr = EndRecPtr;
 				XLogCtl->lastReplayedTLI = ThisTimeLineID;
+				XLogCtl->timestamps.read_head = timestamp_read_head;
+				if (replayed_timestamp != 0)
+					XLogCtl->lastReplayedTimestamp = replayed_timestamp;
 				SpinLockRelease(&XLogCtl->info_lck);
 
 				/*
 				 * If rm_redo called XLogRequestWalReceiverReply, then we wake
 				 * up the receiver so that it notices the updated
-				 * lastReplayedEndRecPtr and sends a reply to the master.
+				 * lastReplayedEndRecPtr and sends a reply to the master.  We
+				 * also wake it if we have replayed a WAL position that has
+				 * an associated timestamp so that the upstream server can
+				 * measure our replay lag.
 				 */
-				if (doRequestWalReceiverReply)
+				if (doRequestWalReceiverReply || replayed_timestamp != 0)
 				{
 					doRequestWalReceiverReply = false;
-					WalRcvForceReply();
+					WalRcvForceReply(replayed_timestamp != 0);
 				}
 
 				/* Remember this record as the last-applied one */
@@ -11720,3 +11798,81 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Record the timestamp that is associated with a WAL position.
+ *
+ * This is called by walreceiver on standby servers when new messages arrive,
+ * using a timestamp and the latest known WAL position from the upstream
+ * server.  The timestamp will be sent back to the upstream server via
+ * walreceiver when the recovery process has applied the WAL position.  The
+ * upstream server can then compute the elapsed time to estimate the replay
+ * lag.
+ */
+void
+SetXLogReplayTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn)
+{
+	Assert(AmWalReceiverProcess());
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (lsn == XLogCtl->lastReplayedEndRecPtr)
+	{
+		/*
+		 * That is the last replayed LSN: we are fully replayed, so we can
+		 * update the replay timestamp immediately.
+		 */
+		XLogCtl->lastReplayedTimestamp = timestamp;
+	}
+	else
+	{
+		/*
+		 * There is WAL still to be applied.  We will associate the timestamp
+		 * with this WAL position and wait for it to be replayed.  We add it
+		 * at the 'write' end of the circular buffer of LSN/timestamp
+		 * mappings, which the replay loop will eventually read.
+		 */
+		uint32 write_head = XLogCtl->timestamps.write_head;
+		uint32 new_write_head = (write_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+
+		if (new_write_head == XLogCtl->timestamps.read_head)
+		{
+			/*
+			 * The buffer is full, so we'll rewind and overwrite the most
+			 * recent sample.  Overwriting the most recent sample means that
+			 * if we're not replaying fast enough and the buffer fills up,
+			 * we'll effectively lower the sampling rate.
+			 */
+			new_write_head = write_head;
+			write_head = (write_head - 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+		}
+
+		XLogCtl->timestamps.buffer[write_head].lsn = lsn;
+		XLogCtl->timestamps.buffer[write_head].timestamp = timestamp;
+		XLogCtl->timestamps.write_head = new_write_head;
+	}
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Get the timestamp for the most recently applied WAL record that carried a
+ * timestamp from the upstream server, and also the most recently applied LSN.
+ * (Note that the timestamp and the LSN don't necessarily relate to the same
+ * record.)
+ *
+ * This is similar to GetLatestXTime, except that it is advanced when WAL
+ * positions recorded with SetXLogReplayTimestampAtLsn have been applied,
+ * rather than commit records.
+ */
+TimestampTz
+GetXLogReplayTimestamp(XLogRecPtr *lsn)
+{
+	TimestampTz result;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (lsn)
+		*lsn = XLogCtl->lastReplayedEndRecPtr;
+	result = XLogCtl->lastReplayedTimestamp;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return result;
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ada2142..69df784 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -662,6 +662,7 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2bb3dce..f5a10e9 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -73,6 +73,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			replay_lag_sample_interval;
 
 /* libpqreceiver hooks to these when loaded */
 walrcv_connect_type walrcv_connect = NULL;
@@ -145,7 +146,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, bool reportApplyTimestamp);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 
@@ -457,7 +458,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, false);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -494,6 +495,8 @@ WalReceiverMain(void)
 					ResetLatch(&walrcv->latch);
 					if (walrcv->force_reply)
 					{
+						bool timestamp = walrcv->force_reply_apply_timestamp;
+
 						/*
 						 * The recovery process has asked us to send apply
 						 * feedback now.  Make sure the flag is really set to
@@ -501,8 +504,9 @@ WalReceiverMain(void)
 						 * we don't miss a new request for a reply.
 						 */
 						walrcv->force_reply = false;
+						walrcv->force_reply_apply_timestamp = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, timestamp);
 					}
 				}
 				if (rc & WL_POSTMASTER_DEATH)
@@ -560,7 +564,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, false);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -911,7 +915,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, false);
 				break;
 			}
 		default:
@@ -1074,7 +1078,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1092,15 +1096,18 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
 * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
+ *
+ * If 'reportApplyTimestamp' is true, the latest apply timestamp is included.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, bool reportApplyTimestamp)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
 	XLogRecPtr	applyPtr;
 	static TimestampTz sendTime = 0;
 	TimestampTz now;
+	TimestampTz applyTimestamp = 0;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1132,7 +1139,35 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyTimestamp = GetXLogReplayTimestamp(&applyPtr);
+
+	/* Decide whether to send an apply timestamp for replay lag estimation. */
+	if (replay_lag_sample_interval != -1)
+	{
+		static TimestampTz lastTimestampSendTime = 0;
+
+		/*
+		 * Only send an apply timestamp if we were explicitly asked to by the
+		 * recovery process or if replay lag sampling is active but the
+		 * recovery process seems to be stuck.
+		 *
+		 * If we haven't heard from the recovery process in a time exceeding
+		 * wal_receiver_status_interval and yet it has not applied the highest
+		 * LSN we've heard about, then we want to resend the last replayed
+		 * timestamp we have; otherwise we zero it out and wait for the
+		 * recovery process to wake us when it has set a new accurate replay
+		 * timestamp.  Note that we can read latestWalEnd without acquiring the
+		 * mutex that protects it because it is only written to by this
+		 * process (walreceiver).
+		 */
+		if (reportApplyTimestamp ||
+			(WalRcv->latestWalEnd > applyPtr &&
+			 TimestampDifferenceExceeds(lastTimestampSendTime, now,
+										wal_receiver_status_interval * 1000)))
+			lastTimestampSendTime = now;
+		else
+			applyTimestamp = 0;
+	}
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'r');
@@ -1140,6 +1175,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, flushPtr);
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+	pq_sendint64(&reply_message, TimestampTzToIntegerTimestamp(applyTimestamp));
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
 
 	/* Send it */
@@ -1244,18 +1280,40 @@ static void
 ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 {
 	WalRcvData *walrcv = WalRcv;
-
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
+	bool newHighWalEnd = false;
+
+	static TimestampTz lastRecordedTimestamp = 0;
 
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
+	{
 		walrcv->latestWalEndTime = sendTime;
+		newHighWalEnd = true;
+	}
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
+	/*
+	 * If replay lag sampling is active, remember the upstream server's
+	 * timestamp at the latest WAL end that it has, unless we've already
+	 * done that too recently or the LSN hasn't advanced.  This timestamp
+	 * will be fed back to us by the startup process when it eventually
+	 * replays this LSN, so that we can feed it back to the upstream server
+	 * for replay lag tracking purposes.
+	 */
+	if (replay_lag_sample_interval != -1 &&
+		newHighWalEnd &&
+		sendTime > TimestampTzPlusMilliseconds(lastRecordedTimestamp,
+											   replay_lag_sample_interval))
+	{
+		SetXLogReplayTimestampAtLsn(sendTime, walEnd);
+		lastRecordedTimestamp = sendTime;
+	}
+
 	if (log_min_messages <= DEBUG2)
 	{
 		char	   *sendtime;
@@ -1291,12 +1349,14 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply.  Also used to send periodic messages
+ * which are used to compute pg_stat_replication.replay_lag.
  */
 void
-WalRcvForceReply(void)
+WalRcvForceReply(bool apply_timestamp)
 {
 	WalRcv->force_reply = true;
+	WalRcv->force_reply_apply_timestamp = apply_timestamp;
 	SetLatch(&WalRcv->latch);
 }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index bc5e508..b6211a5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1545,15 +1545,29 @@ ProcessStandbyReplyMessage(void)
 	XLogRecPtr	writePtr,
 				flushPtr,
 				applyPtr;
+	int64		applyLagUs;
 	bool		replyRequested;
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz applyTimestamp;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
 	flushPtr = pq_getmsgint64(&reply_message);
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
+	applyTimestamp = IntegerTimestampToTimestampTz(pq_getmsgint64(&reply_message));
 	replyRequested = pq_getmsgbyte(&reply_message);
 
+	/* Compute the apply lag in microseconds. */
+	if (applyTimestamp == 0)
+		applyLagUs = -1;
+	else
+#ifdef HAVE_INT64_TIMESTAMP
+		applyLagUs = now - applyTimestamp;
+#else
+		applyLagUs = (now - applyTimestamp) * 1000000;
+#endif
+
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
 		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
@@ -1575,6 +1589,8 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (applyLagUs >= 0)
+			walsnd->applyLagUs = applyLagUs;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1971,6 +1987,7 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->applyLagUs = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2753,7 +2770,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	9
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2801,6 +2818,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		int64		applyLagUs;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2815,6 +2833,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		applyLagUs = walsnd->applyLagUs;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2849,6 +2868,23 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[5] = true;
 			values[5] = LSNGetDatum(apply);
 
+			if (applyLagUs < 0)
+				nulls[6] = true;
+			else
+			{
+				Interval *applyLagInterval = palloc(sizeof(Interval));
+
+				applyLagInterval->month = 0;
+				applyLagInterval->day = 0;
+#ifdef HAVE_INT64_TIMESTAMP
+				applyLagInterval->time = applyLagUs;
+#else
+				applyLagInterval->time = applyLagUs / 1000000.0;
+#endif
+				nulls[6] = false;
+				values[6] = IntervalPGetDatum(applyLagInterval);
+			}
+
 			/*
 			 * Treat a standby such as a pg_basebackup background process
 			 * which always returns an invalid flush location, as an
@@ -2856,18 +2892,18 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			values[7] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
 			 * informational, not different from priority.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[8] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[8] = CStringGetTextDatum("sync");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[8] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index c1d6f05..323d640 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1724,6 +1724,20 @@ GetSQLLocalTimestamp(int32 typmod)
 }
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native timestamp to int64 format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * 1000000;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65660c1..c9cab87 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1810,6 +1810,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"replay_lag_sample_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+			gettext_noop("Sets the minimum time between WAL timestamp samples used to estimate replay lag."),
+			NULL,
+			GUC_UNIT_S
+		},
+		&replay_lag_sample_interval,
+		1 * 1000, -1, INT_MAX / 1000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the maximum wait time to receive data from the primary."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf34005..8c940a0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -271,6 +271,8 @@
 					# in milliseconds; 0 disables
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
+#replay_lag_sample_interval = 1s	# min time between timestamps recorded
+					# to estimate replay lag; -1 disables replay lag sampling
 
 
 #------------------------------------------------------------------------------
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..1be2f34 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,9 @@ extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
 extern XLogRecPtr GetXLogWriteRecPtr(void);
+extern void SetXLogReplayTimestamp(TimestampTz timestamp);
+extern void SetXLogReplayTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn);
+extern TimestampTz GetXLogReplayTimestamp(XLogRecPtr *lsn);
 extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 17ec71d..9e3ce5f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2764,7 +2764,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,23,25}" "{o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index cd787c9..9a64bda 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -23,6 +23,7 @@
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int	replay_lag_sample_interval;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -119,6 +120,9 @@ typedef struct
 	 */
 	bool		force_reply;
 
+	/* include the latest replayed timestamp when replying? */
+	bool		force_reply_apply_timestamp;
+
 	/* set true once conninfo is ready to display (obfuscated pwds etc) */
 	bool		ready_to_display;
 
@@ -176,6 +180,6 @@ extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
-extern void WalRcvForceReply(void);
+extern void WalRcvForceReply(bool sendApplyTimestamp);
 
 #endif   /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7794aa5..4de43e8 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -46,6 +46,7 @@ typedef struct WalSnd
 	XLogRecPtr	write;
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
+	int64		applyLagUs;
 
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 93b90fe..20517c9 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -233,9 +233,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 00700f2..061db9b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1790,10 +1790,11 @@ pg_stat_replication| SELECT s.pid,
     w.write_location,
     w.flush_location,
     w.replay_location,
+    w.replay_lag,
     w.sync_priority,
     w.sync_state
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
#2 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Thomas Munro (#1)
Re: Measuring replay lag

On Wed, Oct 26, 2016 at 7:34 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Hi hackers,

Here is a new version of my patch to add a replay_lag column to the
pg_stat_replication view (originally proposed as part of a larger
patch set for 9.6[1]), like this:

Thank you for working on this!

postgres=# select application_name, replay_lag from pg_stat_replication;
┌──────────────────┬─────────────────┐
│ application_name │   replay_lag    │
├──────────────────┼─────────────────┤
│ replica1         │ 00:00:00.595382 │
│ replica2         │ 00:00:00.598448 │
│ replica3         │ 00:00:00.541597 │
│ replica4         │ 00:00:00.551227 │
└──────────────────┴─────────────────┘
(4 rows)

It works by taking advantage of the { time, end-of-WAL } samples that
sending servers already include in message headers to standbys. That
seems to provide a pretty good proxy for when the WAL was written, if
you ignore messages where the LSN hasn't advanced. The patch
introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
control how often the standby should record these timestamped LSNs
into a small circular buffer. When its recovery process eventually
replays a timestamped LSN, the timestamp is sent back to the upstream
server in a new reply message field. The value visible in
pg_stat_replication.replay_lag can then be updated by comparing with
the current time.

replay_lag_sample_interval is 1s by default, but I got 1000s from the SHOW command.
postgres(1:36789)=# show replay_lag_sample_interval ;
replay_lag_sample_interval
----------------------------
1000s
(1 row)

Also, when I set replay_lag_sample_interval = 500ms, I got 0 from the SHOW command.
postgres(1:99850)=# select name, setting, applied from
pg_file_settings where name = 'replay_lag_sample_interval';
            name            | setting | applied
----------------------------+---------+---------
 replay_lag_sample_interval | 500ms   | t
(1 row)

postgres(1:99850)=# show replay_lag_sample_interval ;
replay_lag_sample_interval
----------------------------
0
(1 row)

Compared to the usual techniques people use to estimate replay lag,
this approach has the following advantages:

1. The lag is measured in time, not LSN difference.
2. The lag time is computed using two observations of a single
server's clock, so there is no clock skew.
3. The lag is updated even between commits (during large data loads etc).

I agree with this approach.

In the previous version, whenever the standby was fully caught up and
nothing was happening, I was effectively showing the ping time between
the servers. I decided that was not useful information, and that it's
more newsworthy and interesting to see the estimated replay lag for
the most recent real replayed activity, so I changed that.

In the last thread[1], Robert Haas wrote:

Well, one problem with this is that you can't put a loop inside of a
spinlock-protected critical section.

Fixed.

In general, I think this is a pretty reasonable way of attacking this
problem, but I'd say it's significantly under-commented. Where should
someone go to get a general overview of this mechanism? The answer is
not "at place XXX within the patch". (I think it might merit some
more extensive documentation, too, although I'm not exactly sure what
that should look like.)

I have added lots of comments.

When you overflow the buffer, you could thin it out in a smarter way,
like by throwing away every other entry instead of the oldest one. I
guess you'd need to be careful how you coded that, though, because
replaying an entry with a timestamp invalidates some of the saved
entries without formally throwing them out.

Done, by overwriting the newest sample rather than the oldest if the
buffer is full. That seems to give pretty reasonable degradation,
effectively lowering the sampling rate, without any complicated buffer
or rate management code.

Conceivably, 0002 could be split into two patches, one of which
computes "stupid replay lag" considering only records that naturally
carry timestamps, and a second adding the circular buffer to handle
the case where much time passes without finding such a record.

I contemplated this but decided that it'd be best to use ONLY samples
from walsender headers, and never use the time stamps from commit
records for this. If we use times from commit records, then a
cascading sending server will not be able to compute the difference in
time without introducing clock skew (not to mention the difficulty of
combining timestamps from two sources if we try to do both). I
figured that it's better to have a value that shows a cascading
sender->standby->cascading sender round trip time that is free of
clock skew, than a master->cascading sender->standby->cascading sender
incomplete round trip that includes clock skew.

By the same reasoning I decided against introducing a new periodic WAL
record or field from the master to hold extra time stamps in between
commits to do this, in favour of the buffered transient timestamp
approach I took in this patch.

I think that you need to change sendFeedback() in pg_recvlogical.c and
receivexlog.c as well.

That said, I can see there are
arguments for doing it with extra periodic WAL timestamps, if people
don't think it'd be too invasive to mess with the WAL for this, and
don't care about cascading standbys giving skewed readings. One
advantage would be that persistent WAL timestamps could still provide
lag estimates while a standby that had been down for a while was
catching up, whereas this approach can't until the standby has caught
up, for lack of buffered transient timestamps. Thoughts?

I plan to post a new "causal reads" patch at some point which will
depend on this, but in any case I think this is a useful feature on
its own. I'd be grateful for any feedback, flames, better ideas etc.
Thanks for reading.

[1] /messages/by-id/CAEepm=31yndQ7S5RdGofoGz1yQ-cteMrePR2JLf9gWGzxKcV7w@mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#3 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Masahiko Sawada (#2)
1 attachment(s)
Re: Measuring replay lag

On Tue, Nov 8, 2016 at 2:35 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

replay_lag_sample_interval is 1s by default, but I got 1000s from the SHOW command.
postgres(1:36789)=# show replay_lag_sample_interval ;
replay_lag_sample_interval
----------------------------
1000s
(1 row)

Oops, fixed.
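
The cause was that the GUC was declared with GUC_UNIT_S while the
default of 1000 and the documentation assumed milliseconds, which is
also why your 500ms setting was rounded down to 0.  The fix is to
declare the unit as milliseconds, roughly as follows (a reconstruction
for illustration; the guc.c hunk isn't quoted below):

    {
        {"replay_lag_sample_interval", PGC_SIGHUP, REPLICATION_STANDBY,
            gettext_noop("Sets the minimum time between WAL timestamp samples used to estimate replay lag."),
            NULL,
            GUC_UNIT_MS    /* v12 mistakenly used GUC_UNIT_S */
        },
        &replay_lag_sample_interval,
        1 * 1000, -1, INT_MAX,
        NULL, NULL, NULL
    },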

1. The lag is measured in time, not LSN difference.
2. The lag time is computed using two observations of a single
server's clock, so there is no clock skew.
3. The lag is updated even between commits (during large data loads etc).

I agree with this approach.

Thanks for the feedback.

I think that you need to change sendFeedback() in pg_recvlogical.c and
receivexlog.c as well.

Right, fixed.
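
For completeness: those tools build the 'r' reply message by hand, so
they need to append the new apply-timestamp field before the
reply-request byte to match the wire format above.  A sketch of the
shape (not the actual hunk; variable names as in pg_recvlogical's
sendFeedback, and a standalone client that doesn't track replay can
just send 0 for "no timestamp"):

    replybuf[len] = 'r';
    len += 1;
    fe_sendint64(output_written_lsn, &replybuf[len]);   /* write */
    len += 8;
    fe_sendint64(output_fsync_lsn, &replybuf[len]);     /* flush */
    len += 8;
    fe_sendint64(InvalidXLogRecPtr, &replybuf[len]);    /* apply */
    len += 8;
    fe_sendint64(now, &replybuf[len]);                  /* sendTime */
    len += 8;
    fe_sendint64(0, &replybuf[len]);          /* apply timestamp: 0 = none */
    len += 8;
    replybuf[len] = replyRequested ? 1 : 0;
    len += 1;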

Thanks very much for testing! New version attached. I will add this
to the next CF.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replay-lag-v13.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..fb39f4c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3256,6 +3256,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-replay-lag-sample-interval" xreflabel="replay_lag_sample_interval">
+      <term><varname>replay_lag_sample_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>replay_lag_sample_interval</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls how often a standby should sample replay lag information to
+        send back to the primary or upstream standby while replaying WAL.  The
+        default is 1 second.  Units are milliseconds if not specified.  A
+        value of -1 disables the reporting of replay lag.  Estimated replay lag
+        can be seen in the <link linkend="monitoring-stats-views-table">
+        <literal>pg_stat_replication</></link> view of the upstream server.
+        This parameter can only be set
+        in the <filename>postgresql.conf</> file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby_feedback">
       <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3de489e..a4cb0e4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1381,6 +1381,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be replayed on this
+      standby server</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..5bcb54b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -82,6 +82,8 @@ extern uint32 bootstrap_data_checksum_version;
 #define PROMOTE_SIGNAL_FILE		"promote"
 #define FALLBACK_PROMOTE_SIGNAL_FILE "fallback_promote"
 
+/* Size of the circular buffer of timestamped LSNs. */
+#define XLOG_TIMESTAMP_BUFFER_SIZE 8192
 
 /* User-settable parameters */
 int			max_wal_size = 64;	/* 1 GB */
@@ -521,6 +523,26 @@ typedef struct XLogCtlInsert
 } XLogCtlInsert;
 
 /*
+ * A sample associating a timestamp with a given xlog position.
+ */
+typedef struct XLogTimestamp
+{
+	TimestampTz	timestamp;
+	XLogRecPtr	lsn;
+} XLogTimestamp;
+
+/*
+ * A circular buffer of LSNs and associated timestamps.  The buffer is empty
+ * when read_head == write_head.
+ */
+typedef struct XLogTimestampBuffer
+{
+	uint32			read_head;
+	uint32			write_head;
+	XLogTimestamp	buffer[XLOG_TIMESTAMP_BUFFER_SIZE];
+} XLogTimestampBuffer;
+
+/*
  * Total shared-memory state for XLOG.
  */
 typedef struct XLogCtlData
@@ -638,6 +660,12 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/* timestamp from the most recently applied WAL position that had an associated timestamp. */
+	TimestampTz lastReplayedTimestamp;
+
+	/* a buffer of upstream timestamps for WAL that is not yet applied. */
+	XLogTimestampBuffer timestamps;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -5981,6 +6009,44 @@ CheckRequiredParameterValues(void)
 }
 
 /*
+ * Called by the startup process after it has replayed up to 'lsn'.  Checks
+ * for timestamps associated with WAL positions that have now been replayed.
+ * If any are found, the latest such timestamp found is written to
+ * '*timestamp'.  Returns the new buffer read head position, which the caller
+ * should write into XLogCtl->timestamps.read_head while holding info_lck.
+ */
+static uint32
+CheckForReplayedTimestamps(XLogRecPtr lsn, TimestampTz *timestamp)
+{
+	uint32 read_head;
+
+	/*
+	 * It's OK to access timestamps.read_head without any kind of synchronization
+	 * because this process is the only one to write to it.
+	 */
+	Assert(AmStartupProcess());
+	read_head = XLogCtl->timestamps.read_head;
+
+	/*
+	 * It's OK to access write_head without interlocking because it's an
+	 * aligned 32 bit value which we can read atomically on all supported
+	 * platforms to get some recent value, not a torn/garbage value.
+	 * Furthermore we must see a value that is at least as recent as any WAL
+	 * that we have replayed, because walreceiver calls
+	 * SetXLogReplayTimestampAtLsn before passing the corresponding WAL data
+	 * to the recovery process.
+	 */
+	while (read_head != XLogCtl->timestamps.write_head &&
+		   XLogCtl->timestamps.buffer[read_head].lsn <= lsn)
+	{
+		*timestamp = XLogCtl->timestamps.buffer[read_head].timestamp;
+		read_head = (read_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+	}
+
+	return read_head;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
@@ -6799,6 +6865,8 @@ StartupXLOG(void)
 			do
 			{
 				bool		switchedTLI = false;
+				TimestampTz	replayed_timestamp = 0;
+				uint32		timestamp_read_head;
 
 #ifdef WAL_DEBUG
 				if (XLOG_DEBUG ||
@@ -6952,24 +7020,34 @@ StartupXLOG(void)
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
+				/* Check if we have replayed a timestamped WAL position */
+				timestamp_read_head =
+					CheckForReplayedTimestamps(EndRecPtr, &replayed_timestamp);
+
 				/*
-				 * Update lastReplayedEndRecPtr after this record has been
-				 * successfully replayed.
+				 * Update lastReplayedEndRecPtr and lastReplayedTimestamp
+				 * after this record has been successfully replayed.
 				 */
 				SpinLockAcquire(&XLogCtl->info_lck);
 				XLogCtl->lastReplayedEndRecPtr = EndRecPtr;
 				XLogCtl->lastReplayedTLI = ThisTimeLineID;
+				XLogCtl->timestamps.read_head = timestamp_read_head;
+				if (replayed_timestamp != 0)
+					XLogCtl->lastReplayedTimestamp = replayed_timestamp;
 				SpinLockRelease(&XLogCtl->info_lck);
 
 				/*
 				 * If rm_redo called XLogRequestWalReceiverReply, then we wake
 				 * up the receiver so that it notices the updated
-				 * lastReplayedEndRecPtr and sends a reply to the master.
+				 * lastReplayedEndRecPtr and sends a reply to the master.  We
+				 * also wake it if we have replayed a WAL position that has
+				 * an associated timestamp so that the upstream server can
+				 * measure our replay lag.
 				 */
-				if (doRequestWalReceiverReply)
+				if (doRequestWalReceiverReply || replayed_timestamp != 0)
 				{
 					doRequestWalReceiverReply = false;
-					WalRcvForceReply();
+					WalRcvForceReply(replayed_timestamp != 0);
 				}
 
 				/* Remember this record as the last-applied one */
@@ -11750,3 +11828,81 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Record the timestamp that is associated with a WAL position.
+ *
+ * This is called by walreceiver on standby servers when new messages arrive,
+ * using a timestamp and the latest known WAL position from the upstream
+ * server.  The timestamp will be sent back to the upstream server via
+ * walreceiver when the recovery process has applied the WAL position.  The
+ * upstream server can then compute the elapsed time to estimate the replay
+ * lag.
+ */
+void
+SetXLogReplayTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn)
+{
+	Assert(AmWalReceiverProcess());
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (lsn == XLogCtl->lastReplayedEndRecPtr)
+	{
+		/*
+		 * That is the last replayed LSN: we are fully replayed, so we can
+		 * update the replay timestamp immediately.
+		 */
+		XLogCtl->lastReplayedTimestamp = timestamp;
+	}
+	else
+	{
+		/*
+		 * There is WAL still to be applied.  We will associate the timestamp
+		 * with this WAL position and wait for it to be replayed.  We add it
+		 * at the 'write' end of the circular buffer of LSN/timestamp
+		 * mappings, which the replay loop will eventually read.
+		 */
+		uint32 write_head = XLogCtl->timestamps.write_head;
+		uint32 new_write_head = (write_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+
+		if (new_write_head == XLogCtl->timestamps.read_head)
+		{
+			/*
+			 * The buffer is full, so we'll rewind and overwrite the most
+			 * recent sample.  Overwriting the most recent sample means that
+			 * if we're not replaying fast enough and the buffer fills up,
+			 * we'll effectively lower the sampling rate.
+			 */
+			new_write_head = write_head;
+			write_head = (write_head - 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+		}
+
+		XLogCtl->timestamps.buffer[write_head].lsn = lsn;
+		XLogCtl->timestamps.buffer[write_head].timestamp = timestamp;
+		XLogCtl->timestamps.write_head = new_write_head;
+	}
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Get the timestamp for the most recently applied WAL record that carried a
+ * timestamp from the upstream server, and also the most recently applied LSN.
+ * (Note that the timestamp and the LSN don't necessarily relate to the same
+ * record.)
+ *
+ * This is similar to GetLatestXTime, except that it is advanced when WAL
+ * positions recorded with SetXLogReplayTimestampAtLsn have been applied,
+ * rather than commit records.
+ */
+TimestampTz
+GetXLogReplayTimestamp(XLogRecPtr *lsn)
+{
+	TimestampTz result;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (lsn)
+		*lsn = XLogCtl->lastReplayedEndRecPtr;
+	result = XLogCtl->lastReplayedTimestamp;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return result;
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e011af1..a070844 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -679,6 +679,7 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2bb3dce..f5a10e9 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -73,6 +73,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			replay_lag_sample_interval;
 
 /* libpqreceiver hooks to these when loaded */
 walrcv_connect_type walrcv_connect = NULL;
@@ -145,7 +146,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, bool reportApplyTimestamp);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 
@@ -457,7 +458,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, false);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -494,6 +495,8 @@ WalReceiverMain(void)
 					ResetLatch(&walrcv->latch);
 					if (walrcv->force_reply)
 					{
+						bool timestamp = walrcv->force_reply_apply_timestamp;
+
 						/*
 						 * The recovery process has asked us to send apply
 						 * feedback now.  Make sure the flag is really set to
@@ -501,8 +504,9 @@ WalReceiverMain(void)
 						 * we don't miss a new request for a reply.
 						 */
 						walrcv->force_reply = false;
+						walrcv->force_reply_apply_timestamp = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, timestamp);
 					}
 				}
 				if (rc & WL_POSTMASTER_DEATH)
@@ -560,7 +564,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, false);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -911,7 +915,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, false);
 				break;
 			}
 		default:
@@ -1074,7 +1078,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1092,15 +1096,18 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
 * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
+ *
+ * If 'reportApplyTimestamp' is true, the latest apply timestamp is included.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, bool reportApplyTimestamp)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
 	XLogRecPtr	applyPtr;
 	static TimestampTz sendTime = 0;
 	TimestampTz now;
+	TimestampTz applyTimestamp = 0;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1132,7 +1139,35 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyTimestamp = GetXLogReplayTimestamp(&applyPtr);
+
+	/* Decide whether to send an apply timestamp for replay lag estimation. */
+	if (replay_lag_sample_interval != -1)
+	{
+		static TimestampTz lastTimestampSendTime = 0;
+
+		/*
+		 * Only send an apply timestamp if we were explicitly asked to by the
+		 * recovery process or if replay lag sampling is active but the
+		 * recovery process seems to be stuck.
+		 *
+		 * If we haven't heard from the recovery process in a time exceeding
+		 * wal_receiver_status_interval and yet it has not applied the highest
+		 * LSN we've heard about, then we want to resend the last replayed
+		 * timestamp we have; otherwise we zero it out and wait for the
+		 * recovery process to wake us when it has set a new accurate replay
+		 * timestamp.  Note that we can read latestWalEnd without acquiring the
+		 * mutex that protects it because it is only written to by this
+		 * process (walreceiver).
+		 */
+		if (reportApplyTimestamp ||
+			(WalRcv->latestWalEnd > applyPtr &&
+			 TimestampDifferenceExceeds(lastTimestampSendTime, now,
+										wal_receiver_status_interval * 1000)))
+			lastTimestampSendTime = now;
+		else
+			applyTimestamp = 0;
+	}
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'r');
@@ -1140,6 +1175,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, flushPtr);
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+	pq_sendint64(&reply_message, TimestampTzToIntegerTimestamp(applyTimestamp));
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
 
 	/* Send it */
@@ -1244,18 +1280,40 @@ static void
 ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 {
 	WalRcvData *walrcv = WalRcv;
-
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
+	bool newHighWalEnd = false;
+
+	static TimestampTz lastRecordedTimestamp = 0;
 
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
+	{
 		walrcv->latestWalEndTime = sendTime;
+		newHighWalEnd = true;
+	}
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
+	/*
+	 * If replay lag sampling is active, remember the upstream server's
+	 * timestamp at the latest WAL end that it has, unless we've already
+	 * done that too recently or the LSN hasn't advanced.  This timestamp
+	 * will be fed back to us by the startup process when it eventually
+	 * replays this LSN, so that we can feed it back to the upstream server
+	 * for replay lag tracking purposes.
+	 */
+	if (replay_lag_sample_interval != -1 &&
+		newHighWalEnd &&
+		sendTime > TimestampTzPlusMilliseconds(lastRecordedTimestamp,
+											   replay_lag_sample_interval))
+	{
+		SetXLogReplayTimestampAtLsn(sendTime, walEnd);
+		lastRecordedTimestamp = sendTime;
+	}
+
 	if (log_min_messages <= DEBUG2)
 	{
 		char	   *sendtime;
@@ -1291,12 +1349,14 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply.  Also used to send periodic messages
+ * which are used to compute pg_stat_replication.replay_lag.
  */
 void
-WalRcvForceReply(void)
+WalRcvForceReply(bool apply_timestamp)
 {
 	WalRcv->force_reply = true;
+	WalRcv->force_reply_apply_timestamp = apply_timestamp;
 	SetLatch(&WalRcv->latch);
 }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index bc5e508..b6211a5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1545,15 +1545,29 @@ ProcessStandbyReplyMessage(void)
 	XLogRecPtr	writePtr,
 				flushPtr,
 				applyPtr;
+	int64		applyLagUs;
 	bool		replyRequested;
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz applyTimestamp;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
 	flushPtr = pq_getmsgint64(&reply_message);
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
+	applyTimestamp = IntegerTimestampToTimestampTz(pq_getmsgint64(&reply_message));
 	replyRequested = pq_getmsgbyte(&reply_message);
 
+	/* Compute the apply lag in microseconds. */
+	if (applyTimestamp == 0)
+		applyLagUs = -1;
+	else
+#ifdef HAVE_INT64_TIMESTAMP
+		applyLagUs = now - applyTimestamp;
+#else
+		applyLagUs = (now - applyTimestamp) * 1000000;
+#endif
+
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
 		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
@@ -1575,6 +1589,8 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (applyLagUs >= 0)
+			walsnd->applyLagUs = applyLagUs;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1971,6 +1987,7 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->applyLagUs = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2753,7 +2770,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	9
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2801,6 +2818,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		int64		applyLagUs;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2815,6 +2833,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		applyLagUs = walsnd->applyLagUs;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2849,6 +2868,23 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[5] = true;
 			values[5] = LSNGetDatum(apply);
 
+			if (applyLagUs < 0)
+				nulls[6] = true;
+			else
+			{
+				Interval *applyLagInterval = palloc(sizeof(Interval));
+
+				applyLagInterval->month = 0;
+				applyLagInterval->day = 0;
+#ifdef HAVE_INT64_TIMESTAMP
+				applyLagInterval->time = applyLagUs;
+#else
+				applyLagInterval->time = applyLagUs / 1000000.0;
+#endif
+				nulls[6] = false;
+				values[6] = IntervalPGetDatum(applyLagInterval);
+			}
+
 			/*
 			 * Treat a standby such as a pg_basebackup background process
 			 * which always returns an invalid flush location, as an
@@ -2856,18 +2892,18 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			values[7] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
 			 * informational, not different from priority.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[8] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[8] = CStringGetTextDatum("sync");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[8] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index c1d6f05..323d640 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1724,6 +1724,20 @@ GetSQLLocalTimestamp(int32 typmod)
 }
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native timestamp to int64 format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * 1000000;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index da74f00..7551a0e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1810,6 +1810,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"replay_lag_sample_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+			gettext_noop("Sets the minimum time between WAL timestamp samples used to estimate replay lag."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&replay_lag_sample_interval,
+		1 * 1000, -1, INT_MAX / 1000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the maximum wait time to receive data from the primary."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7c2daa5..53d9f4b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -271,6 +271,8 @@
 					# in milliseconds; 0 disables
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
+#replay_lag_sample_interval = 1s	# min time between timestamps recorded
+					# to estimate replay lag; -1 disables replay lag sampling
 
 
 #------------------------------------------------------------------------------
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index cb5f989..9753648 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -111,7 +111,7 @@ sendFeedback(PGconn *conn, int64 now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 8 + 1];
 	int			len = 0;
 
 	/*
@@ -142,6 +142,8 @@ sendFeedback(PGconn *conn, int64 now, bool force, bool replyRequested)
 	len += 8;
 	fe_sendint64(now, &replybuf[len]);	/* sendTime */
 	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* applyTimestamp */
+	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0;		/* replyRequested */
 	len += 1;
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 4382e5d..8e89627 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -321,7 +321,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 8 + 1];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -337,6 +337,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
 	len += 8;
 	fe_sendint64(now, &replybuf[len]);	/* sendTime */
 	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* applyTimestamp */
+	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0;		/* replyRequested */
 	len += 1;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..1be2f34 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,9 @@ extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
 extern XLogRecPtr GetXLogWriteRecPtr(void);
+extern void SetXLogReplayTimestamp(TimestampTz timestamp);
+extern void SetXLogReplayTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn);
+extern TimestampTz GetXLogReplayTimestamp(XLogRecPtr *lsn);
 extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 047a1ce..6c49713 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2766,7 +2766,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,23,25}" "{o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index cd787c9..9a64bda 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -23,6 +23,7 @@
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int	replay_lag_sample_interval;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -119,6 +120,9 @@ typedef struct
 	 */
 	bool		force_reply;
 
+	/* include the latest replayed timestamp when replying? */
+	bool		force_reply_apply_timestamp;
+
 	/* set true once conninfo is ready to display (obfuscated pwds etc) */
 	bool		ready_to_display;
 
@@ -176,6 +180,6 @@ extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
-extern void WalRcvForceReply(void);
+extern void WalRcvForceReply(bool apply_timestamp);
 
 #endif   /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7794aa5..4de43e8 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -46,6 +46,7 @@ typedef struct WalSnd
 	XLogRecPtr	write;
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
+	int64		applyLagUs;
 
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 93b90fe..20517c9 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -233,9 +233,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 031e8c2..8fbdeae 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1804,10 +1804,11 @@ pg_stat_replication| SELECT s.pid,
     w.write_location,
     w.flush_location,
     w.replay_location,
+    w.replay_lag,
     w.sync_priority,
     w.sync_state
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
#4Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Thomas Munro (#3)
Re: Measuring replay lag

On 11/22/16 4:27 AM, Thomas Munro wrote:

Thanks very much for testing! New version attached. I will add this
to the next CF.

I don't see it there yet.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#5Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#1)
Re: Measuring replay lag

On 26 October 2016 at 11:34, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

It works by taking advantage of the { time, end-of-WAL } samples that
sending servers already include in message headers to standbys. That
seems to provide a pretty good proxy for when the WAL was written, if
you ignore messages where the LSN hasn't advanced. The patch
introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
control how often the standby should record these timestamped LSNs
into a small circular buffer. When its recovery process eventually
replays a timestamped LSN, the timestamp is sent back to the upstream
server in a new reply message field. The value visible in
pg_stat_replication.replay_lag can then be updated by comparing with
the current time.

Why not just send back the lag as calculated by max_standby_streaming_delay?
I.e. at the end of replay of each chunk, record the current delay in
shmem, then send it back periodically.

If we have two methods of calculation it would be confusing.

Admittedly the approach here is the same one I advocated some years
back when Robert and I were discussing time-delayed standbys.

Compared to the usual techniques people use to estimate replay lag,
this approach has the following advantages:

1. The lag is measured in time, not LSN difference.
2. The lag time is computed using two observations of a single
server's clock, so there is no clock skew.
3. The lag is updated even between commits (during large data loads etc).

Yes, good reasons.

In the previous version I was effectively showing the ping time
between the servers during idle times when the standby was fully
caught up because there was nothing happening. I decided that was not
useful information and that it's more newsworthy and interesting to
see the estimated replay lag for the most recent real replayed
activity, so I changed that.

In the last thread[1], Robert Haas wrote:

Well, one problem with this is that you can't put a loop inside of a
spinlock-protected critical section.

Fixed.

In general, I think this is a pretty reasonable way of attacking this
problem, but I'd say it's significantly under-commented. Where should
someone go to get a general overview of this mechanism? The answer is
not "at place XXX within the patch". (I think it might merit some
more extensive documentation, too, although I'm not exactly sure what
that should look like.)

I have added lots of comments.

When you overflow the buffer, you could thin in out in a smarter way,
like by throwing away every other entry instead of the oldest one. I
guess you'd need to be careful how you coded that, though, because
replaying an entry with a timestamp invalidates some of the saved
entries without formally throwing them out.

Done, by overwriting the newest sample rather than the oldest if the
buffer is full. That seems to give pretty reasonable degradation,
effectively lowering the sampling rate, without any complicated buffer
or rate management code.

Conceivably, 0002 could be split into two patches, one of which
computes "stupid replay lag" considering only records that naturally
carry timestamps, and a second adding the circular buffer to handle
the case where much time passes without finding such a record.

I contemplated this but decided that it'd be best to use ONLY samples
from walsender headers, and never use the time stamps from commit
records for this. If we use times from commit records, then a
cascading sending server will not be able to compute the difference in
time without introducing clock skew (not to mention the difficulty of
combining timestamps from two sources if we try to do both). I
figured that it's better to have value that shows a cascading
sender->standby->cascading sender round trip time that is free of
clock skew, than a master->cascading sender->standby->cascading sender
incomplete round trip that includes clock skew.

By the same reasoning I decided against introducing a new periodic WAL
record or field from the master to hold extra time stamps in between
commits to do this, in favour of the buffered transient timestamp
approach I took in this patch. That said, I can see there are
arguments for doing it with extra periodic WAL timestamps, if people
don't think it'd be too invasive to mess with the WAL for this, and
don't care about cascading standbys giving skewed readings. One
advantage would be that persistent WAL timestamps would still be able
to provide lag estimates if a standby has been down for a while and
was catching up, and this approach can't until it's caught up due to
lack of buffered transient timestamps. Thoughts?

-1 for adding anything to the WAL for this.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#6Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Peter Eisentraut (#4)
1 attachment(s)
Re: Measuring replay lag

On Mon, Dec 19, 2016 at 4:03 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 11/22/16 4:27 AM, Thomas Munro wrote:

Thanks very much for testing! New version attached. I will add this
to the next CF.

I don't see it there yet.

Thanks for the reminder. Added here: https://commitfest.postgresql.org/12/920/

Here's a rebased patch.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replay-lag-v14.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3b614b6..36e00e0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3282,6 +3282,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-replay-lag-sample-interval" xreflabel="replay_lag_sample_interval">
+      <term><varname>replay_lag_sample_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>replay_lag_sample_interval</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls how often a standby should sample replay lag information to
+        send back to the primary or upstream standby while replaying WAL.  The
+        default is 1 second.  Units are milliseconds if not specified.  A
+        value of -1 disables the reporting of replay lag.  Estimated replay lag
+        can be seen in the <link linkend="monitoring-stats-views-table">
+        <literal>pg_stat_replication</></link> view of the upstream server.
+        This parameter can only be set
+        in the <filename>postgresql.conf</> file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby_feedback">
       <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5b58d2e..79d2d95 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1401,6 +1401,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be replayed on this
+      standby server</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index aa9ee5a..739e45e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -82,6 +82,8 @@ extern uint32 bootstrap_data_checksum_version;
 #define PROMOTE_SIGNAL_FILE		"promote"
 #define FALLBACK_PROMOTE_SIGNAL_FILE "fallback_promote"
 
+/* Size of the circular buffer of timestamped LSNs. */
+#define XLOG_TIMESTAMP_BUFFER_SIZE 8192
 
 /* User-settable parameters */
 int			max_wal_size = 64;	/* 1 GB */
@@ -520,6 +522,26 @@ typedef struct XLogCtlInsert
 } XLogCtlInsert;
 
 /*
+ * A sample associating a timestamp with a given xlog position.
+ */
+typedef struct XLogTimestamp
+{
+	TimestampTz	timestamp;
+	XLogRecPtr	lsn;
+} XLogTimestamp;
+
+/*
+ * A circular buffer of LSNs and associated timestamps.  The buffer is empty
+ * when read_head == write_head.
+ */
+typedef struct XLogTimestampBuffer
+{
+	uint32			read_head;
+	uint32			write_head;
+	XLogTimestamp	buffer[XLOG_TIMESTAMP_BUFFER_SIZE];
+} XLogTimestampBuffer;
+
+/*
  * Total shared-memory state for XLOG.
  */
 typedef struct XLogCtlData
@@ -637,6 +659,12 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/* timestamp of the most recently applied WAL position that carried a timestamp. */
+	TimestampTz lastReplayedTimestamp;
+
+	/* a buffer of upstream timestamps for WAL that is not yet applied. */
+	XLogTimestampBuffer timestamps;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -5976,6 +6004,44 @@ CheckRequiredParameterValues(void)
 }
 
 /*
+ * Called by the startup process after it has replayed up to 'lsn'.  Checks
+ * for timestamps associated with WAL positions that have now been replayed.
+ * If any are found, the latest such timestamp found is written to
+ * '*timestamp'.  Returns the new buffer read head position, which the caller
+ * should write into XLogCtl->timestamps.read_head while holding info_lck.
+ */
+static uint32
+CheckForReplayedTimestamps(XLogRecPtr lsn, TimestampTz *timestamp)
+{
+	uint32 read_head;
+
+	/*
+	 * It's OK to access timestamps.read_head without any kind of synchronization
+	 * because this process is the only one to write to it.
+	 */
+	Assert(AmStartupProcess());
+	read_head = XLogCtl->timestamps.read_head;
+
+	/*
+	 * It's OK to access write_head without interlocking because it's an
+	 * aligned 32 bit value which we can read atomically on all supported
+	 * platforms to get some recent value, not a torn/garbage value.
+	 * Furthermore we must see a value that is at least as recent as any WAL
+	 * that we have replayed, because walreceiver calls
+	 * SetXLogReplayTimestampAtLsn before passing the corresponding WAL data
+	 * to the recovery process.
+	 */
+	while (read_head != XLogCtl->timestamps.write_head &&
+		   XLogCtl->timestamps.buffer[read_head].lsn <= lsn)
+	{
+		*timestamp = XLogCtl->timestamps.buffer[read_head].timestamp;
+		read_head = (read_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+	}
+
+	return read_head;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
@@ -6794,6 +6860,8 @@ StartupXLOG(void)
 			do
 			{
 				bool		switchedTLI = false;
+				TimestampTz	replayed_timestamp = 0;
+				uint32		timestamp_read_head;
 
 #ifdef WAL_DEBUG
 				if (XLOG_DEBUG ||
@@ -6947,24 +7015,34 @@ StartupXLOG(void)
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
+				/* Check if we have replayed a timestamped WAL position */
+				timestamp_read_head =
+					CheckForReplayedTimestamps(EndRecPtr, &replayed_timestamp);
+
 				/*
-				 * Update lastReplayedEndRecPtr after this record has been
-				 * successfully replayed.
+				 * Update lastReplayedEndRecPtr and lastReplayedTimestamp
+				 * after this record has been successfully replayed.
 				 */
 				SpinLockAcquire(&XLogCtl->info_lck);
 				XLogCtl->lastReplayedEndRecPtr = EndRecPtr;
 				XLogCtl->lastReplayedTLI = ThisTimeLineID;
+				XLogCtl->timestamps.read_head = timestamp_read_head;
+				if (replayed_timestamp != 0)
+					XLogCtl->lastReplayedTimestamp = replayed_timestamp;
 				SpinLockRelease(&XLogCtl->info_lck);
 
 				/*
 				 * If rm_redo called XLogRequestWalReceiverReply, then we wake
 				 * up the receiver so that it notices the updated
-				 * lastReplayedEndRecPtr and sends a reply to the master.
+				 * lastReplayedEndRecPtr and sends a reply to the master.  We
+				 * also wake it if we have replayed a WAL position that has
+				 * an associated timestamp so that the upstream server can
+				 * measure our replay lag.
 				 */
-				if (doRequestWalReceiverReply)
+				if (doRequestWalReceiverReply || replayed_timestamp != 0)
 				{
 					doRequestWalReceiverReply = false;
-					WalRcvForceReply();
+					WalRcvForceReply(replayed_timestamp != 0);
 				}
 
 				/* Remember this record as the last-applied one */
@@ -11745,3 +11823,81 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Record the timestamp that is associated with a WAL position.
+ *
+ * This is called by walreceiver on standby servers when new messages arrive,
+ * using a timestamp and the latest known WAL position from the upstream
+ * server.  The timestamp will be sent back to the upstream server via
+ * walreceiver when the recovery process has applied the WAL position.  The
+ * upstream server can then compute the elapsed time to estimate the replay
+ * lag.
+ */
+void
+SetXLogReplayTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn)
+{
+	Assert(AmWalReceiverProcess());
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (lsn == XLogCtl->lastReplayedEndRecPtr)
+	{
+		/*
+		 * That is the last replayed LSN: we are fully replayed, so we can
+		 * update the replay timestamp immediately.
+		 */
+		XLogCtl->lastReplayedTimestamp = timestamp;
+	}
+	else
+	{
+		/*
+		 * There is WAL still to be applied.  We will associate the timestamp
+		 * with this WAL position and wait for it to be replayed.  We add it
+		 * at the 'write' end of the circular buffer of LSN/timestamp
+		 * mappings, which the replay loop will eventually read.
+		 */
+		uint32 write_head = XLogCtl->timestamps.write_head;
+		uint32 new_write_head = (write_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+
+		if (new_write_head == XLogCtl->timestamps.read_head)
+		{
+			/*
+			 * The buffer is full, so we'll rewind and overwrite the most
+			 * recent sample.  Overwriting the most recent sample means that
+			 * if we're not replaying fast enough and the buffer fills up,
+			 * we'll effectively lower the sampling rate.
+			 */
+			new_write_head = write_head;
+			write_head = (write_head - 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+		}
+
+		XLogCtl->timestamps.buffer[write_head].lsn = lsn;
+		XLogCtl->timestamps.buffer[write_head].timestamp = timestamp;
+		XLogCtl->timestamps.write_head = new_write_head;
+	}
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Get the timestamp for the most recently applied WAL record that carried a
+ * timestamp from the upstream server, and also the most recently applied LSN.
+ * (Note that the timestamp and the LSN don't necessarily relate to the same
+ * record.)
+ *
+ * This is similar to GetLatestXTime, except that it is advanced when WAL
+ * positions recorded with SetXLogReplayTimestampAtLsn have been applied,
+ * rather than commit records.
+ */
+TimestampTz
+GetXLogReplayTimestamp(XLogRecPtr *lsn)
+{
+	TimestampTz result;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (lsn)
+		*lsn = XLogCtl->lastReplayedEndRecPtr;
+	result = XLogCtl->lastReplayedTimestamp;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return result;
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 48e7c4b..e0e45fa 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -685,6 +685,7 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..9cf9f4c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -73,6 +73,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			replay_lag_sample_interval;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
@@ -138,7 +139,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, bool includeApplyTimestamp);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 
@@ -456,7 +457,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, false);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -493,6 +494,8 @@ WalReceiverMain(void)
 					ResetLatch(walrcv->latch);
 					if (walrcv->force_reply)
 					{
+						bool timestamp = walrcv->force_reply_apply_timestamp;
+
 						/*
 						 * The recovery process has asked us to send apply
 						 * feedback now.  Make sure the flag is really set to
@@ -500,8 +503,9 @@ WalReceiverMain(void)
 						 * we don't miss a new request for a reply.
 						 */
 						walrcv->force_reply = false;
+						walrcv->force_reply_apply_timestamp = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, timestamp);
 					}
 				}
 				if (rc & WL_POSTMASTER_DEATH)
@@ -559,7 +563,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, false);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -911,7 +915,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, false);
 				break;
 			}
 		default:
@@ -1074,7 +1078,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1092,15 +1096,18 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
+ *
+ * If 'reportApplyTimestamp' is true, the latest apply timestamp is included.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, bool reportApplyTimestamp)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
 	XLogRecPtr	applyPtr;
 	static TimestampTz sendTime = 0;
 	TimestampTz now;
+	TimestampTz applyTimestamp = 0;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1132,7 +1139,35 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyTimestamp = GetXLogReplayTimestamp(&applyPtr);
+
+	/* Decide whether to send an apply timestamp for replay lag estimation. */
+	if (replay_lag_sample_interval != -1)
+	{
+		static TimestampTz lastTimestampSendTime = 0;
+
+		/*
+		 * Only send an apply timestamp if we were explicitly asked to by the
+		 * recovery process or if replay lag sampling is active but the
+		 * recovery process seems to be stuck.
+		 *
+		 * If we haven't heard from the recovery process in a time exceeding
+		 * wal_receiver_status_interval and yet it has not applied the highest
+		 * LSN we've heard about, then we want to resend the last replayed
+		 * timestamp we have; otherwise we zero it out and wait for the
+		 * recovery process to wake us when it has set a new accurate replay
+		 * timestamp.  Note that we can read latestWalEnd without acquiring the
+		 * mutex that protects it because it is only written to by this
+		 * process (walreceiver).
+		 */
+		if (reportApplyTimestamp ||
+			(WalRcv->latestWalEnd > applyPtr &&
+			 TimestampDifferenceExceeds(lastTimestampSendTime, now,
+										wal_receiver_status_interval * 1000)))
+			lastTimestampSendTime = now;
+		else
+			applyTimestamp = 0;
+	}
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'r');
@@ -1140,6 +1175,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, flushPtr);
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+	pq_sendint64(&reply_message, TimestampTzToIntegerTimestamp(applyTimestamp));
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
 
 	/* Send it */
@@ -1244,18 +1280,40 @@ static void
 ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 {
 	WalRcvData *walrcv = WalRcv;
-
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
+	bool newHighWalEnd = false;
+
+	static TimestampTz lastRecordedTimestamp = 0;
 
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
+	{
 		walrcv->latestWalEndTime = sendTime;
+		newHighWalEnd = true;
+	}
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
+	/*
+	 * If replay lag sampling is active, remember the upstream server's
+	 * timestamp at the latest WAL end that it has, unless we've already
+	 * done that too recently or the LSN hasn't advanced.  This timestamp
+	 * will be fed back to us by the startup process when it eventually
+	 * replays this LSN, so that we can feed it back to the upstream server
+	 * for replay lag tracking purposes.
+	 */
+	if (replay_lag_sample_interval != -1 &&
+		newHighWalEnd &&
+		sendTime > TimestampTzPlusMilliseconds(lastRecordedTimestamp,
+											   replay_lag_sample_interval))
+	{
+		SetXLogReplayTimestampAtLsn(sendTime, walEnd);
+		lastRecordedTimestamp = sendTime;
+	}
+
 	if (log_min_messages <= DEBUG2)
 	{
 		char	   *sendtime;
@@ -1291,12 +1349,14 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply.  Also used to send periodic messages
+ * which are used to compute pg_stat_replication.replay_lag.
  */
 void
-WalRcvForceReply(void)
+WalRcvForceReply(bool apply_timestamp)
 {
 	WalRcv->force_reply = true;
+	WalRcv->force_reply_apply_timestamp = apply_timestamp;
 	if (WalRcv->latch)
 		SetLatch(WalRcv->latch);
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d80bcc0..0782d78 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1553,15 +1553,29 @@ ProcessStandbyReplyMessage(void)
 	XLogRecPtr	writePtr,
 				flushPtr,
 				applyPtr;
+	int64		applyLagUs;
 	bool		replyRequested;
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz applyTimestamp;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
 	flushPtr = pq_getmsgint64(&reply_message);
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
+	applyTimestamp = IntegerTimestampToTimestampTz(pq_getmsgint64(&reply_message));
 	replyRequested = pq_getmsgbyte(&reply_message);
 
+	/* Compute the apply lag in microseconds. */
+	if (applyTimestamp == 0)
+		applyLagUs = -1;
+	else
+#ifdef HAVE_INT64_TIMESTAMP
+		applyLagUs = now - applyTimestamp;
+#else
+		applyLagUs = (now - applyTimestamp) * 1000000;
+#endif
+
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
 		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
@@ -1583,6 +1597,8 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (applyLagUs >= 0)
+			walsnd->applyLagUs = applyLagUs;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1979,6 +1995,7 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->applyLagUs = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2761,7 +2778,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	9
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2809,6 +2826,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		int64		applyLagUs;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2823,6 +2841,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		applyLagUs = walsnd->applyLagUs;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2857,6 +2876,23 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[5] = true;
 			values[5] = LSNGetDatum(apply);
 
+			if (applyLagUs < 0)
+				nulls[6] = true;
+			else
+			{
+				Interval *applyLagInterval = palloc(sizeof(Interval));
+
+				applyLagInterval->month = 0;
+				applyLagInterval->day = 0;
+#ifdef HAVE_INT64_TIMESTAMP
+				applyLagInterval->time = applyLagUs;
+#else
+				applyLagInterval->time = applyLagUs / 1000000.0;
+#endif
+				nulls[6] = false;
+				values[6] = IntervalPGetDatum(applyLagInterval);
+			}
+
 			/*
 			 * Treat a standby such as a pg_basebackup background process
 			 * which always returns an invalid flush location, as an
@@ -2864,18 +2900,18 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			values[7] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
 			 * informational, not different from priority.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[8] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[8] = CStringGetTextDatum("sync");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[8] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index c1d6f05..323d640 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1724,6 +1724,20 @@ GetSQLLocalTimestamp(int32 typmod)
 }
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native timestamp to int64 format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * 1000000;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a025117..b1af028 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1810,6 +1810,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"replay_lag_sample_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+			gettext_noop("Sets the minimum time between WAL timestamp samples used to estimate replay lag."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&replay_lag_sample_interval,
+		1 * 1000, -1, INT_MAX / 1000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the maximum wait time to receive data from the primary."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7f9acfd..bf298f2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -270,6 +270,8 @@
 					# in milliseconds; 0 disables
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
+#replay_lag_sample_interval = 1s	# min time between timestamps recorded
+					# to estimate replay lag; -1 disables replay lag sampling
 
 
 #------------------------------------------------------------------------------
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index cb5f989..9753648 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -111,7 +111,7 @@ sendFeedback(PGconn *conn, int64 now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 8 + 1];
 	int			len = 0;
 
 	/*
@@ -142,6 +142,8 @@ sendFeedback(PGconn *conn, int64 now, bool force, bool replyRequested)
 	len += 8;
 	fe_sendint64(now, &replybuf[len]);	/* sendTime */
 	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* applyTimestamp */
+	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0;		/* replyRequested */
 	len += 1;
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 4382e5d..8e89627 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -321,7 +321,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 8 + 1];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -337,6 +337,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
 	len += 8;
 	fe_sendint64(now, &replybuf[len]);	/* sendTime */
 	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* applyTimestamp */
+	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0;		/* replyRequested */
 	len += 1;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..1be2f34 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -237,6 +237,9 @@ extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
 extern XLogRecPtr GetXLogWriteRecPtr(void);
+extern void SetXLogReplayTimestamp(TimestampTz timestamp);
+extern void SetXLogReplayTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn);
+extern TimestampTz GetXLogReplayTimestamp(XLogRecPtr *lsn);
 extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index cd7b909..b565dd8 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2768,7 +2768,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,23,25}" "{o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 28dc1fc..be25758 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -23,6 +23,7 @@
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int	replay_lag_sample_interval;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -119,6 +120,9 @@ typedef struct
 	 */
 	bool		force_reply;
 
+	/* include the latest replayed timestamp when replying? */
+	bool		force_reply_apply_timestamp;
+
 	/* set true once conninfo is ready to display (obfuscated pwds etc) */
 	bool		ready_to_display;
 
@@ -208,6 +212,6 @@ extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
-extern void WalRcvForceReply(void);
+extern void WalRcvForceReply(bool apply_timestamp);
 
 #endif   /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7794aa5..4de43e8 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -46,6 +46,7 @@ typedef struct WalSnd
 	XLogRecPtr	write;
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
+	int64		applyLagUs;
 
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 93b90fe..20517c9 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -233,9 +233,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5314b9c..d59956f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1809,10 +1809,11 @@ pg_stat_replication| SELECT s.pid,
     w.write_location,
     w.flush_location,
     w.replay_location,
+    w.replay_lag,
     w.sync_priority,
     w.sync_state
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
#7Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#5)
Re: Measuring replay lag

On Mon, Dec 19, 2016 at 10:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 26 October 2016 at 11:34, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

It works by taking advantage of the { time, end-of-WAL } samples that
sending servers already include in message headers to standbys. That
seems to provide a pretty good proxy for when the WAL was written, if
you ignore messages where the LSN hasn't advanced. The patch
introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
control how often the standby should record these timestamped LSNs
into a small circular buffer. When its recovery process eventually
replays a timestamped LSN, the timestamp is sent back to the upstream
server in a new reply message field. The value visible in
pg_stat_replication.replay_lag can then be updated by comparing with
the current time.

Why not just send back the lag as calculated by max_standby_streaming_delay?
I.e. at the end of replay of each chunk, record the current delay in
shmem, then send it back periodically.

If we have two methods of calculation it would be confusing.

Hmm. If I understand correctly, GetStandbyLimitTime is measuring
something a bit different: it computes how long it has been since the
recovery process received the chunk that it's currently trying to
apply, most interestingly in the case where we are waiting due to
conflicts. It doesn't include time in walsender, on the network, in
walreceiver, or writing and flushing and reading before it arrives in
the recovery process. If I'm reading it correctly, it only updates
XLogReceiptTime when it's completely caught up applying all earlier
chunks, so when it falls behind, its measure of lag has a growing-only
phase and a reset that can only be triggered by catching up to the
latest chunk. That seems OK for its intended purpose of putting a cap
on the delay introduced by conflicts. But that's not what I'm trying
to provide here.

The purpose of this proposal is to show the replay_lag as judged by
the sending server: in the case of a primary server, that is an
indication of how long commits done here will take to become visible
to users over there, and how long a COMMIT with synchronous_commit =
remote_apply will take to come back. It measures the WAL's whole
journey, and does so in a smooth way that shows accurate information
even if the standby never quite catches up for long periods.

Example 1: Suppose I have two servers right next each other, and the
primary server has periods of high activity which exceed the standby's
replay rate, perhaps because of slower/busier hardware, or because of
conflicts with other queries, or because our single-core 'REDO' can't
always keep up with multi-core 'DO'. By the logic of
max_standby_streaming_delay, if it never catches up to the latest
chunk but remains a fluctuating number of chunks behind, then AIUI the
standby will compute a constantly increasing lag. By my logic, the
primary will tell you quite accurately how far behind the standby's
recovery is at regular intervals, showing replay_lag fluctuating up
and down as appropriate, even if it never quite catches up. It can do
that because it has a buffer full of regularly spaced out samples to
work through, and even if you exceed the buffer size (8192 seconds'
worth by default) it'll just increase the interval between samples.
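
To make that degradation behaviour concrete, here is a minimal
standalone sketch of the buffer's overflow handling (simplified names
and types, mirroring StoreXLogTimestampAtLsn in the patch rather than
quoting it):

#include <stdint.h>

#define BUFFER_SIZE 8192	/* one sample per second for ~2.27 hours */

typedef struct
{
	int64_t		timestamp;
	uint64_t	lsn;
} Sample;

typedef struct
{
	uint32_t	read_head;	/* next slot to consume */
	uint32_t	write_head;	/* next slot to fill; empty when heads are equal */
	Sample		buffer[BUFFER_SIZE];
} SampleBuffer;

static void
store_sample(SampleBuffer *buf, int64_t timestamp, uint64_t lsn)
{
	uint32_t	write_head = buf->write_head;
	uint32_t	new_write_head = (write_head + 1) % BUFFER_SIZE;

	if (new_write_head == buf->read_head)
	{
		/*
		 * Buffer full: overwrite the newest sample instead of advancing,
		 * effectively lowering the sampling rate until the reader catches
		 * up.  The unsigned wraparound in (write_head - 1) is safe because
		 * BUFFER_SIZE is a power of two.
		 */
		new_write_head = write_head;
		write_head = (write_head - 1) % BUFFER_SIZE;
	}

	buf->buffer[write_head].lsn = lsn;
	buf->buffer[write_head].timestamp = timestamp;
	buf->write_head = new_write_head;
}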

Example 2: Suppose I have servers on opposite sides of the world with
a ping time of 300ms. By the logic used for
max_standby_streaming_delay, the lag computed by the standby would be
close to zero when there is no concurrent activity to conflict with.
I don't think that's what users other than the recovery-conflict
resolution code want to know. By my logic, replay_lag computed by the
primary would show 300ms + a tiny bit more, which is how long it takes
for committed transactions to be visible to user queries on the
standby and for us to know that that is the case. That's interesting
because it tells you how long synchronous_commit = remote_apply would
make you wait (if that server is waited for according to syncrep
config).
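
To spell out the arithmetic in that example, here is a rough timeline
(a sketch under the stated 300ms round-trip assumption, treating apply
time on the standby as negligible):

/*
 * t0            primary stamps (lsn, t0) into a WAL message header
 * t0 + ~150ms   standby receives the message, records the sample, and
 *               eventually writes, flushes and applies up to lsn
 * t0 + ~150ms   standby sends t0 back in its next reply
 * t0 + ~300ms   primary computes replay_lag = now - t0 = 300ms + a bit
 *
 * Both ends of the subtraction come from the primary's clock, so skew
 * between the two machines' clocks never enters the estimate.
 */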

In summary, the max_standby_streaming_delay approach only measures
activity inside the recovery process on the standby, and only uses a
single variable for timestamp tracking, so although it's semi-related
it's not what I wanted to show.

(I suppose there might be an argument that max_standby_streaming_delay
should also track received-on-standby time for each sampled LSN in a
circular buffer, and then use that information to implement
max_standby_streaming_delay more fairly.  We would only need to cancel
queries that conflict with WAL records that have truly been waiting
max_standby_streaming_delay since receive time.  Today, as soon as
max_standby_streaming_delay is exceeded while trying to apply *any*
WAL record, we cancel everything that conflicts with recovery until
we're caught up to the last chunk.  This may not make any sense or be
worth doing, just an idea...)

Admittedly the approach here is the same one I advocated some years
back when Robert and I were discussing time-delayed standbys.

That is reassuring!

--
Thomas Munro
http://www.enterprisedb.com


#8Fujii Masao
masao.fujii@gmail.com
In reply to: Thomas Munro (#6)
Re: Measuring replay lag

On Mon, Dec 19, 2016 at 8:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Dec 19, 2016 at 4:03 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 11/22/16 4:27 AM, Thomas Munro wrote:

Thanks very much for testing! New version attached. I will add this
to the next CF.

I don't see it there yet.

Thanks for the reminder. Added here: https://commitfest.postgresql.org/12/920/

Here's a rebased patch.

I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.

For that, what about maintaining the pairs of send-timestamp and LSN in
*sender side* instead of receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.

As a bonus of this approach, we don't need to add the field into the
reply message that walreceiver can send back very frequently, which
might be helpful in terms of networking overhead.

Regards,

--
Fujii Masao


#9Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Fujii Masao (#8)
Re: Measuring replay lag

On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.

Good idea. I will think about how to make that work. There was a
proposal to make writing and flushing independent[1].  I'd like that
to go in. Then the write_lag and flush_lag could diverge
significantly, and it would be nice to be able to see that effect as
time (though you could already see it with LSN positions).

For that, what about maintaining the pairs of send-timestamp and LSN in
*sender side* instead of receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.

I thought about that too, but I couldn't figure out how to make the
sampling work.  If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its own
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything: a reply that covers a sampled
LSN may arrive long after the write/flush/apply event it reports, so
the computed lag would include up to a whole status interval of
reporting delay.

You could fix that by making the standby send a reply *every time* it
applies some WAL (like it does for transactions committing with
synchronous_commit = remote_apply, though that is only for commit
records), but then we'd be generating a lot of recovery->walreceiver
communication and standby->primary network traffic, even for people
who don't otherwise need it. It seems unacceptable.

Or you could fix that by setting the XACT_COMPLETION_APPLY_FEEDBACK
bit in the xl_xinfo.xinfo for selected transactions, as a way to ask
the standby to send a reply when that commit record is applied, but
that only works for commit records. One of my goals was to be able to
report lag accurately even between commits (very large data load
transactions etc).

Or you could fix that by sending a list of 'interesting LSNs' to the
standby, as a way to ask it to send a reply when those LSNs are
applied. Then you'd need a circular buffer of (LSN, time) pairs in
the primary AND a circular buffer of LSNs in the standby to remember
which locations should generate a reply. This doesn't seem to be an
improvement.

That's why I thought that the standby should have the (LSN, time)
buffer: it decides which samples to record in its buffer, using LSN
and time provided by the sending server, and then it can send replies
at exactly the right times. The LSNs don't have to be commit records,
they're just arbitrary points in the WAL stream which we attach
timestamps to. IPC and network overhead is minimised, and accuracy is
maximised.

As a bonus of this approach, we don't need to add the field into the replay
message that walreceiver can very frequently send back. Which might be
helpful in terms of networking overhead.

For the record, these replies are only sent approximately every
replay_lag_sample_interval (with variation depending on replay speed)
and are only 42 bytes with the new field added.
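
For what it's worth, the 42 comes from the reply message layout with a
single extra int64 field (a back-of-the-envelope count assuming the
existing 'r' message format; the three-timestamp variant posted below
would be 16 bytes larger):

/*
 * msgtype 'r'       1 byte
 * writePtr          8 bytes
 * flushPtr          8 bytes
 * applyPtr          8 bytes
 * sendTime          8 bytes
 * applyTimestamp    8 bytes   <-- the new field
 * replyRequested    1 byte
 *                  --------
 *                  42 bytes
 */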

[1]: /messages/by-id/CA+U5nMJifauXvVbx=v3UbYbHO3Jw2rdT4haL6CCooEDM5=4ASQ@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com


#10Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#9)
1 attachment(s)
Re: Measuring replay lag

On Thu, Dec 22, 2016 at 10:14 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.

Good idea. I will think about how to make that work.

Here is an experimental version that reports the write, flush and
apply lag separately as requested. This is done with three separate
(lsn, timestamp) buffers on the standby side. The GUC is now called
replication_lag_sample_interval. Not tested much yet.

postgres=# select application_name, write_lag, flush_lag, replay_lag
from pg_stat_replication ;
 application_name |    write_lag    |    flush_lag    |   replay_lag
------------------+-----------------+-----------------+-----------------
 replica1         | 00:00:00.032408 | 00:00:00.032409 | 00:00:00.697858
 replica2         | 00:00:00.032579 | 00:00:00.03258  | 00:00:00.551125
 replica3         | 00:00:00.033686 | 00:00:00.033687 | 00:00:00.670571
 replica4         | 00:00:00.032861 | 00:00:00.032862 | 00:00:00.521902
(4 rows)

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replay-lag-v15.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8d7b3bf..b894e31 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3310,6 +3310,26 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-replication-lag-sample-interval" xreflabel="replication_lag_sample_interval">
+      <term><varname>replication_lag_sample_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>replication_lag_sample_interval</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls how often a standby should sample timestamps from upstream to
+        send back to the primary or upstream standby after writing, flushing
+        and replaying WAL.  The default is 1 second.  Units are milliseconds if
+        not specified.  A value of -1 disables the reporting of replication
+        lag.  Estimated lag can be seen in the <link linkend="monitoring-stats-views-table">
+        <literal>pg_stat_replication</></link> view of the upstream server.
+        This parameter can only be set
+        in the <filename>postgresql.conf</> file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby_feedback">
       <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1545f03..a422ac0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1405,6 +1405,24 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be written on this
+      standby server</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be flushed on this
+      standby server</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be replayed on this
+      standby server</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f8ffa5c..5a5e5cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -82,6 +82,8 @@ extern uint32 bootstrap_data_checksum_version;
 #define PROMOTE_SIGNAL_FILE		"promote"
 #define FALLBACK_PROMOTE_SIGNAL_FILE "fallback_promote"
 
+/* Size of the circular buffer of timestamped LSNs. */
+#define XLOG_TIMESTAMP_BUFFER_SIZE 8192
 
 /* User-settable parameters */
 int			max_wal_size = 64;	/* 1 GB */
@@ -530,6 +532,26 @@ typedef struct XLogCtlInsert
 } XLogCtlInsert;
 
 /*
+ * A sample associating a timestamp with a given xlog position.
+ */
+typedef struct XLogTimestamp
+{
+	TimestampTz	timestamp;
+	XLogRecPtr	lsn;
+} XLogTimestamp;
+
+/*
+ * A circular buffer of LSNs and associated timestamps.  The buffer is empty
+ * when read_head == write_head.
+ */
+typedef struct XLogTimestampBuffer
+{
+	uint32			read_head;
+	uint32			write_head;
+	XLogTimestamp	buffer[XLOG_TIMESTAMP_BUFFER_SIZE];
+} XLogTimestampBuffer;
+
+/*
  * Total shared-memory state for XLOG.
  */
 typedef struct XLogCtlData
@@ -648,6 +670,14 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/* timestamp of the most recently applied WAL position that carried a timestamp. */
+	TimestampTz lastReplayedTimestamp;
+
+	/* buffers of timestamps for WAL that is not yet written/flushed/applied. */
+	XLogTimestampBuffer writeTimestamps;
+	XLogTimestampBuffer flushTimestamps;
+	XLogTimestampBuffer applyTimestamps;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -6006,6 +6036,96 @@ CheckRequiredParameterValues(void)
 }
 
 /*
+ * Read and consume all records from 'buffer' whose position is <= 'lsn'.
+ * Return true if any such records are found, and write the latest timestamp
+ * found into *timestamp.  Write the new read head position into *read_head,
+ * so that the caller can store it with appropriate locking.
+ */
+static bool
+ReadXLogTimestampForLsn(XLogTimestampBuffer *buffer,
+						XLogRecPtr lsn,
+						uint32 *read_head,
+						TimestampTz *timestamp)
+{
+	bool found = false;
+
+	/*
+	 * It's OK to access buffer->read_head without any kind of synchronization
+	 * because in all cases the caller is the only process reading from the
+	 * buffer (i.e. the only one advancing buffer->read_head).
+	 */
+	*read_head = buffer->read_head;
+
+	/*
+	 * It's OK to access write_head without interlocking because it's an
+	 * aligned 32 bit value which we can read atomically on all supported
+	 * platforms to get some recent value, not a torn/garbage value.
+	 * Furthermore we must see a value that is at least as recent as any WAL
+	 * that we have written/flushed/replayed, because walreceiver calls
+	 * SetXLogTimestampAtLsn before writing.
+	 */
+	while (*read_head != buffer->write_head &&
+		   buffer->buffer[*read_head].lsn <= lsn)
+	{
+		found = true;
+		*timestamp = buffer->buffer[*read_head].timestamp;
+		*read_head = (*read_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+	}
+
+	return found;
+}
+
+/*
+ * Called by the WAL receiver process after it has written up to 'lsn'.
+ * Return true if it has written any LSN location that had an associated
+ * timestamp, and write the timestamp to '*timestamp'.
+ */
+bool
+CheckForWrittenTimestampedLsn(XLogRecPtr lsn, TimestampTz *timestamp)
+{
+	Assert(AmWalReceiverProcess());
+
+	return ReadXLogTimestampForLsn(&XLogCtl->writeTimestamps, lsn,
+								   &XLogCtl->writeTimestamps.read_head,
+								   timestamp);
+}
+
+/*
+ * Called by the WAL receiver process after it has flushed up to 'lsn'.
+ * Return true if it has flushed any LSN location that had an associated
+ * timestamp, and write the timestamp to '*timestamp'.
+ */
+bool
+CheckForFlushedTimestampedLsn(XLogRecPtr lsn, TimestampTz *timestamp)
+{
+	Assert(AmWalReceiverProcess());
+
+	return ReadXLogTimestampForLsn(&XLogCtl->flushTimestamps, lsn,
+								   &XLogCtl->flushTimestamps.read_head,
+								   timestamp);
+}
+
+/*
+ * Called by the startup process after it has replayed up to 'lsn'.  Checks
+ * for timestamps associated with WAL positions that have now been replayed.
+ * If any are found, the latest such timestamp found is written to
+ * '*timestamp'.  Returns the new buffer read head position, which the caller
+ * should write into XLogCtl->timestamps.read_head while holding info_lck.
+ */
+static uint32
+CheckForAppliedTimestampedLsn(XLogRecPtr lsn, TimestampTz *timestamp)
+{
+	uint32 read_head;
+
+	Assert(AmStartupProcess());
+
+	ReadXLogTimestampForLsn(&XLogCtl->applyTimestamps, lsn, &read_head,
+							timestamp);
+
+	return read_head;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
@@ -6824,6 +6944,8 @@ StartupXLOG(void)
 			do
 			{
 				bool		switchedTLI = false;
+				TimestampTz	replayed_timestamp = 0;
+				uint32		timestamp_read_head;
 
 #ifdef WAL_DEBUG
 				if (XLOG_DEBUG ||
@@ -6977,24 +7099,35 @@ StartupXLOG(void)
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
+				/* Check if we have replayed a timestamped WAL position */
+				timestamp_read_head =
+					CheckForAppliedTimestampedLsn(EndRecPtr,
+												  &replayed_timestamp);
+
 				/*
-				 * Update lastReplayedEndRecPtr after this record has been
-				 * successfully replayed.
+				 * Update lastReplayedEndRecPtr and lastReplayedTimestamp
+				 * after this record has been successfully replayed.
 				 */
 				SpinLockAcquire(&XLogCtl->info_lck);
 				XLogCtl->lastReplayedEndRecPtr = EndRecPtr;
 				XLogCtl->lastReplayedTLI = ThisTimeLineID;
+				XLogCtl->applyTimestamps.read_head = timestamp_read_head;
+				if (replayed_timestamp != 0)
+					XLogCtl->lastReplayedTimestamp = replayed_timestamp;
 				SpinLockRelease(&XLogCtl->info_lck);
 
 				/*
 				 * If rm_redo called XLogRequestWalReceiverReply, then we wake
 				 * up the receiver so that it notices the updated
-				 * lastReplayedEndRecPtr and sends a reply to the master.
+				 * lastReplayedEndRecPtr and sends a reply to the master.  We
+				 * also wake it if we have replayed a WAL position that has
+				 * an associated timestamp so that the upstream server can
+				 * measure our replay lag.
 				 */
-				if (doRequestWalReceiverReply)
+				if (doRequestWalReceiverReply || replayed_timestamp != 0)
 				{
 					doRequestWalReceiverReply = false;
-					WalRcvForceReply();
+					WalRcvForceReply(replayed_timestamp != 0);
 				}
 
 				/* Remember this record as the last-applied one */
@@ -11809,3 +11942,89 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Store an (lsn, timestamp) sample in a timestamp buffer.
+ */
+static void
+StoreXLogTimestampAtLsn(XLogTimestampBuffer *buffer,
+							TimestampTz timestamp, XLogRecPtr lsn)
+{
+
+	uint32 write_head = buffer->write_head;
+	uint32 new_write_head = (write_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+
+	Assert(AmWalReceiverProcess());
+
+	if (new_write_head == buffer->read_head)
+	{
+		/*
+		 * The buffer is full, so we'll rewind and overwrite the most
+		 * recent sample.  Overwriting the most recent sample means that
+		 * if we're not writing/flushing/replaying fast enough and the buffer
+		 * fills up, we'll effectively lower the sampling rate.
+		 */
+		new_write_head = write_head;
+		write_head = (write_head - 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+	}
+
+	buffer->buffer[write_head].lsn = lsn;
+	buffer->buffer[write_head].timestamp = timestamp;
+	buffer->write_head = new_write_head;
+}
+
+/*
+ * Record the timestamp that is associated with a WAL position.
+ *
+ * This is called by walreceiver on standby servers when new messages arrive,
+ * using a timestamp and the latest known WAL position from the upstream
+ * server.  The timestamp will be sent back to the upstream server via
+ * walreceiver when the WAL position is eventually written, flushed and
+ * applied.
+ */
+void
+SetXLogTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn)
+{
+	Assert(AmWalReceiverProcess());
+
+	/*
+	 * For the write case we don't need the spinlock because walreceiver
+	 * is both writer and reader.  Currently that is true also of the flush
+	 * case, but in future if that job is given to the WAL writer it should
+	 * be protected by the spinlock below.
+	 */
+	StoreXLogTimestampAtLsn(&XLogCtl->writeTimestamps, timestamp, lsn);
+	StoreXLogTimestampAtLsn(&XLogCtl->flushTimestamps, timestamp, lsn);
+
+	/*
+	 * For the apply cases, the spinlock is needed because the startup process
+	 * is the reader.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	StoreXLogTimestampAtLsn(&XLogCtl->applyTimestamps, timestamp, lsn);
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Get the timestamp for the most recently applied WAL record that carried a
+ * timestamp from the upstream server, and also the most recently applied LSN.
+ * (Note that the timestamp and the LSN don't necessarily relate to the same
+ * record.)
+ *
+ * This is similar to GetLatestXTime, except that it is advanced when WAL
+ * positions recorded with SetXLogTimestampAtLsn have been applied,
+ * rather than commit records.
+ */
+TimestampTz
+GetXLogReplayTimestamp(XLogRecPtr *lsn)
+{
+	TimestampTz result;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (lsn)
+		*lsn = XLogCtl->lastReplayedEndRecPtr;
+	result = XLogCtl->lastReplayedTimestamp;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return result;
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 649cef8..2fd63e3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -685,6 +685,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..8fd7e23 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -73,6 +73,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			replication_lag_sample_interval;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
@@ -107,6 +108,10 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }	LogstreamResult;
 
+/* Latest timestamps for replication lag tracking. */
+static TimestampTz last_write_timestamp;
+static TimestampTz last_flush_timestamp;
+
 static StringInfoData reply_message;
 static StringInfoData incoming_message;
 
@@ -138,7 +143,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int timestamps);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 
@@ -148,6 +153,16 @@ static void WalRcvSigUsr1Handler(SIGNAL_ARGS);
 static void WalRcvShutdownHandler(SIGNAL_ARGS);
 static void WalRcvQuickDieHandler(SIGNAL_ARGS);
 
+/*
+ * Which timestamps to include in a reply message.
+ */
+typedef enum XLogReplyTimestamp
+{
+	REPLY_WRITE_TIMESTAMP = 1,
+	REPLY_FLUSH_TIMESTAMP = 2,
+	REPLY_APPLY_TIMESTAMP = 4
+} XLogReplyTimestamp;
+
 
 static void
 ProcessWalRcvInterrupts(void)
@@ -424,6 +439,8 @@ WalReceiverMain(void)
 				len = walrcv_receive(wrconn, &buf, &wait_fd);
 				if (len != 0)
 				{
+					int timestamp = 0;
+
 					/*
 					 * Process the received data, and any subsequent data we
 					 * can read without blocking.
@@ -455,8 +472,17 @@ WalReceiverMain(void)
 						len = walrcv_receive(wrconn, &buf, &wait_fd);
 					}
 
+					/*
+					 * Check if we have written an LSN location for which we
+					 * have a timestamp from the upstream server, for
+					 * replication lag tracking.
+					 */
+					if (CheckForWrittenTimestampedLsn(LogstreamResult.Write,
+													  &last_write_timestamp))
+						timestamp = REPLY_WRITE_TIMESTAMP;
+
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, timestamp);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -493,15 +519,20 @@ WalReceiverMain(void)
 					ResetLatch(walrcv->latch);
 					if (walrcv->force_reply)
 					{
+						int timestamps = 0;
+
 						/*
 						 * The recovery process has asked us to send apply
 						 * feedback now.  Make sure the flag is really set to
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
 						 */
+						if (walrcv->force_reply_apply_timestamp)
+							timestamps = REPLY_APPLY_TIMESTAMP;
 						walrcv->force_reply = false;
+						walrcv->force_reply_apply_timestamp = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, timestamps);
 					}
 				}
 				if (rc & WL_POSTMASTER_DEATH)
@@ -559,7 +590,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, 0);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -911,7 +942,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, 0);
 				break;
 			}
 		default:
@@ -1074,7 +1105,18 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			/*
+			 * Check if we have just flushed a position for which we have a
+			 * timestamp from the upstream server, for replication lag
+			 * tracking.
+			 */
+			int timestamp = 0;
+
+			if (CheckForFlushedTimestampedLsn(LogstreamResult.Flush,
+											  &last_flush_timestamp))
+				timestamp = REPLY_FLUSH_TIMESTAMP;
+
+			XLogWalRcvSendReply(false, false, timestamp);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1092,21 +1134,27 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
 * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
+ *
+ * The bitmap 'timestamps' specifies which timestamps should be included, for
+ * replication lag tracking purposes.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int timestamps)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
 	XLogRecPtr	applyPtr;
 	static TimestampTz sendTime = 0;
 	TimestampTz now;
+	TimestampTz writeTimestamp = 0;
+	TimestampTz flushTimestamp = 0;
+	TimestampTz applyTimestamp = 0;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
 	 * to exit before doing anything at all.
 	 */
-	if (!force && wal_receiver_status_interval <= 0)
+	if (!force && timestamps == 0 && wal_receiver_status_interval <= 0)
 		return;
 
 	/* Get current timestamp. */
@@ -1132,7 +1180,41 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyTimestamp = GetXLogReplayTimestamp(&applyPtr);
+	flushTimestamp = last_flush_timestamp;
+	writeTimestamp = last_write_timestamp;
+
+	/* Decide whether to send timestamps for replay lag estimation. */
+	if (replication_lag_sample_interval != -1)
+	{
+		static TimestampTz lastApplyTimestampSendTime = 0;
+
+		/*
+		 * Only send an apply timestamp if we were explicitly asked to by the
+		 * recovery process or if replay lag sampling is active but the
+		 * recovery process seems to be stuck.
+		 *
+		 * If we haven't heard from the recovery process in a time exceeding
+		 * wal_receiver_status_interval and yet it has not applied the highest
+		 * LSN we've heard about, then we want to resend the last replayed
+		 * timestamp we have; otherwise we zero it out and wait for the
+		 * recovery process to wake us when it has set a new accurate replay
+		 * timestamp.  Note that we can read latestWalEnd without acquiring the
+		 * mutex that protects it because it is only written to by this
+		 * process (walreceiver).
+		 */
+		if (((timestamps & REPLY_APPLY_TIMESTAMP) != 0) ||
+			(WalRcv->latestWalEnd > applyPtr &&
+			 TimestampDifferenceExceeds(lastApplyTimestampSendTime, now,
+										wal_receiver_status_interval * 1000)))
+			lastApplyTimestampSendTime = now;
+		else
+			applyTimestamp = 0;
+		if ((timestamps & REPLY_FLUSH_TIMESTAMP) == 0)
+			flushTimestamp = 0;
+		if ((timestamps & REPLY_WRITE_TIMESTAMP) == 0)
+			writeTimestamp = 0;
+	}
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'r');
@@ -1140,6 +1222,9 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, flushPtr);
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+	pq_sendint64(&reply_message, TimestampTzToIntegerTimestamp(writeTimestamp));
+	pq_sendint64(&reply_message, TimestampTzToIntegerTimestamp(flushTimestamp));
+	pq_sendint64(&reply_message, TimestampTzToIntegerTimestamp(applyTimestamp));
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
 
 	/* Send it */
@@ -1244,18 +1329,41 @@ static void
 ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 {
 	WalRcvData *walrcv = WalRcv;
-
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
+	bool newHighWalEnd = false;
+
+	static TimestampTz lastRecordedTimestamp = 0;
 
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
+	{
 		walrcv->latestWalEndTime = sendTime;
+		newHighWalEnd = true;
+	}
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
+	/*
+	 * If replication lag sampling is active, remember the upstream server's
+	 * timestamp at the latest WAL end that it has, unless we've already
+	 * done that too recently or the LSN hasn't advanced.  We'll feed this
+	 * timestamp back once we have written and then flushed this LSN.  It will
+	 * also be fed back to us by the startup process when it eventually
+	 * replays this LSN, so that we can feed it back to the upstream server
+	 * for replay lag tracking purposes.
+	 */
+	if (replication_lag_sample_interval != -1 &&
+		newHighWalEnd &&
+		sendTime > TimestampTzPlusMilliseconds(lastRecordedTimestamp,
+											   replication_lag_sample_interval))
+	{
+		SetXLogTimestampAtLsn(sendTime, walEnd);
+		lastRecordedTimestamp = sendTime;
+	}
+
 	if (log_min_messages <= DEBUG2)
 	{
 		char	   *sendtime;
@@ -1291,12 +1399,14 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply.  Also used to send periodic messages
+ * which are used to compute pg_stat_replication.replay_lag.
  */
 void
-WalRcvForceReply(void)
+WalRcvForceReply(bool apply_timestamp)
 {
 	WalRcv->force_reply = true;
+	WalRcv->force_reply_apply_timestamp = apply_timestamp;
 	if (WalRcv->latch)
 		SetLatch(WalRcv->latch);
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5cdb8a0..3fbca0c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1545,6 +1545,25 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 }
 
 /*
+ * Compute the difference between 'now' and 'timestamp' in microseconds.
+ * Return -1 if timestamp is zero.
+ */
+static int64
+compute_lag(TimestampTz now, TimestampTz timestamp)
+{
+	if (timestamp == 0)
+		return -1;
+	else
+	{
+#ifdef HAVE_INT64_TIMESTAMP
+		return now - timestamp;
+#else
+		return (now - timestamp) * 1000000;
+#endif
+	}
+}
+
+/*
  * Regular reply from standby advising of WAL positions on standby server.
  */
 static void
@@ -1553,15 +1572,30 @@ ProcessStandbyReplyMessage(void)
 	XLogRecPtr	writePtr,
 				flushPtr,
 				applyPtr;
+	int64		writeLagUs,
+				flushLagUs,
+				applyLagUs;
+	TimestampTz writeTimestamp,
+				flushTimestamp,
+				applyTimestamp;
 	bool		replyRequested;
+	TimestampTz now = GetCurrentTimestamp();
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
 	flushPtr = pq_getmsgint64(&reply_message);
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
+	writeTimestamp = IntegerTimestampToTimestampTz(pq_getmsgint64(&reply_message));
+	flushTimestamp = IntegerTimestampToTimestampTz(pq_getmsgint64(&reply_message));
+	applyTimestamp = IntegerTimestampToTimestampTz(pq_getmsgint64(&reply_message));
 	replyRequested = pq_getmsgbyte(&reply_message);
 
+	/* Compute the replication lag. */
+	writeLagUs = compute_lag(now, writeTimestamp);
+	flushLagUs = compute_lag(now, flushTimestamp);
+	applyLagUs = compute_lag(now, applyTimestamp);
+
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
 		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
@@ -1583,6 +1617,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLagUs >= 0)
+			walsnd->writeLagUs = writeLagUs;
+		if (flushLagUs >= 0)
+			walsnd->flushLagUs = flushLagUs;
+		if (applyLagUs >= 0)
+			walsnd->applyLagUs = applyLagUs;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1979,6 +2019,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLagUs = -1;
+			walsnd->flushLagUs = -1;
+			walsnd->applyLagUs = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2753,6 +2796,21 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+lag_as_interval(int64 lag_us)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+#ifdef HAVE_INT64_TIMESTAMP
+	result->time = lag_us;
+#else
+	result->time = lag_us / 1000000.0;
+#endif
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2761,7 +2819,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2809,6 +2867,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		int64		writeLagUs;
+		int64		flushLagUs;
+		int64		applyLagUs;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2823,6 +2884,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLagUs = walsnd->writeLagUs;
+		flushLagUs = walsnd->flushLagUs;
+		applyLagUs = walsnd->applyLagUs;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2857,6 +2921,21 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[5] = true;
 			values[5] = LSNGetDatum(apply);
 
+			if (writeLagUs < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(lag_as_interval(writeLagUs));
+
+			if (flushLagUs < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(lag_as_interval(flushLagUs));
+
+			if (applyLagUs < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(lag_as_interval(applyLagUs));
+
 			/*
 			 * Treat a standby such as a pg_basebackup background process
 			 * which always returns an invalid flush location, as an
@@ -2864,7 +2943,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2878,12 +2957,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 545e9e0..90c608d 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1777,6 +1777,20 @@ GetSQLLocalTimestamp(int32 typmod)
 }
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native timestamp to int64 format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * 1000000;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 946ba9e..1adb598 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1800,6 +1800,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"replication_lag_sample_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+			gettext_noop("Sets the minimum time between WAL timestamp samples used to estimate replication lag."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&replication_lag_sample_interval,
+		1 * 1000, -1, INT_MAX / 1000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the maximum wait time to receive data from the primary."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee8232f..f703e25 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -271,6 +271,8 @@
 					# in milliseconds; 0 disables
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
+#replication_lag_sample_interval = 1s	# min time between timestamps recorded
+					# to estimate lag; -1 disables lag sampling
 
 
 #------------------------------------------------------------------------------
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index cb5f989..6feb95d 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -111,7 +111,7 @@ sendFeedback(PGconn *conn, int64 now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 1];
 	int			len = 0;
 
 	/*
@@ -142,6 +142,12 @@ sendFeedback(PGconn *conn, int64 now, bool force, bool replyRequested)
 	len += 8;
 	fe_sendint64(now, &replybuf[len]);	/* sendTime */
 	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* writeTimestamp */
+	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* flushTimestamp */
+	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* applyTimestamp */
+	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0;		/* replyRequested */
 	len += 1;
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 568ff17..960e02f 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -321,7 +321,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 1];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -337,6 +337,12 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
 	len += 8;
 	fe_sendint64(now, &replybuf[len]);	/* sendTime */
 	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* writeTimestamp */
+	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* flushTimestamp */
+	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* applyTimestamp */
+	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0;		/* replyRequested */
 	len += 1;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 7d21408..ee11cf5 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -246,6 +246,12 @@ extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
 extern XLogRecPtr GetXLogWriteRecPtr(void);
+extern void SetXLogTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn);
+extern bool CheckForWrittenTimestampedLsn(XLogRecPtr lsn,
+										  TimestampTz *timestamp);
+extern bool CheckForFlushedTimestampedLsn(XLogRecPtr lsn,
+										  TimestampTz *timestamp);
+extern TimestampTz GetXLogReplayTimestamp(XLogRecPtr *lsn);
 extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index cd7b909..c6dd6b5 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2768,7 +2768,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 28dc1fc..41b248f 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -23,6 +23,7 @@
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int	replication_lag_sample_interval;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -119,6 +120,9 @@ typedef struct
 	 */
 	bool		force_reply;
 
+	/* include the latest replayed timestamp when replying? */
+	bool		force_reply_apply_timestamp;
+
 	/* set true once conninfo is ready to display (obfuscated pwds etc) */
 	bool		ready_to_display;
 
@@ -208,6 +212,6 @@ extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
-extern void WalRcvForceReply(void);
+extern void WalRcvForceReply(bool sendApplyTimestamp);
 
 #endif   /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7794aa5..fb3a03f 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -46,6 +46,9 @@ typedef struct WalSnd
 	XLogRecPtr	write;
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
+	int64		writeLagUs;
+	int64		flushLagUs;
+	int64		applyLagUs;
 
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 93b90fe..20517c9 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -233,9 +233,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e9cfadb..14147c5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1809,10 +1809,13 @@ pg_stat_replication| SELECT s.pid,
     w.write_location,
     w.flush_location,
     w.replay_location,
+    w.write_lag,
+    w.flush_lag,
+    w.replay_lag,
     w.sync_priority,
     w.sync_state
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
#11Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#10)
1 attachment(s)
Re: Measuring replay lag

On Thu, Dec 29, 2016 at 1:28 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 22, 2016 at 10:14 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.

Good idea. I will think about how to make that work.

Here is an experimental version that reports the write, flush and
apply lag separately as requested. This is done with three separate
(lsn, timestamp) buffers on the standby side. The GUC is now called
replication_lag_sample_interval. Not tested much yet.

Here is a new version that is slightly refactored and fixes a problem
with stale samples after periods of idleness.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replay-lag-v16.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8d7b3bf..b894e31 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3310,6 +3310,26 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-replication-lag-sample-interval" xreflabel="replication_lag_sample_interval">
+      <term><varname>replication_lag_sample_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>replication_lag_sample_interval</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls how often a standby should sample timestamps from upstream to
+        send back to the primary or upstream standby after writing, flushing
+        and replaying WAL.  The default is 1 second.  Units are milliseconds if
+        not specified.  A value of -1 disables the reporting of replication
+        lag.  Estimated lag can be seen in the <link linkend="monitoring-stats-views-table">
+        <literal>pg_stat_replication</></link> view of the upstream server.
+        This parameter can only be set
+        in the <filename>postgresql.conf</> file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby_feedback">
       <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1545f03..a422ac0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1405,6 +1405,24 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be written on this
+      standby server</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be flushed on this
+      standby server</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be replayed on this
+      standby server</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f8ffa5c..7e7312f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -82,6 +82,8 @@ extern uint32 bootstrap_data_checksum_version;
 #define PROMOTE_SIGNAL_FILE		"promote"
 #define FALLBACK_PROMOTE_SIGNAL_FILE "fallback_promote"
 
+/* Size of the circular buffer of timestamped LSNs. */
+#define XLOG_TIMESTAMP_BUFFER_SIZE 8192
 
 /* User-settable parameters */
 int			max_wal_size = 64;	/* 1 GB */
@@ -530,6 +532,26 @@ typedef struct XLogCtlInsert
 } XLogCtlInsert;
 
 /*
+ * A sample associating a timestamp with a given xlog position.
+ */
+typedef struct XLogTimestamp
+{
+	TimestampTz	timestamp;
+	XLogRecPtr	lsn;
+} XLogTimestamp;
+
+/*
+ * A circular buffer of LSNs and associated timestamps.  The buffer is empty
+ * when read_head == write_head.
+ */
+typedef struct XLogTimestampBuffer
+{
+	uint32			read_head;
+	uint32			write_head;
+	XLogTimestamp	buffer[XLOG_TIMESTAMP_BUFFER_SIZE];
+} XLogTimestampBuffer;
+
+/*
  * Total shared-memory state for XLOG.
  */
 typedef struct XLogCtlData
@@ -648,6 +670,14 @@ typedef struct XLogCtlData
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 
+	/* timestamp of the most recently applied WAL position that carried a timestamp. */
+	TimestampTz lastReplayedTimestamp;
+
+	/* buffers of timestamps for WAL that is not yet written/flushed/applied. */
+	XLogTimestampBuffer writeTimestamps;
+	XLogTimestampBuffer flushTimestamps;
+	XLogTimestampBuffer applyTimestamps;
+
 	/*
 	 * timestamp of when we started replaying the current chunk of WAL data,
 	 * only relevant for replication or archive recovery
@@ -6006,6 +6036,96 @@ CheckRequiredParameterValues(void)
 }
 
 /*
+ * Read and consume all records from 'buffer' whose position is <= 'lsn'.
+ * Return true if any such records are found, and write the latest timestamp
+ * found into *timestamp.  Write the new read head position into *read_head,
+ * so that the caller can store it with appropriate locking.
+ */
+static bool
+ReadXLogTimestampForLsn(XLogTimestampBuffer *buffer,
+						XLogRecPtr lsn,
+						uint32 *read_head,
+						TimestampTz *timestamp)
+{
+	bool found = false;
+
+	/*
+	 * It's OK to access buffer->read_head without any kind of synchronization
+	 * because in all cases the caller is the only process reading from the
+	 * buffer (i.e. the only one advancing buffer->read_head).
+	 */
+	*read_head = buffer->read_head;
+
+	/*
+	 * It's OK to access write_head without interlocking because it's an
+	 * aligned 32 bit value which we can read atomically on all supported
+	 * platforms to get some recent value, not a torn/garbage value.
+	 * Furthermore we must see a value that is at least as recent as any WAL
+	 * that we have written/flushed/replayed, because walreceiver calls
+	 * SetXLogTimestampAtLsn before writing.
+	 */
+	while (*read_head != buffer->write_head &&
+		   buffer->buffer[*read_head].lsn <= lsn)
+	{
+		found = true;
+		*timestamp = buffer->buffer[*read_head].timestamp;
+		*read_head = (*read_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+	}
+
+	return found;
+}
+
+/*
+ * Called by the WAL receiver process after it has written up to 'lsn'.
+ * Return true if it has written any LSN location that had an associated
+ * timestamp, and write the timestamp to '*timestamp'.
+ */
+bool
+CheckForWrittenTimestampedLsn(XLogRecPtr lsn, TimestampTz *timestamp)
+{
+	Assert(AmWalReceiverProcess());
+
+	return ReadXLogTimestampForLsn(&XLogCtl->writeTimestamps, lsn,
+								   &XLogCtl->writeTimestamps.read_head,
+								   timestamp);
+}
+
+/*
+ * Called by the WAL receiver process after it has flushed up to 'lsn'.
+ * Return true if it has flushed any LSN location that had an associated
+ * timestamp, and write the timestamp to '*timestamp'.
+ */
+bool
+CheckForFlushedTimestampedLsn(XLogRecPtr lsn, TimestampTz *timestamp)
+{
+	Assert(AmWalReceiverProcess());
+
+	return ReadXLogTimestampForLsn(&XLogCtl->flushTimestamps, lsn,
+								   &XLogCtl->flushTimestamps.read_head,
+								   timestamp);
+}
+
+/*
+ * Called by the startup process after it has replayed up to 'lsn'.  Checks
+ * for timestamps associated with WAL positions that have now been replayed.
+ * If any are found, the latest such timestamp found is written to
+ * '*timestamp'.  Returns the new buffer read head position, which the caller
+ * should write into XLogCtl->applyTimestamps.read_head while holding info_lck.
+ */
+static uint32
+CheckForAppliedTimestampedLsn(XLogRecPtr lsn, TimestampTz *timestamp)
+{
+	uint32 read_head;
+
+	Assert(AmStartupProcess());
+
+	ReadXLogTimestampForLsn(&XLogCtl->applyTimestamps, lsn, &read_head,
+							timestamp);
+
+	return read_head;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
@@ -6824,6 +6944,8 @@ StartupXLOG(void)
 			do
 			{
 				bool		switchedTLI = false;
+				TimestampTz	replayed_timestamp = 0;
+				uint32		timestamp_read_head;
 
 #ifdef WAL_DEBUG
 				if (XLOG_DEBUG ||
@@ -6977,24 +7099,35 @@ StartupXLOG(void)
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
+				/* Check if we have replayed a timestamped WAL position */
+				timestamp_read_head =
+					CheckForAppliedTimestampedLsn(EndRecPtr,
+												  &replayed_timestamp);
+
 				/*
-				 * Update lastReplayedEndRecPtr after this record has been
-				 * successfully replayed.
+				 * Update lastReplayedEndRecPtr and lastReplayedTimestamp
+				 * after this record has been successfully replayed.
 				 */
 				SpinLockAcquire(&XLogCtl->info_lck);
 				XLogCtl->lastReplayedEndRecPtr = EndRecPtr;
 				XLogCtl->lastReplayedTLI = ThisTimeLineID;
+				XLogCtl->applyTimestamps.read_head = timestamp_read_head;
+				if (replayed_timestamp != 0)
+					XLogCtl->lastReplayedTimestamp = replayed_timestamp;
 				SpinLockRelease(&XLogCtl->info_lck);
 
 				/*
 				 * If rm_redo called XLogRequestWalReceiverReply, then we wake
 				 * up the receiver so that it notices the updated
-				 * lastReplayedEndRecPtr and sends a reply to the master.
+				 * lastReplayedEndRecPtr and sends a reply to the master.  We
+				 * also wake it if we have replayed a WAL position that has
+				 * an associated timestamp so that the upstream server can
+				 * measure our replay lag.
 				 */
-				if (doRequestWalReceiverReply)
+				if (doRequestWalReceiverReply || replayed_timestamp != 0)
 				{
 					doRequestWalReceiverReply = false;
-					WalRcvForceReply();
+					WalRcvForceReply(replayed_timestamp != 0);
 				}
 
 				/* Remember this record as the last-applied one */
@@ -11809,3 +11942,106 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Store an (lsn, timestamp) sample in a timestamp buffer.
+ */
+static void
+StoreXLogTimestampAtLsn(XLogTimestampBuffer *buffer,
+							TimestampTz timestamp, XLogRecPtr lsn)
+{
+	uint32 write_head = buffer->write_head;
+	uint32 new_write_head = (write_head + 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+
+	Assert(AmWalReceiverProcess());
+
+	if (new_write_head == buffer->read_head)
+	{
+		/*
+		 * The buffer is full, so we'll rewind and overwrite the most
+		 * recent sample.  Overwriting the most recent sample means that
+		 * if we're not writing/flushing/replaying fast enough and the buffer
+		 * fills up, we'll effectively lower the sampling rate.  The unsigned
+		 * wraparound in the decrement below is safe because
+		 * XLOG_TIMESTAMP_BUFFER_SIZE is a power of two.
+		 */
+		new_write_head = write_head;
+		write_head = (write_head - 1) % XLOG_TIMESTAMP_BUFFER_SIZE;
+	}
+
+	buffer->buffer[write_head].lsn = lsn;
+	buffer->buffer[write_head].timestamp = timestamp;
+	buffer->write_head = new_write_head;
+}
+
+/*
+ * Record the timestamp that is associated with a WAL position.
+ *
+ * This is called by walreceiver on standby servers when new messages arrive,
+ * using a timestamp and the latest known WAL position from the upstream
+ * server.  The timestamp will be sent back to the upstream server via
+ * walreceiver when the WAL position is eventually written, flushed and
+ * applied.
+ */
+void
+SetXLogTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn)
+{
+	bool applied_end = false;
+	static TimestampTz last_timestamp;
+	static XLogRecPtr last_lsn;
+
+	Assert(AmWalReceiverProcess());
+	Assert(replication_lag_sample_interval >= 0);
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+
+	/*
+	 * Check if we're fully applied, so we can avoid recording samples in that
+	 * case.  There is effectively no replay lag, and we don't want to report
+	 * bogus lag after a period of idleness.
+	 */
+	if (lsn == XLogCtl->lastReplayedEndRecPtr)
+		applied_end = true;
+
+	/*
+	 * Record this timestamp/LSN pair, if the LSN has moved since last time
+	 * and we haven't recorded a sample too recently.
+	 */
+	if (!applied_end &&
+		lsn > last_lsn &&
+		timestamp > TimestampTzPlusMilliseconds(last_timestamp,
+											replication_lag_sample_interval))
+	{
+		StoreXLogTimestampAtLsn(&XLogCtl->applyTimestamps, timestamp, lsn);
+		StoreXLogTimestampAtLsn(&XLogCtl->writeTimestamps, timestamp, lsn);
+		StoreXLogTimestampAtLsn(&XLogCtl->flushTimestamps, timestamp, lsn);
+
+		last_timestamp = timestamp;
+		last_lsn = lsn;
+	}
+
+	SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * Get the timestamp for the most recently applied WAL record that carried a
+ * timestamp from the upstream server, and also the most recently applied LSN.
+ * (Note that the timestamp and the LSN don't necessarily relate to the same
+ * record.)
+ *
+ * This is similar to GetLatestXTime, except that it is advanced when WAL
+ * positions recorded with SetXLogTimestampAtLsn have been applied, rather
+ * than at commit records.
+ */
+TimestampTz
+GetXLogReplayTimestamp(XLogRecPtr *lsn)
+{
+	TimestampTz result;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (lsn)
+		*lsn = XLogCtl->lastReplayedEndRecPtr;
+	result = XLogCtl->lastReplayedTimestamp;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return result;
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 649cef8..2fd63e3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -685,6 +685,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..621aa24 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -73,6 +73,7 @@
 int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			replication_lag_sample_interval;
 
 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
@@ -107,6 +108,10 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }	LogstreamResult;
 
+/* Latest timestamps for replication lag tracking. */
+static TimestampTz last_write_timestamp;
+static TimestampTz last_flush_timestamp;
+
 static StringInfoData reply_message;
 static StringInfoData incoming_message;
 
@@ -138,7 +143,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int timestamps);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 
@@ -148,6 +153,16 @@ static void WalRcvSigUsr1Handler(SIGNAL_ARGS);
 static void WalRcvShutdownHandler(SIGNAL_ARGS);
 static void WalRcvQuickDieHandler(SIGNAL_ARGS);
 
+/*
+ * Which timestamps to include in a reply message.
+ */
+typedef enum XLogReplyTimestamp
+{
+	REPLY_WRITE_TIMESTAMP = 1,
+	REPLY_FLUSH_TIMESTAMP = 2,
+	REPLY_APPLY_TIMESTAMP = 4
+} XLogReplyTimestamp;
+
 
 static void
 ProcessWalRcvInterrupts(void)
@@ -424,6 +439,8 @@ WalReceiverMain(void)
 				len = walrcv_receive(wrconn, &buf, &wait_fd);
 				if (len != 0)
 				{
+					int timestamp = 0;
+
 					/*
 					 * Process the received data, and any subsequent data we
 					 * can read without blocking.
@@ -455,8 +472,17 @@ WalReceiverMain(void)
 						len = walrcv_receive(wrconn, &buf, &wait_fd);
 					}
 
+					/*
+					 * Check if we have written an LSN location for which we
+					 * have a timestamp from the upstream server, for
+					 * replication lag tracking.
+					 */
+					if (CheckForWrittenTimestampedLsn(LogstreamResult.Write,
+													  &last_write_timestamp))
+						timestamp = REPLY_WRITE_TIMESTAMP;
+
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, timestamp);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -493,15 +519,20 @@ WalReceiverMain(void)
 					ResetLatch(walrcv->latch);
 					if (walrcv->force_reply)
 					{
+						int timestamps = 0;
+
 						/*
 						 * The recovery process has asked us to send apply
 						 * feedback now.  Make sure the flag is really set to
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
 						 */
+						if (walrcv->force_reply_apply_timestamp)
+							timestamps = REPLY_APPLY_TIMESTAMP;
 						walrcv->force_reply = false;
+						walrcv->force_reply_apply_timestamp = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, timestamps);
 					}
 				}
 				if (rc & WL_POSTMASTER_DEATH)
@@ -559,7 +590,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, 0);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -911,7 +942,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, 0);
 				break;
 			}
 		default:
@@ -1074,7 +1105,18 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			/*
+			 * Check if we have just flushed a position for which we have a
+			 * timestamp from the upstream server, for replication lag
+			 * tracking.
+			 */
+			int timestamp = 0;
+
+			if (CheckForFlushedTimestampedLsn(LogstreamResult.Flush,
+											  &last_flush_timestamp))
+				timestamp = REPLY_FLUSH_TIMESTAMP;
+
+			XLogWalRcvSendReply(false, false, timestamp);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1092,21 +1134,27 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
+ *
+ * The bitmap 'timestamps' specifies which timestamps should be included, for
+ * replication lag tracking purposes.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int timestamps)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
 	XLogRecPtr	applyPtr;
 	static TimestampTz sendTime = 0;
 	TimestampTz now;
+	TimestampTz writeTimestamp = 0;
+	TimestampTz flushTimestamp = 0;
+	TimestampTz applyTimestamp = 0;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
 	 * to exit before doing anything at all.
 	 */
-	if (!force && wal_receiver_status_interval <= 0)
+	if (!force && timestamps == 0 && wal_receiver_status_interval <= 0)
 		return;
 
 	/* Get current timestamp. */
@@ -1132,7 +1180,41 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyTimestamp = GetXLogReplayTimestamp(&applyPtr);
+	flushTimestamp = last_flush_timestamp;
+	writeTimestamp = last_write_timestamp;
+
+	/* Decide whether to send timestamps for replay lag estimation. */
+	if (replication_lag_sample_interval != -1)
+	{
+		static TimestampTz lastApplyTimestampSendTime = 0;
+
+		/*
+		 * Only send an apply timestamp if we were explicitly asked to by the
+		 * recovery process or if replay lag sampling is active but the
+		 * recovery process seems to be stuck.
+		 *
+		 * If we haven't heard from the recovery process in a time exceeding
+		 * wal_receiver_status_interval and yet it has not applied the highest
+		 * LSN we've heard about, then we want to resend the last replayed
+		 * timestamp we have; otherwise we zero it out and wait for the
+		 * recovery process to wake us when it has set a new accurate replay
+		 * timestamp.  Note that we can read latestWalEnd without acquiring the
+		 * mutex that protects it because it is only written to by this
+		 * process (walreceiver).
+		 */
+		if (((timestamps & REPLY_APPLY_TIMESTAMP) != 0) ||
+			(WalRcv->latestWalEnd > applyPtr &&
+			 TimestampDifferenceExceeds(lastApplyTimestampSendTime, now,
+										wal_receiver_status_interval * 1000)))
+			lastApplyTimestampSendTime = now;
+		else
+			applyTimestamp = 0;
+		if ((timestamps & REPLY_FLUSH_TIMESTAMP) == 0)
+			flushTimestamp = 0;
+		if ((timestamps & REPLY_WRITE_TIMESTAMP) == 0)
+			writeTimestamp = 0;
+	}
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, 'r');
@@ -1140,6 +1222,9 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, flushPtr);
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());
+	pq_sendint64(&reply_message, TimestampTzToIntegerTimestamp(writeTimestamp));
+	pq_sendint64(&reply_message, TimestampTzToIntegerTimestamp(flushTimestamp));
+	pq_sendint64(&reply_message, TimestampTzToIntegerTimestamp(applyTimestamp));
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
 
 	/* Send it */
@@ -1244,7 +1329,6 @@ static void
 ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 {
 	WalRcvData *walrcv = WalRcv;
-
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
 
 	/* Update shared-memory status */
@@ -1256,6 +1340,16 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
+	/*
+	 * If replication lag sampling is active, remember the upstream server's
+	 * timestamp at the latest WAL end that it has.  We'll be able to retrieve
+	 * this timestamp once we have written, flushed and finally applied this
+	 * LSN, so that we can report it to the upstream server for lag tracking
+	 * purposes.
+	 */
+	if (replication_lag_sample_interval != -1)
+		SetXLogTimestampAtLsn(sendTime, walEnd);
+
 	if (log_min_messages <= DEBUG2)
 	{
 		char	   *sendtime;
@@ -1291,12 +1385,14 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply.  Also used to send periodic messages
+ * which are used to compute pg_stat_replication.replay_lag.
  */
 void
-WalRcvForceReply(void)
+WalRcvForceReply(bool apply_timestamp)
 {
 	WalRcv->force_reply = true;
+	WalRcv->force_reply_apply_timestamp = apply_timestamp;
 	if (WalRcv->latch)
 		SetLatch(WalRcv->latch);
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5cdb8a0..3fbca0c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1545,6 +1545,25 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 }
 
 /*
+ * Compute the difference between 'timestamp' and 'now' in microseconds.
+ * Return -1 if timestamp is zero.
+ */
+static int64
+compute_lag(TimestampTz now, TimestampTz timestamp)
+{
+	if (timestamp == 0)
+		return -1;
+	else
+	{
+#ifdef HAVE_INT64_TIMESTAMP
+		return now - timestamp;
+#else
+		return (now - timestamp) * 1000000;
+#endif
+	}
+}
+
+/*
  * Regular reply from standby advising of WAL positions on standby server.
  */
 static void
@@ -1553,15 +1572,30 @@ ProcessStandbyReplyMessage(void)
 	XLogRecPtr	writePtr,
 				flushPtr,
 				applyPtr;
+	int64		writeLagUs,
+				flushLagUs,
+				applyLagUs;
+	TimestampTz writeTimestamp,
+				flushTimestamp,
+				applyTimestamp;
 	bool		replyRequested;
+	TimestampTz now = GetCurrentTimestamp();
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
 	flushPtr = pq_getmsgint64(&reply_message);
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);		/* sendTime; not used ATM */
+	writeTimestamp = IntegerTimestampToTimestampTz(pq_getmsgint64(&reply_message));
+	flushTimestamp = IntegerTimestampToTimestampTz(pq_getmsgint64(&reply_message));
+	applyTimestamp = IntegerTimestampToTimestampTz(pq_getmsgint64(&reply_message));
 	replyRequested = pq_getmsgbyte(&reply_message);
 
+	/* Compute the replication lag. */
+	writeLagUs = compute_lag(now, writeTimestamp);
+	flushLagUs = compute_lag(now, flushTimestamp);
+	applyLagUs = compute_lag(now, applyTimestamp);
+
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
 		 (uint32) (flushPtr >> 32), (uint32) flushPtr,
@@ -1583,6 +1617,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLagUs >= 0)
+			walsnd->writeLagUs = writeLagUs;
+		if (flushLagUs >= 0)
+			walsnd->flushLagUs = flushLagUs;
+		if (applyLagUs >= 0)
+			walsnd->applyLagUs = applyLagUs;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1979,6 +2019,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLagUs = -1;
+			walsnd->flushLagUs = -1;
+			walsnd->applyLagUs = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2753,6 +2796,21 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+lag_as_interval(uint64 lag_us)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+#ifdef HAVE_INT64_TIMESTAMP
+	result->time = lag_us;
+#else
+	result->time = lag_us / 1000000.0;
+#endif
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2761,7 +2819,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2809,6 +2867,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		int64		writeLagUs;
+		int64		flushLagUs;
+		int64		applyLagUs;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2823,6 +2884,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLagUs = walsnd->writeLagUs;
+		flushLagUs = walsnd->flushLagUs;
+		applyLagUs = walsnd->applyLagUs;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2857,6 +2921,21 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[5] = true;
 			values[5] = LSNGetDatum(apply);
 
+			if (writeLagUs < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(lag_as_interval(writeLagUs));
+
+			if (flushLagUs < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(lag_as_interval(flushLagUs));
+
+			if (applyLagUs < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(lag_as_interval(applyLagUs));
+
 			/*
 			 * Treat a standby such as a pg_basebackup background process
 			 * which always returns an invalid flush location, as an
@@ -2864,7 +2943,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2878,12 +2957,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 545e9e0..90c608d 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1777,6 +1777,20 @@ GetSQLLocalTimestamp(int32 typmod)
 }
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native timestamp to int64 format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * 1000000;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 946ba9e..1adb598 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1800,6 +1800,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"replication_lag_sample_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+			gettext_noop("Sets the minimum time between WAL timestamp samples used to estimate replication lag."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&replication_lag_sample_interval,
+		1 * 1000, -1, INT_MAX / 1000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_receiver_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the maximum wait time to receive data from the primary."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee8232f..f703e25 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -271,6 +271,8 @@
 					# in milliseconds; 0 disables
 #wal_retrieve_retry_interval = 5s	# time to wait before retrying to
 					# retrieve WAL after a failed attempt
+#replication_lag_sample_interval = 1s	# min time between timestamps recorded
+					# to estimate lag; -1 disables lag sampling
 
 
 #------------------------------------------------------------------------------
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index cb5f989..6feb95d 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -111,7 +111,7 @@ sendFeedback(PGconn *conn, int64 now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 1];
 	int			len = 0;
 
 	/*
@@ -142,6 +142,12 @@ sendFeedback(PGconn *conn, int64 now, bool force, bool replyRequested)
 	len += 8;
 	fe_sendint64(now, &replybuf[len]);	/* sendTime */
 	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* writeTimestamp */
+	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* flushTimestamp */
+	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* applyTimestamp */
+	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0;		/* replyRequested */
 	len += 1;
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 568ff17..960e02f 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -321,7 +321,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 1];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -337,6 +337,12 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
 	len += 8;
 	fe_sendint64(now, &replybuf[len]);	/* sendTime */
 	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* writeTimestamp */
+	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* flushTimestamp */
+	len += 8;
+	fe_sendint64(0, &replybuf[len]);	/* applyTimestamp */
+	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0;		/* replyRequested */
 	len += 1;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 7d21408..ee11cf5 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -246,6 +246,12 @@ extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
 extern XLogRecPtr GetXLogWriteRecPtr(void);
+extern void SetXLogTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn);
+extern bool CheckForWrittenTimestampedLsn(XLogRecPtr lsn,
+										  TimestampTz *timestamp);
+extern bool CheckForFlushedTimestampedLsn(XLogRecPtr lsn,
+										  TimestampTz *timestamp);
+extern TimestampTz GetXLogReplayTimestamp(XLogRecPtr *lsn);
 extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index a6cc2eb..80267b4 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2768,7 +2768,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 28dc1fc..41b248f 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -23,6 +23,7 @@
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
 extern bool hot_standby_feedback;
+extern int	replication_lag_sample_interval;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
@@ -119,6 +120,9 @@ typedef struct
 	 */
 	bool		force_reply;
 
+	/* include the latest replayed timestamp when replying? */
+	bool		force_reply_apply_timestamp;
+
 	/* set true once conninfo is ready to display (obfuscated pwds etc) */
 	bool		ready_to_display;
 
@@ -208,6 +212,6 @@ extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
-extern void WalRcvForceReply(void);
+extern void WalRcvForceReply(bool sendApplyTimestamp);
 
 #endif   /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7794aa5..fb3a03f 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -46,6 +46,9 @@ typedef struct WalSnd
 	XLogRecPtr	write;
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
+	int64		writeLagUs;
+	int64		flushLagUs;
+	int64		applyLagUs;
 
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 93b90fe..20517c9 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -233,9 +233,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e9cfadb..14147c5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1809,10 +1809,13 @@ pg_stat_replication| SELECT s.pid,
     w.write_location,
     w.flush_location,
     w.replay_location,
+    w.write_lag,
+    w.flush_lag,
+    w.replay_lag,
     w.sync_priority,
     w.sync_state
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
#12Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#9)
Re: Measuring replay lag

On 21 December 2016 at 21:14, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.

Good idea. I will think about how to make that work. There was a
proposal to make writing and flushing independent[1]. I'd like that
to go in. Then the write_lag and flush_lag could diverge
significantly, and it would be nice to be able to see that effect as
time (though you could already see it with LSN positions).

I think it has a much better chance now that the replies from apply
are OK. Will check in this release, but not now.

For that, what about maintaining the pairs of send-timestamp and LSN on the
*sender side* instead of the receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.

I thought about that too, but I couldn't figure out how to make the
sampling work. If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.

Adding the line delay to this was very specifically excluded by Tom,
so that clock disparity between servers is not included.

If the balance of opinion is in favour of including a measure of
complete roundtrip time then I'm OK with that.

You could fix that by making the standby send a reply *every time* it
applies some WAL (like it does for transactions committing with
synchronous_commit = remote_apply, though that is only for commit
records), but then we'd be generating a lot of recovery->walreceiver
communication and standby->primary network traffic, even for people
who don't otherwise need it. It seems unacceptable.

I don't see why that would be unacceptable. If we do it for
remote_apply, why not also do it for other modes? Whatever the
reasoning was for remote_apply should work for other modes. I should
add it was originally designed to be that way by me, so must have been
changed later.

This seems like a bug to me now that I look harder. The docs for
wal_receiver_status_interval say "Updates are sent each time the
write or flush positions change, or at least as often as specified by
this parameter." But it doesn't do that, as I think it should.

Or you could fix that by setting the XACT_COMPLETION_APPLY_FEEDBACK
bit in the xl_xinfo.xinfo for selected transactions, as a way to ask
the standby to send a reply when that commit record is applied, but
that only works for commit records. One of my goals was to be able to
report lag accurately even between commits (very large data load
transactions etc).

As we said, we do have keepalive records we could use for that.

Or you could fix that by sending a list of 'interesting LSNs' to the
standby, as a way to ask it to send a reply when those LSNs are
applied. Then you'd need a circular buffer of (LSN, time) pairs in
the primary AND a circular buffer of LSNs in the standby to remember
which locations should generate a reply. This doesn't seem to be an
improvement.

That's why I thought that the standby should have the (LSN, time)
buffer: it decides which samples to record in its buffer, using LSN
and time provided by the sending server, and then it can send replies
at exactly the right times. The LSNs don't have to be commit records,
they're just arbitrary points in the WAL stream which we attach
timestamps to. IPC and network overhead is minimised, and accuracy is
maximised.

I'm dubious of keeping standby-side state, but I will review the patch.

As a bonus of this approach, we don't need to add the field into the replay
message that walreceiver can very frequently send back. Which might be
helpful in terms of networking overhead.

For the record, these replies are only sent approximately every
replay_lag_sample_interval (with variation depending on replay speed)
and are only 42 bytes with the new field added.

[1] /messages/by-id/CA+U5nMJifauXvVbx=v3UbYbHO3Jw2rdT4haL6CCooEDM5=4ASQ@mail.gmail.com

We have time to make any changes to allow this to be applied in this release.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#12)
Re: Measuring replay lag

On Wed, Jan 4, 2017 at 1:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 21 December 2016 at 21:14, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I thought about that too, but I couldn't figure out how to make the
sampling work. If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.

Skipping adding the line delay to this was very specifically excluded
by Tom, so that clock disparity between servers is not included.

If the balance of opinion is in favour of including a measure of
complete roundtrip time then I'm OK with that.

I deliberately included the network round trip for two reasons:

1. The three lag numbers tell you how long syncrep would take to
return control at the three levels remote_write, on, remote_apply.
2. The time arithmetic is all done on the primary side using two
observations of its single system clock, avoiding any discussion of
clock differences between servers.

You can always subtract half the ping time from these numbers later if
you really want to (replay_lag - (write_lag / 2) may be a cheap proxy
for a lag time that doesn't include the return network leg, and still
doesn't introduce clock difference error). I am strongly of the
opinion that time measurements made by a single observer are better
data to start from.

You could fix that by making the standby send a reply *every time* it
applies some WAL (like it does for transactions committing with
synchronous_commit = remote_apply, though that is only for commit
records), but then we'd be generating a lot of recovery->walreceiver
communication and standby->primary network traffic, even for people
who don't otherwise need it. It seems unacceptable.

I don't see why that would be unacceptable. If we do it for
remote_apply, why not also do it for other modes? Whatever the
reasoning was for remote_apply should work for other modes. I should
add it was originally designed to be that way by me, so must have been
changed later.

You can achieve that with this patch by setting
replication_lag_sample_interval to 0.

The patch streams (time-right-now, end-of-wal) to the standby in every
outgoing message, and then sees how long it takes for those timestamps
to be fed back to it. The standby feeds them back immediately as soon
as it writes, flushes and applies those WAL positions. I figured it
would be silly if every message from the primary caused the standby
to generate 3 replies just for a monitoring feature,
so I introduced the GUC replication_lag_sample_interval to rate-limit
that. I don't think there's much point in setting it lower than 1s:
how often will you look at pg_stat_replication?

That's why I thought that the standby should have the (LSN, time)
buffer: it decides which samples to record in its buffer, using LSN
and time provided by the sending server, and then it can send replies
at exactly the right times. The LSNs don't have to be commit records,
they're just arbitrary points in the WAL stream which we attach
timestamps to. IPC and network overhead is minimised, and accuracy is
maximised.

I'm dubious of keeping standby-side state, but I will review the patch.

Thanks!

The only standby-side state is the three buffers of (LSN, time) that
haven't been written/flushed/applied yet. I don't see how that can be
avoided, except by inserting extra periodic timestamps into the WAL
itself, which has already been rejected.

--
Thomas Munro
http://www.enterprisedb.com

#14Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#13)
Re: Measuring replay lag

On Wed, Jan 4, 2017 at 12:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

The patch streams (time-right-now, end-of-wal) to the standby in every
outgoing message, and then sees how long it takes for those timestamps
to be fed back to it.

Correction: we already stream (time-right-now, end-of-wal) to the
standby in every outgoing message. The patch introduces a new use of
that information by feeding them back upstream.

--
Thomas Munro
http://www.enterprisedb.com

#15Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#13)
Re: Measuring replay lag

On Wed, Jan 4, 2017 at 12:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

(replay_lag - (write_lag / 2) may be a cheap proxy
for a lag time that doesn't include the return network leg, and still
doesn't introduce clock difference error)

(Upon reflection it's a terrible proxy for that because of the mix of
write/flush work done by the WAL receiver today, but it would improve
dramatically if the WAL writer were doing the flushing. An even better
proxy might involve also tracking receive_lag, which doesn't include
the write() syscall. My real point is that there are ways to work
backwards from the two-way round trip time to get other estimates, but
no good ways to undo the damage that would be done to the data if we
started using two systems' clocks.)

--
Thomas Munro
http://www.enterprisedb.com

#16Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#13)
Re: Measuring replay lag

On 3 January 2017 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

I don't see why that would be unacceptable. If we do it for
remote_apply, why not also do it for other modes? Whatever the
reasoning was for remote_apply should work for other modes. I should
add it was originally designed to be that way by me, so must have been
changed later.

You can achieve that with this patch by setting
replication_lag_sample_interval to 0.

I wonder why you ignore my mention of the bug in the correct mechanism?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#16)
Re: Measuring replay lag

On Wed, Jan 4, 2017 at 8:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 3 January 2017 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

I don't see why that would be unacceptable. If we do it for
remote_apply, why not also do it for other modes? Whatever the
reasoning was for remote_apply should work for other modes. I should
add it was originally designed to be that way by me, so must have been
changed later.

You can achieve that with this patch by setting
replication_lag_sample_interval to 0.

I wonder why you ignore my mention of the bug in the correct mechanism?

I didn't have an opinion on that yet, but looking now I think there is
no bug: I was wrong about the current reply frequency. This comment
above XLogWalRcvSendReply confused me:

* If 'force' is not set, the message is only sent if enough time has
* passed since last status update to reach wal_receiver_status_interval.

Actually it's sent if 'force' is set, enough time has passed, or
either of the write or flush positions has moved. So we're already
sending replies after every write and flush, as you said we should.
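
In code terms, the early-return test in XLogWalRcvSendReply amounts to
roughly this (a paraphrase of the logic for discussion, not a verbatim
quote):

	/* Skip the reply only if nothing moved and the timer hasn't expired. */
	if (!force &&
		writePtr == LogstreamResult.Write &&
		flushPtr == LogstreamResult.Flush &&
		!TimestampDifferenceExceeds(sendTime, now,
									wal_receiver_status_interval * 1000))
		return;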

So perhaps I should get rid of that replication_lag_sample_interval
GUC and send back apply timestamps frequently, as you were saying. It
would add up to a third more replies.

The effective sample rate would still be lowered when the fixed-size
buffers fill up and samples have to be dropped, and that'd be more
likely without that GUC. With the GUC, it doesn't start happening
until lag reaches XLOG_TIMESTAMP_BUFFER_SIZE *
replication_lag_sample_interval, which is ~2 hours with defaults
(8192 samples * 1s/sample = 8192s), whereas without rate limiting you
might only need to fall XLOG_TIMESTAMP_BUFFER_SIZE 'w' messages behind
before we start dropping samples. Maybe that's perfectly OK, I'm not
sure.

--
Thomas Munro
http://www.enterprisedb.com

#18Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#17)
Re: Measuring replay lag

On Thu, Jan 5, 2017 at 12:03 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

So perhaps I should get rid of that replication_lag_sample_interval
GUC and send back apply timestamps frequently, as you were saying. It
would add up to a third more replies.

Oops, of course I meant to say up to 50% more replies...

--
Thomas Munro
http://www.enterprisedb.com

#19Fujii Masao
masao.fujii@gmail.com
In reply to: Thomas Munro (#9)
Re: Measuring replay lag

On Thu, Dec 22, 2016 at 6:14 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.

Good idea. I will think about how to make that work. There was a
proposal to make writing and flushing independent[1]. I'd like that
to go in. Then the write_lag and flush_lag could diverge
significantly, and it would be nice to be able to see that effect as
time (though you could already see it with LSN positions).

For that, what about maintaining the pairs of send-timestamp and LSN on the
*sender side* instead of the receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.

I thought about that too, but I couldn't figure out how to make the
sampling work. If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.

Yeah, even though the primary stores (100, 2017-01-17 00:00:00) as the pair of
(LSN, timestamp), for example, the standby may not send back a reply for
LSN 100 itself. The primary may receive the reply for a larger LSN, like 200,
instead. So the measurement of the lag on the primary side would not be so
accurate.

But we can calculate the "sync rep" lag by comparing the stored timestamp of
LSN 100 with the timestamp at which the reply for LSN 200 is received. In sync
rep, since a transaction waiting for LSN 100 to be replicated is actually
released after the reply for LSN 200 is received, the lag calculated this way
is basically accurate as the sync rep lag.
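
(To illustrate with made-up numbers: suppose the primary stored the pair
(LSN 100, 00:00:00.000), and the first reply at or beyond LSN 100 arrives
at 00:00:00.250 reporting LSN 200. A sync rep transaction waiting on
LSN 100 was released by that very reply, so the 250ms we measure is
exactly the wait that transaction experienced.)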

Therefore I'm still thinking that it's better to maintain the pairs of LSN
and timestamp on the *primary* side. Thoughts?

Regards,

--
Fujii Masao

#20Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Fujii Masao (#19)
Re: Measuring replay lag

On Tue, Jan 17, 2017 at 7:45 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Dec 22, 2016 at 6:14 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I agree that the capability to measure the remote_apply lag is very useful.
Also I want to measure the remote_write and remote_flush lags, for example,
in order to diagnose the cause of replication lag.

Good idea. I will think about how to make that work. There was a
proposal to make writing and flushing independent[1]. I'd like that
to go in. Then the write_lag and flush_lag could diverge
significantly, and it would be nice to be able to see that effect as
time (though you could already see it with LSN positions).

For that, what about maintaining the pairs of send-timestamp and LSN on the
*sender side* instead of the receiver side? That is, walsender adds the pairs
of send-timestamp and LSN into the buffer every sampling period.
Whenever walsender receives the write, flush and apply locations from
walreceiver, it calculates the write, flush and apply lags by comparing
the received and stored LSN and comparing the current timestamp and
stored send-timestamp.

I thought about that too, but I couldn't figure out how to make the
sampling work. If the primary is choosing (LSN, time) pairs to store
in a buffer, and the standby is sending replies at times of its
choosing (when wal_receiver_status_interval has been exceeded), then
you can't accurately measure anything.

Yeah, even though the primary stores (100, 2017-01-17 00:00:00) as the pair of
(LSN, timestamp), for example, the standby may not send back a reply for
LSN 100 itself. The primary may receive the reply for a larger LSN, like 200,
instead. So the measurement of the lag on the primary side would not be so
accurate.

But we can calculate the "sync rep" lag by comparing the stored timestamp of
LSN 100 with the timestamp at which the reply for LSN 200 is received. In sync
rep, since a transaction waiting for LSN 100 to be replicated is actually
released after the reply for LSN 200 is received, the lag calculated this way
is basically accurate as the sync rep lag.

Therefore I'm still thinking that it's better to maintain the pairs of LSN
and timestamp on the *primary* side. Thoughts?

Ok. I see that there is a new compelling reason to move the ring
buffer to the sender side: then I think lag tracking will work
automatically for the new logical replication that just landed on
master. I will try it that way. Thanks for the feedback!
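
To make sure we're talking about the same thing, the sender-side version
might look something like this (a hypothetical sketch only: the names are
illustrative rather than from any posted patch, and it assumes integer
timestamps so that subtracting two TimestampTz values yields microseconds):

	/* One (end-of-WAL, send-time) sample recorded by walsender. */
	typedef struct
	{
		XLogRecPtr	lsn;
		TimestampTz	time;
	} LagSample;

	#define LAG_BUFFER_SIZE 8192

	static LagSample lag_samples[LAG_BUFFER_SIZE];
	static int	lag_read_head;		/* oldest sample not yet consumed */
	static int	lag_write_head;		/* next slot to fill */

	/*
	 * When a standby reply reports progress up to 'reply_lsn', consume
	 * every sample with lsn <= reply_lsn and compute the lag from the
	 * newest such sample, using two observations of the primary's own
	 * clock.  Returns -1 if no sample was consumed.
	 */
	static int64
	lag_read(XLogRecPtr reply_lsn, TimestampTz now)
	{
		int64		result = -1;

		while (lag_read_head != lag_write_head &&
			   lag_samples[lag_read_head].lsn <= reply_lsn)
		{
			result = (int64) (now - lag_samples[lag_read_head].time);
			lag_read_head = (lag_read_head + 1) % LAG_BUFFER_SIZE;
		}

		return result;
	}

In practice there would need to be one read head per reported position
(write, flush and apply), since those three positions trail the stream
at different rates.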

--
Thomas Munro
http://www.enterprisedb.com

#21Michael Paquier
michael.paquier@gmail.com
In reply to: Thomas Munro (#20)
Re: Measuring replay lag

On Sat, Jan 21, 2017 at 10:49 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Ok. I see that there is a new compelling reason to move the ring
buffer to the sender side: then I think lag tracking will work
automatically for the new logical replication that just landed on
master. I will try it that way. Thanks for the feedback!

Seeing no new patches, marked as returned with feedback. Feel free of
course to refresh the CF entry once you have a new patch!
--
Michael

#22Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Michael Paquier (#21)
1 attachment(s)
Re: Measuring replay lag

On Wed, Feb 1, 2017 at 5:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Sat, Jan 21, 2017 at 10:49 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Ok. I see that there is a new compelling reason to move the ring
buffer to the sender side: then I think lag tracking will work
automatically for the new logical replication that just landed on
master. I will try it that way. Thanks for the feedback!

Seeing no new patches, marked as returned with feedback. Feel free of
course to refresh the CF entry once you have a new patch!

Here is a new version with the buffer on the sender side as requested.
Since it now shows write, flush and replay lag, not just replay, I
decided to rename it and start counting versions at 1 again.
replication-lag-v1.patch is less than half the size of
replay-lag-v16.patch and considerably simpler. There is no more GUC
and no more protocol change.

While the write and flush locations are sent back at the right times
already, I had to figure out how to get replies to be sent at the
right time when WAL was replayed too. Without doing anything special
for that, you get the following cases:

1. A busy system: replies flow regularly due to write and flush
feedback, and those replies include replay position, so there is no
problem.

2. A system that has just streamed a lot of WAL causing the standby
to fall behind in replaying, but the primary is now idle: there will
only be replies every 10 seconds (wal_receiver_status_interval), so
pg_stat_replication.replay_lag only updates with that frequency.
(That was already the case for replay_location).

3. An idle system that has just replayed some WAL and is now fully
caught up. There is no reply until the next
wal_receiver_status_interval; so now replay_lag shows a bogus number
over 10 seconds. Oops.

Case 1 is good, and I suppose that 2 is OK, but I needed to do
something about 3. The solution I came up with was to force one reply
to be sent whenever recovery runs out of WAL to replay and enters
WaitForWALToBecomeAvailable(). This seems to work pretty well in
initial testing.

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replication-lag-v1.patch
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5b67def..816cc9b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1413,6 +1413,24 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be written on this
+      standby server</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be flushed on this
+      standby server</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be replayed on this
+      standby server</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2dcff7f..be0aae9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11509,6 +11509,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
+	bool		streaming_reply_sent = false;
 
 	/*-------
 	 * Standby mode is implemented by a state machine:
@@ -11832,6 +11833,19 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					}
 
 					/*
+					 * Since we have replayed everything we have received so
+					 * far and are about to start waiting for more WAL, let's
+					 * tell the upstream server our replay location now so
+					 * that pg_stat_replication doesn't show stale
+					 * information.
+					 */
+					if (!streaming_reply_sent)
+					{
+						WalRcvForceReply();
+						streaming_reply_sent = true;
+					}
+
+					/*
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly.
 					 */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 38be9cf..60047d7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -705,6 +705,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index ba506e2..62e0110 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,6 +188,31 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/* A sample associating a log position with the time it was written. */
+typedef struct
+{
+	XLogRecPtr lsn;
+	TimestampTz time;
+} WalTimeSample;
+
+/* The size of our buffer of time samples. */
+#define LAG_TRACKER_BUFFER_SIZE 8192
+
+/* Constants for the read heads used in the lag tracking circular buffer. */
+#define LAG_TRACKER_WRITE_HEAD 0
+#define LAG_TRACKER_FLUSH_HEAD 1
+#define LAG_TRACKER_APPLY_HEAD 2
+#define LAG_TRACKER_NUM_READ_HEADS 3
+
+/* A mechanism for tracking replication lag. */
+static struct
+{
+	XLogRecPtr last_lsn;
+	WalTimeSample buffer[LAG_TRACKER_BUFFER_SIZE];
+	int write_head;
+	int read_heads[LAG_TRACKER_NUM_READ_HEADS];
+} LagTracker;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -219,6 +244,8 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz time);
+static int64 LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -244,6 +271,9 @@ InitWalSender(void)
 	 */
 	MarkPostmasterChildWalSender();
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+
+	/* Initialize empty timestamp buffer for lag tracking. */
+	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
 /*
@@ -1489,6 +1519,10 @@ ProcessStandbyReplyMessage(void)
 				flushPtr,
 				applyPtr;
 	bool		replyRequested;
+	int64		writeLag,
+				flushLag,
+				applyLag;
+	TimestampTz now;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1503,6 +1537,12 @@ ProcessStandbyReplyMessage(void)
 		 (uint32) (applyPtr >> 32), (uint32) applyPtr,
 		 replyRequested ? " (reply requested)" : "");
 
+	/* See if we can compute the round-trip lag for these positions. */
+	now = GetCurrentTimestamp();
+	writeLag = LagTrackerRead(LAG_TRACKER_WRITE_HEAD, writePtr, now);
+	flushLag = LagTrackerRead(LAG_TRACKER_FLUSH_HEAD, flushPtr, now);
+	applyLag = LagTrackerRead(LAG_TRACKER_APPLY_HEAD, applyPtr, now);
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false);
@@ -1518,6 +1558,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLag != -1)
+			walsnd->writeLag = writeLag;
+		if (flushLag != -1)
+			walsnd->flushLag = flushLag;
+		if (applyLag != -1)
+			walsnd->applyLag = applyLag;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1914,6 +1960,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLag = -1;
+			walsnd->flushLag = -1;
+			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2239,6 +2288,13 @@ XLogSendPhysical(void)
 	}
 
 	/*
+	 * Record the current system time as an approximation of the time at which
+	 * this WAL position was written for the purposes of lag tracking if it
+	 * has moved forwards.
+	 */
+	LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());
+
+	/*
 	 * If this is a historic timeline and we've reached the point where we
 	 * forked to the next timeline, stop streaming.
 	 *
@@ -2688,6 +2744,21 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+usecs_to_interval(uint64 usecs)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+#ifdef HAVE_INT64_TIMESTAMP
+	result->time = usecs;
+#else
+	result->time = usecs / 1000000.0;
+#endif
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2696,7 +2767,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2744,6 +2815,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		int64		writeLag;
+		int64		flushLag;
+		int64		applyLag;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2758,6 +2832,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLag = walsnd->writeLag;
+		flushLag = walsnd->flushLag;
+		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2799,7 +2876,22 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			if (writeLag < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(usecs_to_interval(writeLag));
+
+			if (flushLag < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(usecs_to_interval(flushLag));
+
+			if (applyLag < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(usecs_to_interval(applyLag));
+
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2813,12 +2905,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -2886,3 +2978,85 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 			WalSndShutdown();
 	}
 }
+
+/*
+ * Record the end of the WAL and the current time, so that we can compute the
+ * lag when this WAL position is eventually reported by the standby.
+ */
+static void
+LagTrackerWrite(XLogRecPtr lsn, TimestampTz time)
+{
+	bool buffer_full;
+	int new_write_head;
+	int i;
+
+	/*
+	 * If the lsn hasn't advanced since last time, then do nothing.  This way
+	 * we only record a new sample when new WAL has been written, which is
+	 * simple proxy for the time at which the log was written.
+	 */
+	if (LagTracker.last_lsn == lsn)
+		return;
+	LagTracker.last_lsn = lsn;
+
+	/*
+	 * If advancing the write head of the circular buffer would crash into any
+	 * of the read heads, then the buffer is full.  In other words, the
+	 * slowest reader (presumably apply) is the one that controls the release
+	 * of space.
+	 */
+	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	buffer_full = false;
+	for (i = 0; i < LAG_TRACKER_NUM_READ_HEADS; ++i)
+	{
+		if (new_write_head == LagTracker.read_heads[i])
+			buffer_full = true;
+	}
+
+	/*
+	 * If the buffer if full, for now we just rewind by one slot and overwrite
+	 * the last sample, as a simple if somewhat uneven way to lower the
+	 * sampling rate.  There may be better adaptive compaction algorithms.
+	 */
+	if (buffer_full)
+	{
+		new_write_head = LagTracker.write_head;
+		LagTracker.write_head =
+			(LagTracker.write_head + LAG_TRACKER_BUFFER_SIZE - 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	/* Store a sample at the current write head position. */
+	LagTracker.buffer[LagTracker.write_head].lsn = lsn;
+	LagTracker.buffer[LagTracker.write_head].time = time;
+	LagTracker.write_head = new_write_head;
+}
+
+/*
+ * Find out how much time has elapsed since WAL position 'lsn' or earlier was
+ * written to the lag tracking buffer and 'now'.  Return -1 if no time is
+ * available, and otherwise the elapsed time in microseconds.
+ */
+static int64
+LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
+{
+	TimestampTz time = 0;
+	int64 result;
+
+	/* Read all unread samples up to this LSN or end of buffer. */
+	while (LagTracker.read_heads[head] != LagTracker.write_head &&
+		   LagTracker.buffer[LagTracker.read_heads[head]].lsn <= lsn)
+	{
+		time = LagTracker.buffer[LagTracker.read_heads[head]].time;
+		LagTracker.read_heads[head] =
+			(LagTracker.read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	/* If the clock somehow went backwards, treat as not found. */
+	if (time == 0 || time > now)
+		result = -1;
+	else
+		result = TimestampTzToIntegerTimestamp(now) -
+			TimestampTzToIntegerTimestamp(time);
+
+	return result;
+}
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 9b4c012..664b584 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1777,6 +1777,20 @@ GetSQLLocalTimestamp(int32 typmod)
 }
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native timestamp to int64 format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * 1000000;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 41c12af..246336f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2772,7 +2772,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5e6ccfc..3ec7dfb 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -47,6 +47,11 @@ typedef struct WalSnd
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/* Lag times in microseconds. */
+	int64		writeLag;
+	int64		flushLag;
+	int64		applyLag;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 21651b1..765fa81 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -108,9 +108,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);
#23Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#22)
Re: Measuring replay lag

On 14 February 2017 at 11:48, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Here is a new version with the buffer on the sender side as requested.

Thanks, I will definitely review in good time to get this in PG10

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#24Abhijit Menon-Sen
ams@2ndQuadrant.com
In reply to: Thomas Munro (#22)
Re: Measuring replay lag

Hi Thomas.

At 2017-02-15 00:48:41 +1300, thomas.munro@enterprisedb.com wrote:

Here is a new version with the buffer on the sender side as requested.

This looks good.

+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be written on this
+      standby server</entry>

I think I would find a slightly more detailed explanation helpful here.

A few tiny nits:

+	 * If the lsn hasn't advanced since last time, then do nothing.  This way
+	 * we only record a new sample when new WAL has been written, which is
+	 * simple proxy for the time at which the log was written.

"which is simple" → "which is a simple"

+	 * If the buffer if full, for now we just rewind by one slot and overwrite
+	 * the last sample, as a simple if somewhat uneven way to lower the
+	 * sampling rate.  There may be better adaptive compaction algorithms.

"buffer if" → "buffer is"

+ * Find out how much time has elapsed since WAL position 'lsn' or earlier was
+ * written to the lag tracking buffer and 'now'.  Return -1 if no time is
+ * available, and otherwise the elapsed time in microseconds.

Find out how much time has elapsed "between X and 'now'", or "since X".
(I prefer the former, i.e., s/since/between/.)

-- Abhijit


#25Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#22)
Re: Measuring replay lag

On 14 February 2017 at 11:48, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, Feb 1, 2017 at 5:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Sat, Jan 21, 2017 at 10:49 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Ok. I see that there is a new compelling reason to move the ring
buffer to the sender side: then I think lag tracking will work
automatically for the new logical replication that just landed on
master. I will try it that way. Thanks for the feedback!

Seeing no new patches, marked as returned with feedback. Feel free of
course to refresh the CF entry once you have a new patch!

Here is a new version with the buffer on the sender side as requested.
Since it now shows write, flush and replay lag, not just replay, I
decided to rename it and start counting versions at 1 again.
replication-lag-v1.patch is less than half the size of
replay-lag-v16.patch and considerably simpler. There is no more GUC
and no more protocol change.

While the write and flush locations are sent back at the right times
already, I had to figure out how to get replies to be sent at the
right time when WAL was replayed too. Without doing anything special
for that, you get the following cases:

1. A busy system: replies flow regularly due to write and flush
feedback, and those replies include replay position, so there is no
problem.

2. A system that has just streamed a lot of WAL causing the standby
to fall behind in replaying, but the primary is now idle: there will
only be replies every 10 seconds (wal_receiver_status_interval), so
pg_stat_replication.replay_lag only updates with that frequency.
(That was already the case for replay_location).

3. An idle system that has just replayed some WAL and is now fully
caught up. There is no reply until the next
wal_receiver_status_interval; so now replay_lag shows a bogus number
of more than 10 seconds. Oops.

Case 1 is good, and I suppose that 2 is OK, but I needed to do
something about 3. The solution I came up with was to force one reply
to be sent whenever recovery runs out of WAL to replay and enters
WaitForWALToBecomeAvailable(). This seems to work pretty well in
initial testing.

Thoughts?

Feeling happier about this for now at least.

I think we need to document how this works more in README or header
comments. That way I can review it against what it aims to do rather
than what I think it might do.

e.g. We need to document what replay_lag represents. Does it include
write_lag and flush_lag, or is it the time since the flush_lag? i.e.
do I add all 3 together to get the full lag, or would that cause me to
double count?

How sensitive is this? Does the lag spike quickly and then disappear
again quickly? If we're sampling this every N seconds, will we get a
realistic viewpoint or just a random sample? Should we smooth the
value, or present peak info?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#26Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Abhijit Menon-Sen (#24)
1 attachment(s)
Re: Measuring replay lag

On Thu, Feb 16, 2017 at 11:18 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:

Hi Thomas.

At 2017-02-15 00:48:41 +1300, thomas.munro@enterprisedb.com wrote:

Here is a new version with the buffer on the sender side as requested.

This looks good.

Thanks for the review!

+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Estimated time taken for recent WAL records to be written on this
+      standby server</entry>

I think I would find a slightly more detailed explanation helpful here.

Fixed.

A few tiny nits:

+      * If the lsn hasn't advanced since last time, then do nothing.  This way
+      * we only record a new sample when new WAL has been written, which is
+      * simple proxy for the time at which the log was written.

"which is simple" → "which is a simple"

Fixed.

+      * If the buffer if full, for now we just rewind by one slot and overwrite
+      * the last sample, as a simple if somewhat uneven way to lower the
+      * sampling rate.  There may be better adaptive compaction algorithms.

"buffer if" → "buffer is"

Fixed.

+ * Find out how much time has elapsed since WAL position 'lsn' or earlier was
+ * written to the lag tracking buffer and 'now'.  Return -1 if no time is
+ * available, and otherwise the elapsed time in microseconds.

Find out how much time has elapsed "between X and 'now'", or "since X".
(I prefer the former, i.e., s/since/between/.)

Fixed.

I also added some more comments in response to Simon's request for
more explanation of how it works (but will reply to his email
separately). Please find version 2 attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replication-lag-v2.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index fad5cb0..28984d0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1417,6 +1417,36 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written it (but not yet
+      flushed it or applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_write</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written and flushed it
+      (but not yet applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_flush</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written, flushed and
+      applied it.  This can be used to gauge the expected delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_apply</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f23e108..dffadb5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11519,6 +11519,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
+	bool		streaming_reply_sent = false;
 
 	/*-------
 	 * Standby mode is implemented by a state machine:
@@ -11842,6 +11843,19 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					}
 
 					/*
+					 * Since we have replayed everything we have received so
+					 * far and are about to start waiting for more WAL, let's
+					 * tell the upstream server our replay location now so
+					 * that pg_stat_replication doesn't show stale
+					 * information.
+					 */
+					if (!streaming_reply_sent)
+					{
+						WalRcvForceReply();
+						streaming_reply_sent = true;
+					}
+
+					/*
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly.
 					 */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 38be9cf..60047d7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -705,6 +705,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index ba506e2..c1ebdff 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,6 +188,31 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/* A sample associating a log position with the time it was written. */
+typedef struct
+{
+	XLogRecPtr lsn;
+	TimestampTz time;
+} WalTimeSample;
+
+/* The size of our buffer of time samples. */
+#define LAG_TRACKER_BUFFER_SIZE 8192
+
+/* Constants for the read heads used in the lag tracking circular buffer. */
+#define LAG_TRACKER_WRITE_HEAD 0
+#define LAG_TRACKER_FLUSH_HEAD 1
+#define LAG_TRACKER_APPLY_HEAD 2
+#define LAG_TRACKER_NUM_READ_HEADS 3
+
+/* A mechanism for tracking replication lag. */
+static struct
+{
+	XLogRecPtr last_lsn;
+	WalTimeSample buffer[LAG_TRACKER_BUFFER_SIZE];
+	int write_head;
+	int read_heads[LAG_TRACKER_NUM_READ_HEADS];
+} LagTracker;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -219,6 +244,8 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
+static int64 LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -244,6 +271,9 @@ InitWalSender(void)
 	 */
 	MarkPostmasterChildWalSender();
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+
+	/* Initialize empty timestamp buffer for lag tracking. */
+	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
 /*
@@ -1489,6 +1519,10 @@ ProcessStandbyReplyMessage(void)
 				flushPtr,
 				applyPtr;
 	bool		replyRequested;
+	int64		writeLag,
+				flushLag,
+				applyLag;
+	TimestampTz now;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1503,6 +1537,12 @@ ProcessStandbyReplyMessage(void)
 		 (uint32) (applyPtr >> 32), (uint32) applyPtr,
 		 replyRequested ? " (reply requested)" : "");
 
+	/* See if we can compute the round-trip lag for these positions. */
+	now = GetCurrentTimestamp();
+	writeLag = LagTrackerRead(LAG_TRACKER_WRITE_HEAD, writePtr, now);
+	flushLag = LagTrackerRead(LAG_TRACKER_FLUSH_HEAD, flushPtr, now);
+	applyLag = LagTrackerRead(LAG_TRACKER_APPLY_HEAD, applyPtr, now);
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false);
@@ -1518,6 +1558,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLag != -1)
+			walsnd->writeLag = writeLag;
+		if (flushLag != -1)
+			walsnd->flushLag = flushLag;
+		if (applyLag != -1)
+			walsnd->applyLag = applyLag;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1914,6 +1960,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLag = -1;
+			walsnd->flushLag = -1;
+			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2239,6 +2288,32 @@ XLogSendPhysical(void)
 	}
 
 	/*
+	 * Record the current system time as an approximation of the time at which
+	 * this WAL position was written for the purposes of lag tracking.
+	 *
+	 * In theory we could make XLogFlush() record a time in shmem whenever WAL
+	 * is flushed and we could get that time as well as the LSN when we call
+	 * GetFlushRecPtr() above (and likewise for the cascading standby
+	 * equivalent), but rather than putting any new code into the hot WAL path
+	 * it seems good enough to capture the time here.  We should reach this
+	 * after XLogFlush() runs WalSndWakeupProcessRequests(), and although that
+	 * may take some time, we read the WAL flush pointer and take the time
+	 * very close to together here so that we'll get a later position if it
+	 * is still moving.
+	 *
+	 * Because LagTrackerWrite ignores samples when the LSN hasn't advanced,
+	 * this gives us a cheap approximation for the WAL flush time for this
+	 * LSN.
+	 *
+	 * Note that the LSN is not necessarily the LSN for the data contained in
+	 * the present message; it's the end of the WAL, which might be
+	 * further ahead.  All the lag tracking machinery cares about is finding
+	 * out when that arbitrary LSN is eventually reported as written, flushed
+	 * and applied, so that it can measure the elapsed time.
+	 */
+	LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());
+
+	/*
 	 * If this is a historic timeline and we've reached the point where we
 	 * forked to the next timeline, stop streaming.
 	 *
@@ -2688,6 +2763,21 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+usecs_to_interval(uint64 usecs)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+#ifdef HAVE_INT64_TIMESTAMP
+	result->time = usecs;
+#else
+	result->time = usecs / 1000000.0;
+#endif
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2696,7 +2786,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2744,6 +2834,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		int64		writeLag;
+		int64		flushLag;
+		int64		applyLag;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2758,6 +2851,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLag = walsnd->writeLag;
+		flushLag = walsnd->flushLag;
+		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2799,7 +2895,22 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			if (writeLag < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(usecs_to_interval(writeLag));
+
+			if (flushLag < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(usecs_to_interval(flushLag));
+
+			if (applyLag < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(usecs_to_interval(applyLag));
+
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2813,12 +2924,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -2886,3 +2997,94 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 			WalSndShutdown();
 	}
 }
+
+/*
+ * Record the end of the WAL and the time it was flushed locally, so that
+ * LagTrackerRead can compute the elapsed time (lag) when this WAL position is
+ * eventually reported to have been written, flushed and applied by the
+ * standby in a reply message.
+ */
+static void
+LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
+{
+	bool buffer_full;
+	int new_write_head;
+	int i;
+
+	/*
+	 * If the lsn hasn't advanced since last time, then do nothing.  This way
+	 * we only record a new sample when new WAL has been written.
+	 */
+	if (LagTracker.last_lsn == lsn)
+		return;
+	LagTracker.last_lsn = lsn;
+
+	/*
+	 * If advancing the write head of the circular buffer would crash into any
+	 * of the read heads, then the buffer is full.  In other words, the
+	 * slowest reader (presumably apply) is the one that controls the release
+	 * of space.
+	 */
+	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	buffer_full = false;
+	for (i = 0; i < LAG_TRACKER_NUM_READ_HEADS; ++i)
+	{
+		if (new_write_head == LagTracker.read_heads[i])
+			buffer_full = true;
+	}
+
+	/*
+	 * If the buffer is full, for now we just rewind by one slot and overwrite
+	 * the last sample, as a simple (if somewhat uneven) way to lower the
+	 * sampling rate.  There may be better adaptive compaction algorithms.
+	 */
+	if (buffer_full)
+	{
+		new_write_head = LagTracker.write_head;
+		LagTracker.write_head =
+			(LagTracker.write_head + LAG_TRACKER_BUFFER_SIZE - 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	/* Store a sample at the current write head position. */
+	LagTracker.buffer[LagTracker.write_head].lsn = lsn;
+	LagTracker.buffer[LagTracker.write_head].time = local_flush_time;
+	LagTracker.write_head = new_write_head;
+}
+
+/*
+ * Find out how much time has elapsed between the moment WAL position 'lsn'
+ * (or the highest known earlier LSN) was flushed locally and the time 'now'.
+ * We have a separate read head for each of the reported LSN locations we
+ * receive in replies from standby; 'head' controls which read head is
+ * used.  Whenever a read head crosses an LSN which was written into the
+ * lag buffer with LagTrackerWrite, we can use the associated timestamp to
+ * find out the time this LSN (or an earlier one) was flushed locally, and
+ * therefore compute the lag.
+ *
+ * Return -1 if no new sample data is available, and otherwise the elapsed
+ * time in microseconds.
+ */
+static int64
+LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
+{
+	TimestampTz time = 0;
+	int64 result;
+
+	/* Read all unread samples up to this LSN or end of buffer. */
+	while (LagTracker.read_heads[head] != LagTracker.write_head &&
+		   LagTracker.buffer[LagTracker.read_heads[head]].lsn <= lsn)
+	{
+		time = LagTracker.buffer[LagTracker.read_heads[head]].time;
+		LagTracker.read_heads[head] =
+			(LagTracker.read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	/* If the clock somehow went backwards, treat as not found. */
+	if (time == 0 || time > now)
+		result = -1;
+	else
+		result = TimestampTzToIntegerTimestamp(now) -
+			TimestampTzToIntegerTimestamp(time);
+
+	return result;
+}
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 9b4c012..664b584 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1777,6 +1777,20 @@ GetSQLLocalTimestamp(int32 typmod)
 }
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native timestamp to int64 format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * 1000000;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index bb7053a..c1e9d4e 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2772,7 +2772,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5e6ccfc..3ec7dfb 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -47,6 +47,11 @@ typedef struct WalSnd
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/* Lag times in microseconds. */
+	int64		writeLag;
+	int64		flushLag;
+	int64		applyLag;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 21651b1..765fa81 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -108,9 +108,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);
#27Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#25)
Re: Measuring replay lag

On Fri, Feb 17, 2017 at 12:45 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Feeling happier about this for now at least.

Thanks!

I think we need to document how this works more in README or header
comments. That way I can review it against what it aims to do rather
than what I think it might do.

I have added a bunch of new comments to the -v2 patch to explain how
it works (see
reply to Abhijit). Please let me know if you think I need to add
still more. I'm especially interested in your feedback on the block
of comments above the line:

+ LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());

Specifically, your feedback on the sufficiency of this (LSN, time)
pair + filtering out repeat LSNs as an approximation of the time this
LSN was flushed.

e.g. We need to document what replay_lag represents. Does it include
write_lag and flush_lag, or is it the time since the flush_lag? i.e.
do I add all 3 together to get the full lag, or would that cause me to
double count?

I have included full descriptions of exactly what the 3 times
represent in the user documentation in the -v2 patch.  In short, each
of write_lag, flush_lag and replay_lag is measured from the time the
WAL was flushed locally, so replay_lag already covers the write and
flush stages; adding the three together would double count.

How sensitive is this? Does the lag spike quickly and then disappear
again quickly? If we're sampling this every N seconds, will we get a
realistic viewpoint or just a random sample?

In my testing it seems to move fairly smoothly so I think sampling
every N seconds would be quite effective and would not be 'noisy'.
The main time it jumps quickly is at the end of a large data load,
when a slow standby finally reaches the end of its backlog; you see it
climb slowly up and up while the faster primary is busy generating WAL
too fast for it to apply, but then if the primary goes idle the
standby eventually catches up. The high lag number sometimes lingers
for a bit and then pops down to a low number when new WAL arrives that
can be applied quickly. It seems like a very accurate depiction of
what is really happening so I like that. I would love to hear other
opinions and feedback/testing experiences!

Should we smooth the
value, or present peak info?

Hmm. Well, it might be interesting to do online exponential moving
averages, similar to the three numbers Unix systems present for load.
On the other hand, I'm amazed no one has complained that I'm making
pg_stat_replication ridiculously wide already, and users/monitoring
systems could easily do that kind of thing themselves, and the number
doesn't seem to be jumpy/noisy/in need of smoothing. Same would go for
logging over time; seems like an external monitoring tool's bailiwick.
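
For what it's worth, here is a minimal sketch of the kind of smoothing
a monitoring layer could bolt on top (illustrative only, not part of
the patch; the function name and the alpha parameter are made up):

/*
 * Illustrative only: an exponential moving average over successive
 * lag readings in microseconds, analogous to Unix load averages.
 * 'alpha' in (0, 1] controls responsiveness; a negative 'smoothed'
 * value means "no previous observation".
 */
static int64
smooth_lag(int64 smoothed, int64 sample, double alpha)
{
	if (smoothed < 0)
		return sample;			/* first observation seeds the average */
	return (int64) (alpha * (double) sample +
					(1.0 - alpha) * (double) smoothed);
}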

--
Thomas Munro
http://www.enterprisedb.com


#28Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#27)
Re: Measuring replay lag

On 17 February 2017 at 07:45, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Fri, Feb 17, 2017 at 12:45 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Feeling happier about this for now at least.

Thanks!

And happier again, leading me to move to the next stage of review,
focusing on the behaviour emerging from the design.

So my current understanding is that this doesn't rely upon LSN
arithmetic to measure lag, which is good. That means logical
replication should "just work" and future mechanisms to filter
physical WAL will also just work. This is important, so please comment
if you see that isn't the case.

I notice that LagTrackerRead() doesn't do anything to interpolate the
time given, so at present any attempt to prune the lag sample buffer
would result in falsely increasing the lag times reported. Which is
probably the reason why you say "There may be better adaptive
compaction algorithms." We need to look at this some more; an initial
guess might be that we need to insert fewer samples as the buffer
fills, since the LagTrackerRead() algorithm is O(N) in the number of
samples and thus increasing the buffer itself isn't a great plan.

It would be very nice to be able to say something like: the +/-
confidence limits of the lag are no more than 50% of the lag time, so
we have some idea of how accurate the value is at any point. We need
to document the accuracy of the result, otherwise we'll be answering
questions on that for some time. So let's think about that now.

Given LagTrackerRead() is reading the 3 positions in order, it seems
sensible to start reading the LAG_TRACKER_FLUSH_HEAD from the place
you finished reading LAG_TRACKER_WRITE_HEAD etc. Otherwise we end up
doing way too much work with larger buffers.

Which makes me think about the read more. The present design
calculates everything on receipt of standby messages. I think we
should simply record the last few messages and do the lag calculation
when the values are later read, if indeed they are ever read. That
would allow us a much better diagnostic view, at least. And it would
allow you to show a) latest value, b) smoothed in various ways, or c)
detail of last few messages for diagnostics. The latest value would be
the default value in pg_stat_replication - I agree we shouldn't make
that overly wide, so we'd need another function to access the details.

What is critical is that we report stable values as lag increases.
i.e. we need to iron out any usage cases so we don't have to fix them
in PG11 and spend a year telling people "yeh, it does that" (like
we've been doing for some time). So the diagnostics will help us
investigate this patch over various use cases...

I think we need to show some test results, with a graph of lag
over time, for these cases:
1. steady state - pgbench on master, so we can see how that responds
2. blocked apply on standby - so we can see how the lag increases but
also how the accuracy goes down as the lag increases and whether the
reported value changes (depending upon algo)
3. burst mode - where we go from not moving to moving at high speed
and then stop again quickly
+other use cases you or others add

Does the proposed algo work for these cases? What goes wrong with it?
The examination of these downsides, if any, is what we
need to investigate now to allow this to get committed.

Some minor points on code...
Why are things defined in walsender.c and not in .h?
Why is LAG_TRACKER_NUM_READ_HEADS not the same as NUM_SYNC_REP_WAIT_MODE?
...and other related constants shouldn't be redefined either.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#29Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#28)
5 attachment(s)
Re: Measuring replay lag

On Tue, Feb 21, 2017 at 6:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

And happier again, leading me to move to the next stage of review,
focusing on the behaviour emerging from the design.

So my current understanding is that this doesn't rely upon LSN
arithmetic to measure lag, which is good. That means logical
replication should "just work" and future mechanisms to filter
physical WAL will also just work. This is important, so please comment
if you see that isn't the case.

Yes, my understanding (based on
/messages/by-id/f453caad-0396-1bdd-c5c1-5094371f4776@2ndquadrant.com
) is that this should in principle work for logical replication; it
just might show the same number in 2 or 3 of the lag columns because
of the way it reports LSNs.

However, I think a call like LagTrackerWrite(SendRqstPtr,
GetCurrentTimestamp()) needs to go into XLogSendLogical, to mirror
what happens in XLogSendPhysical. I'm not sure about that.

I notice that LagTrackerRead() doesn't do anything to interpolate the
time given, so at present any attempt to prune the lag sample buffer
would result in falsely increasing the lag times reported. Which is
probably the reason why you say "There may be better adaptive
compaction algorithms." We need to look at this some more; an initial
guess might be that we need to insert fewer samples as the buffer
fills, since the LagTrackerRead() algorithm is O(N) in the number of
samples and thus increasing the buffer itself isn't a great plan.

Interesting idea about interpolation.  The lack of it didn't "result
in falsely increasing the lag times reported"; rather, it caused the
reported lag to stay static for a period of time even though we were
falling further behind.  I ended up fixing this with interpolation;
see below.

About adaptive sampling: This patch does in fact "insert fewer
samples once the buffer fills". Normally, the sender records a sample
every time it sends a message. Now imagine that the standby's
recovery is very slow and the buffer fills up. The sender starts
repeatedly overwriting the same buffer element because the write head
has crashed into the slow-moving read head.  Every time the standby
makes some progress and reports it, the read head finally advances,
releasing some space, so the sender is able to advance to the next
element and record a new sample (and probably overwrite that one many
times). So effectively we reduce our sampling rate for all new
samples. We finish up with a sampling rate that is determined by the
rate of standby progress. I expect you can make something a bit
smoother and more sophisticated that starts lowering the sampling rate
sooner and perhaps thins out the pre-existing samples when the buffer
fills up, and I'm open to ideas, but my intuition is that it would be
complicated and no one would even notice the difference.
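
To make that degradation concrete, here is a self-contained toy
(assumption: a 4-slot ring with a single stalled read head, instead of
the patch's 8192 slots and three read heads) demonstrating that when
the write head collides with a stalled reader, the oldest samples
survive and only the newest slot keeps being rewritten:

#include <stdio.h>

#define TOY_BUFFER_SIZE 4

static int samples[TOY_BUFFER_SIZE];
static int write_head;
static int read_head;			/* stalled at slot 0 */

static void
toy_write(int sample)
{
	int new_write_head = (write_head + 1) % TOY_BUFFER_SIZE;

	if (new_write_head == read_head)
	{
		/* Full: rewind and overwrite the newest sample, as in
		 * LagTrackerWrite, lowering the effective sampling rate. */
		new_write_head = write_head;
		write_head = (write_head + TOY_BUFFER_SIZE - 1) % TOY_BUFFER_SIZE;
	}
	samples[write_head] = sample;
	write_head = new_write_head;
}

int
main(void)
{
	int i;

	for (i = 1; i <= 10; i++)
		toy_write(i);
	/* Prints 1, 2, 10, 0: samples 1 and 2 survive, slot 2 was
	 * rewritten with each newer sample, and slot 3 is the gap kept
	 * before the stalled read head. */
	for (i = 0; i < TOY_BUFFER_SIZE; i++)
		printf("slot %d = %d\n", i, samples[i]);
	return 0;
}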

LagTrackerRead() is O(N) not in the total number of samples, but in
the number of samples whose LSN is <= the LSN in the reply message
we're processing.  Given that the sender records samples as it sends
messages, and the standby sends replies on write/flush of those
messages, I think the N in question here will typically be a very
small number except in the case below called 'overwhelm.png' when the
WAL sender would be otherwise completely idle.

It would be very nice to be able to say something like: the +/-
confidence limits of the lag are no more than 50% of the lag time, so
we have some idea of how accurate the value is at any point. We need
to document the accuracy of the result, otherwise we'll be answering
questions on that for some time. So let's think about that now.

The only source of inaccuracy I can think of right now is that if
XLogSendPhysical doesn't run very often, then we won't notice the
local flushed LSN moving until a bit later, and to the extent that
we're late noticing that, we could underestimate the lag numbers. But
actually it runs very frequently and is woken whenever WAL is flushed.
This gap could be closed by recording the system time in shared memory
whenever local WAL is flushed; as described in a large comment in the
patch, I figured this wasn't worth that.

Given LagTrackerRead() is reading the 3 positions in order, it seems
sensible to start reading the LAG_TRACKER_FLUSH_HEAD from the place
you finished reading LAG_TRACKER_WRITE_HEAD etc. Otherwise we end up
doing way too much work with larger buffers.

Hmm. I was under the impression that we'd nearly always be eating a
very small number of samples with each reply message, since standbys
usually report progress frequently. But yeah, if the buffer is full
AND the standby is sending very infrequent replies because the primary
is idle, then perhaps we could try to figure out how to skip ahead
faster than one at a time.

Which makes me think about the read more. The present design
calculates everything on receipt of standby messages. I think we
should simply record the last few messages and do the lag calculation
when the values are later read, if indeed they are ever read. That
would allow us a much better diagnostic view, at least. And it would
allow you to show a) latest value, b) smoothed in various ways, or c)
detail of last few messages for diagnostics. The latest value would be
the default value in pg_stat_replication - I agree we shouldn't make
that overly wide, so we'd need another function to access the details.

I think you need to record at least the system clock time and advance
the read heads up to the reported LSNs when you receive a reply. So
the amount of work you could defer to some later time would be almost
none: just subtracting one time from another.

What is critical is that we report stable values as lag increases.
i.e. we need to iron out any usage cases so we don't have to fix them
in PG11 and spend a year telling people "yeh, it does that" (like
we've been doing for some time). So the diagnostics will help us
investigate this patch over various use cases...

+1

I think we need to show some test results, with a graph of lag
over time, for these cases:
1. steady state - pgbench on master, so we can see how that responds
2. blocked apply on standby - so we can see how the lag increases but
also how the accuracy goes down as the lag increases and whether the
reported value changes (depending upon algo)
3. burst mode - where we go from not moving to moving at high speed
and then stop again quickly
+other use cases you or others add

Good idea. Here are some graphs. This is from a primary/standby pair
running on my local development machine, so the times are low in the
good cases. For 1 and 2 I used pgbench TPCB-sort-of. For 3 I used a
loop that repeatedly dropped and created a huge table, sleeping in
between.

Does the proposed algo work for these cases? What goes wrong with it?
The examination of these downsides, if any, is what we
need to investigate now to allow this to get committed.

The main problem I discovered was with 2. If replay is paused, then
the reported LSN completely stops advancing, so replay_lag plateaus.
When you resume replay, it starts reporting LSNs advancing again and
suddenly discovers and reports a huge lag because it advances past the
next sample in the buffer.

I realised that you had suggested the solution to this problem
already: interpolation. I have added simple linear interpolation that
checks if there is a future LSN in the buffer, and if so it
interpolates linearly to synthesise the local flush time of the
reported LSN, which is somewhere between the last and next sample's
recorded local flush time. This seems to work well for the
apply-totally-stopped case.
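
The shape of the calculation is roughly as follows (a simplified
sketch rather than the exact -v3 code; it assumes integer timestamps
and that 'prev' and 'next' are the buffered samples straddling the
reported position):

/*
 * Sketch only: synthesise the local flush time of 'lsn', which falls
 * between buffered samples 'prev' and 'next', by interpolating
 * linearly on LSN position.
 */
static TimestampTz
interpolate_flush_time(WalTimeSample prev, WalTimeSample next,
					   XLogRecPtr lsn)
{
	double		fraction;

	if (lsn <= prev.lsn || next.lsn <= prev.lsn)
		return prev.time;		/* nothing to interpolate */
	if (lsn >= next.lsn)
		return next.time;
	fraction = (double) (lsn - prev.lsn) / (double) (next.lsn - prev.lsn);
	return prev.time + (TimestampTz) ((double) (next.time - prev.time) *
									  fraction);
}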

I added a fourth case 'overwhelm.png' which you might find
interesting.  It's essentially one 'burst' followed by a 100% idle
primary.  The primary stops sending new WAL around 50 seconds in, and
then there is no autovacuum, nothing happening at all.  The standby
is still replaying its backlog of WAL, but is sending back replies
only every 10 seconds (because no WAL is arriving, there is no other
reason to send replies except the status message timeout, which could
be lowered).  So we see some big steps, and then we finally see it
flat-line around 60 seconds because there is still no new WAL, so we
keep showing the last measured lag.  If new WAL is flushed it will pop
back to 0ish, but until then its last known measurement is ~14
seconds, which I don't think is technically wrong.

Some minor points on code...
Why are things defined in walsender.c and not in .h?

Because they are module-private.

Why is LAG_TRACKER_NUM_READ_HEADS not the same as NUM_SYNC_REP_WAIT_MODE?
...and other related constants shouldn't be redefined either.

Hmm. Ok, changed.

Please see new patch attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replication-lag-v3.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index fad5cb0..28984d0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1417,6 +1417,36 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written it (but not yet
+      flushed it or applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_write</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written and flushed it
+      (but not yet applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_flush</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written, flushed and
+      applied it.  This can be used to gauge the expected delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_apply</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f23e108..dffadb5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11519,6 +11519,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
+	bool		streaming_reply_sent = false;
 
 	/*-------
 	 * Standby mode is implemented by a state machine:
@@ -11842,6 +11843,19 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					}
 
 					/*
+					 * Since we have replayed everything we have received so
+					 * far and are about to start waiting for more WAL, let's
+					 * tell the upstream server our replay location now so
+					 * that pg_stat_replication doesn't show stale
+					 * information.
+					 */
+					if (!streaming_reply_sent)
+					{
+						WalRcvForceReply();
+						streaming_reply_sent = true;
+					}
+
+					/*
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly.
 					 */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 38be9cf..60047d7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -705,6 +705,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index ba506e2..c6334df 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,6 +188,26 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/* A sample associating a log position with the time it was written. */
+typedef struct
+{
+	XLogRecPtr lsn;
+	TimestampTz time;
+} WalTimeSample;
+
+/* The size of our buffer of time samples. */
+#define LAG_TRACKER_BUFFER_SIZE 8192
+
+/* A mechanism for tracking replication lag. */
+static struct
+{
+	XLogRecPtr last_lsn;
+	WalTimeSample buffer[LAG_TRACKER_BUFFER_SIZE];
+	int write_head;
+	int read_heads[NUM_SYNC_REP_WAIT_MODE];
+	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+} LagTracker;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -219,6 +239,8 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
+static int64 LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -244,6 +266,9 @@ InitWalSender(void)
 	 */
 	MarkPostmasterChildWalSender();
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+
+	/* Initialize empty timestamp buffer for lag tracking. */
+	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
 /*
@@ -1489,6 +1514,10 @@ ProcessStandbyReplyMessage(void)
 				flushPtr,
 				applyPtr;
 	bool		replyRequested;
+	int64		writeLag,
+				flushLag,
+				applyLag;
+	TimestampTz now;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1503,6 +1532,12 @@ ProcessStandbyReplyMessage(void)
 		 (uint32) (applyPtr >> 32), (uint32) applyPtr,
 		 replyRequested ? " (reply requested)" : "");
 
+	/* See if we can compute the round-trip lag for these positions. */
+	now = GetCurrentTimestamp();
+	writeLag = LagTrackerRead(SYNC_REP_WAIT_WRITE, writePtr, now);
+	flushLag = LagTrackerRead(SYNC_REP_WAIT_FLUSH, flushPtr, now);
+	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false);
@@ -1518,6 +1553,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLag != -1)
+			walsnd->writeLag = writeLag;
+		if (flushLag != -1)
+			walsnd->flushLag = flushLag;
+		if (applyLag != -1)
+			walsnd->applyLag = applyLag;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1914,6 +1955,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLag = -1;
+			walsnd->flushLag = -1;
+			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2239,6 +2283,32 @@ XLogSendPhysical(void)
 	}
 
 	/*
+	 * Record the current system time as an approximation of the time at which
+	 * this WAL position was written for the purposes of lag tracking.
+	 *
+	 * In theory we could make XLogFlush() record a time in shmem whenever WAL
+	 * is flushed and we could get that time as well as the LSN when we call
+	 * GetFlushRecPtr() above (and likewise for the cascading standby
+	 * equivalent), but rather than putting any new code into the hot WAL path
+	 * it seems good enough to capture the time here.  We should reach this
+	 * after XLogFlush() runs WalSndWakeupProcessRequests(), and although that
+	 * may take some time, we read the WAL flush pointer and take the time
+	 * very close together here so that we'll get a later position if it
+	 * is still moving.
+	 *
+	 * Because LagTrackerWrite ignores samples when the LSN hasn't advanced,
+	 * this gives us a cheap approximation for the WAL flush time for this
+	 * LSN.
+	 *
+	 * Note that the LSN is not necessarily the LSN for the data contained in
+	 * the present message; it's the end of the WAL, which might be
+	 * further ahead.  All the lag tracking machinery cares about is finding
+	 * out when that arbitrary LSN is eventually reported as written, flushed
+	 * and applied, so that it can measure the elapsed time.
+	 */
+	LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());
+
+	/*
 	 * If this is a historic timeline and we've reached the point where we
 	 * forked to the next timeline, stop streaming.
 	 *
@@ -2688,6 +2758,21 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+usecs_to_interval(uint64 usecs)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+#ifdef HAVE_INT64_TIMESTAMP
+	result->time = usecs;
+#else
+	result->time = usecs / 1000000.0;
+#endif
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2696,7 +2781,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2744,6 +2829,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		int64		writeLag;
+		int64		flushLag;
+		int64		applyLag;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2758,6 +2846,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLag = walsnd->writeLag;
+		flushLag = walsnd->flushLag;
+		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2799,7 +2890,22 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			if (writeLag < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(usecs_to_interval(writeLag));
+
+			if (flushLag < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(usecs_to_interval(flushLag));
+
+			if (applyLag < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(usecs_to_interval(applyLag));
+
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2813,12 +2919,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -2886,3 +2992,138 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 			WalSndShutdown();
 	}
 }
+
+/*
+ * Record the end of the WAL and the time it was flushed locally, so that
+ * LagTrackerRead can compute the elapsed time (lag) when this WAL position is
+ * eventually reported to have been written, flushed and applied by the
+ * standby in a reply message.
+ */
+static void
+LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
+{
+	bool buffer_full;
+	int new_write_head;
+	int i;
+
+	/*
+	 * If the lsn hasn't advanced since last time, then do nothing.  This way
+	 * we only record a new sample when new WAL has been written.
+	 */
+	if (LagTracker.last_lsn == lsn)
+		return;
+	LagTracker.last_lsn = lsn;
+
+	/*
+	 * If advancing the write head of the circular buffer would crash into any
+	 * of the read heads, then the buffer is full.  In other words, the
+	 * slowest reader (presumably apply) is the one that controls the release
+	 * of space.
+	 */
+	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	buffer_full = false;
+	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	{
+		if (new_write_head == LagTracker.read_heads[i])
+			buffer_full = true;
+	}
+
+	/*
+	 * If the buffer is full, for now we just rewind by one slot and overwrite
+	 * the last sample, as a simple (if somewhat uneven) way to lower the
+	 * sampling rate.  There may be better adaptive compaction algorithms.
+	 */
+	if (buffer_full)
+	{
+		new_write_head = LagTracker.write_head;
+		LagTracker.write_head =
+			(LagTracker.write_head - 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	/* Store a sample at the current write head position. */
+	LagTracker.buffer[LagTracker.write_head].lsn = lsn;
+	LagTracker.buffer[LagTracker.write_head].time = local_flush_time;
+	LagTracker.write_head = new_write_head;
+}
+
+/*
+ * Find out how much time has elapsed between the moment WAL position 'lsn'
+ * (or the highest known earlier LSN) was flushed locally and the time 'now'.
+ * We have a separate read head for each of the reported LSN locations we
+ * receive in replies from standby; 'head' controls which read head is
+ * used.  Whenever a read head crosses an LSN which was written into the
+ * lag buffer with LagTrackerWrite, we can use the associated timestamp to
+ * find out the time this LSN (or an earlier one) was flushed locally, and
+ * therefore compute the lag.
+ *
+ * Return -1 if no new sample data is available, and otherwise the elapsed
+ * time in microseconds.
+ */
+static int64
+LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
+{
+	TimestampTz time = 0;
+
+	/* Read all unread samples up to this LSN or end of buffer. */
+	while (LagTracker.read_heads[head] != LagTracker.write_head &&
+		   LagTracker.buffer[LagTracker.read_heads[head]].lsn <= lsn)
+	{
+		time = LagTracker.buffer[LagTracker.read_heads[head]].time;
+		LagTracker.last_read[head] =
+			LagTracker.buffer[LagTracker.read_heads[head]];
+		LagTracker.read_heads[head] =
+			(LagTracker.read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	if (time > now)
+	{
+		/* If the clock somehow went backwards, treat as not found. */
+		return -1;
+	}
+	else if (time == 0)
+	{
+		/*
+		 * We didn't cross a time.  If there is a future sample that we
+		 * haven't reached yet, and we've already reached at least one sample,
+		 * let's interpolate the local flushed time.  This is mainly useful for
+		 * reporting a completely stuck apply position as having increasing
+		 * lag, since otherwise we'd have to wait for it to eventually start
+		 * moving again and cross one of our samples before we can show the
+		 * lag increasing.
+		 */
+		if (LagTracker.read_heads[head] != LagTracker.write_head &&
+			LagTracker.last_read[head].time != 0)
+		{
+			double fraction;
+			WalTimeSample prev = LagTracker.last_read[head];
+			WalTimeSample next = LagTracker.buffer[LagTracker.read_heads[head]];
+
+			Assert(lsn >= prev.lsn);
+			Assert(prev.lsn < next.lsn);
+
+			if (prev.time > next.time)
+			{
+				/* If the clock somehow went backwards, treat as not found. */
+				return -1;
+			}
+
+			/* See how far we are between the previous and next samples. */
+			fraction =
+				(double) (lsn - prev.lsn) / (double) (next.lsn - prev.lsn);
+
+			/* Scale the local flush time proportionally. */
+			time = (TimestampTz)
+				((double) prev.time + (next.time - prev.time) * fraction);
+		}
+		else
+		{
+			/* Couldn't interpolate due to lack of data. */
+			return -1;
+		}
+	}
+
+	/* Return the elapsed time since local flush time in microseconds. */
+	Assert(time != 0);
+	return TimestampTzToIntegerTimestamp(now) -
+		TimestampTzToIntegerTimestamp(time);
+}
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 9b4c012..664b584 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1777,6 +1777,20 @@ GetSQLLocalTimestamp(int32 typmod)
 }
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native timestamp to int64 format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * 1000000;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index bb7053a..c1e9d4e 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2772,7 +2772,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5e6ccfc..3ec7dfb 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -47,6 +47,11 @@ typedef struct WalSnd
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/* Lag times in microseconds. */
+	int64		writeLag;
+	int64		flushLag;
+	int64		applyLag;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 21651b1..765fa81 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -108,9 +108,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);
overwhelm.png (image/png)
burst.png (image/png)
pgbench.png (image/png)
pause.png (image/png)
#30Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#28)
5 attachment(s)
Re: Measuring replay lag

On Tue, Feb 21, 2017 at 6:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I think what we need is to show some test results, with a graph of lag
over time, for these cases:
1. steady state - pgbench on master, so we can see how that responds
2. blocked apply on standby - so we can see how the lag increases but
also how the accuracy goes down as the lag increases and whether the
reported value changes (depending upon algo)
3. burst mode - where we go from not moving to moving at high speed
and then stop again quickly
+other use cases you or others add

Here are graphs of the 'burst' example from my previous email, with
LAG_TRACKER_BUFFER_SIZE set to 4 (really small so that it fills up)
and 8192 (the size I'm proposing in this patch). It looks to me like
the resampling and interpolation work pretty well when the buffer is
full.
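
As an aside on what "full" does to the data: here is a toy,
single-read-head version of the write path with a hypothetical buffer
size of 4 (the real patch checks all of the read heads), showing that
when the reader never advances, the newest slot keeps getting
overwritten while the oldest samples survive:

#include <stdio.h>
#include <stdint.h>

#define BUF_SIZE 4

static uint64_t buf_lsn[BUF_SIZE];
static int	write_head;
static int	read_head;			/* never advances in this demo */

static void
toy_lag_tracker_write(uint64_t lsn)
{
	int			new_write_head = (write_head + 1) % BUF_SIZE;

	if (new_write_head == read_head)
	{
		/* Full: rewind one slot and overwrite the newest sample. */
		new_write_head = write_head;
		write_head = (write_head > 0) ? write_head - 1 : BUF_SIZE - 1;
	}
	buf_lsn[write_head] = lsn;
	write_head = new_write_head;
}

int
main(void)
{
	uint64_t	lsn;
	int			i;

	for (lsn = 1; lsn <= 10; lsn++)
		toy_lag_tracker_write(lsn);

	/* Prints LSNs 1, 2, 10, 0: the two oldest samples, the newest, a gap. */
	for (i = 0; i < BUF_SIZE; i++)
		printf("slot %d: lsn %llu\n", i, (unsigned long long) buf_lsn[i]);
	return 0;
}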

The overall graph looks pretty similar, but the small-buffer version
is more likely to miss short hiccups caused by occasional slow WAL
fsyncs in walreceiver. See the attached graphs with 'spike' in the
name: in the large buffer version we see a short spike in write/flush
lag that results in apply falling behind, but in the small buffer
version we can only guess that that might have happened, because all
we see is apply falling behind during the 3rd and 4th write bursts.
We don't know exactly why, because we didn't have enough samples to
detect a short-lived write/flush delay.

The workload just does this in a loop:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo AS SELECT generate_series(1, 10000000);
SELECT pg_sleep(10);

While testing with a small buffer I found a thinko in the way
write_head is moved back; it's fixed in the attached.
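
For anyone following along, the thinko is that C's % operator keeps
the sign of the dividend, so the v3 rewind expression
(write_head - 1) % LAG_TRACKER_BUFFER_SIZE yields -1 rather than 8191
when write_head is 0.  A minimal demonstration of the failure and of
the explicit wraparound v4 uses instead:

#include <stdio.h>

#define LAG_TRACKER_BUFFER_SIZE 8192

int
main(void)
{
	int			write_head = 0;
	int			rewound;

	/* v3: integer division truncates toward zero, so this prints -1. */
	printf("v3 rewind: %d\n", (write_head - 1) % LAG_TRACKER_BUFFER_SIZE);

	/* v4: branch to wrap around explicitly; this prints 8191. */
	if (write_head > 0)
		rewound = write_head - 1;
	else
		rewound = LAG_TRACKER_BUFFER_SIZE - 1;
	printf("v4 rewind: %d\n", rewound);
	return 0;
}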

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replication-lag-v4.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index fad5cb0..28984d0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1417,6 +1417,36 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written it (but not yet
+      flushed it or applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_write</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written and flushed it
+      (but not yet applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_flush</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written, flushed and
+      applied it.  This can be used to gauge the expected delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_apply</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f23e108..dffadb5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11519,6 +11519,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
+	bool		streaming_reply_sent = false;
 
 	/*-------
 	 * Standby mode is implemented by a state machine:
@@ -11842,6 +11843,19 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					}
 
 					/*
+					 * Since we have replayed everything we have received so
+					 * far and are about to start waiting for more WAL, let's
+					 * tell the upstream server our replay location now so
+					 * that pg_stat_replication doesn't show stale
+					 * information.
+					 */
+					if (!streaming_reply_sent)
+					{
+						WalRcvForceReply();
+						streaming_reply_sent = true;
+					}
+
+					/*
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly.
 					 */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 38be9cf..60047d7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -705,6 +705,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7a40863..8be8391 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,6 +188,26 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/* A sample associating a log position with the time it was written. */
+typedef struct
+{
+	XLogRecPtr lsn;
+	TimestampTz time;
+} WalTimeSample;
+
+/* The size of our buffer of time samples. */
+#define LAG_TRACKER_BUFFER_SIZE 8192
+
+/* A mechanism for tracking replication lag. */
+static struct
+{
+	XLogRecPtr last_lsn;
+	WalTimeSample buffer[LAG_TRACKER_BUFFER_SIZE];
+	int write_head;
+	int read_heads[NUM_SYNC_REP_WAIT_MODE];
+	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+} LagTracker;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -219,6 +239,8 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
+static int64 LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -244,6 +266,9 @@ InitWalSender(void)
 	 */
 	MarkPostmasterChildWalSender();
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+
+	/* Initialize empty timestamp buffer for lag tracking. */
+	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
 /*
@@ -1495,6 +1520,10 @@ ProcessStandbyReplyMessage(void)
 				flushPtr,
 				applyPtr;
 	bool		replyRequested;
+	int64		writeLag,
+				flushLag,
+				applyLag;
+	TimestampTz now;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1509,6 +1538,12 @@ ProcessStandbyReplyMessage(void)
 		 (uint32) (applyPtr >> 32), (uint32) applyPtr,
 		 replyRequested ? " (reply requested)" : "");
 
+	/* See if we can compute the round-trip lag for these positions. */
+	now = GetCurrentTimestamp();
+	writeLag = LagTrackerRead(SYNC_REP_WAIT_WRITE, writePtr, now);
+	flushLag = LagTrackerRead(SYNC_REP_WAIT_FLUSH, flushPtr, now);
+	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false);
@@ -1524,6 +1559,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLag != -1)
+			walsnd->writeLag = writeLag;
+		if (flushLag != -1)
+			walsnd->flushLag = flushLag;
+		if (applyLag != -1)
+			walsnd->applyLag = applyLag;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1912,6 +1953,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLag = -1;
+			walsnd->flushLag = -1;
+			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2237,6 +2281,32 @@ XLogSendPhysical(void)
 	}
 
 	/*
+	 * Record the current system time as an approximation of the time at which
+	 * this WAL position was written for the purposes of lag tracking.
+	 *
+	 * In theory we could make XLogFlush() record a time in shmem whenever WAL
+	 * is flushed and we could get that time as well as the LSN when we call
+	 * GetFlushRecPtr() above (and likewise for the cascading standby
+	 * equivalent), but rather than putting any new code into the hot WAL path
+	 * it seems good enough to capture the time here.  We should reach this
+	 * after XLogFlush() runs WalSndWakeupProcessRequests(), and although that
+	 * may take some time, we read the WAL flush pointer and take the time
+	 * very close to together here so that we'll get a later position if it
+	 * is still moving.
+	 *
+	 * Because LagTrackerWrite ignores samples when the LSN hasn't advanced,
+	 * this gives us a cheap approximation for the WAL flush time for this
+	 * LSN.
+	 *
+	 * Note that the LSN is not necessarily the LSN for the data contained in
+	 * the present message; it's the end of the WAL, which might be
+	 * further ahead.  All the lag tracking machinery cares about is finding
+	 * out when that arbitrary LSN is eventually reported as written, flushed
+	 * and applied, so that it can measure the elapsed time.
+	 */
+	LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());
+
+	/*
 	 * If this is a historic timeline and we've reached the point where we
 	 * forked to the next timeline, stop streaming.
 	 *
@@ -2686,6 +2756,21 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+usecs_to_interval(uint64 usecs)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+#ifdef HAVE_INT64_TIMESTAMP
+	result->time = usecs;
+#else
+	result->time = usecs / 1000000.0;
+#endif
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2694,7 +2779,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2742,6 +2827,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		int64		writeLag;
+		int64		flushLag;
+		int64		applyLag;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2756,6 +2844,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLag = walsnd->writeLag;
+		flushLag = walsnd->flushLag;
+		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2797,7 +2888,22 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			if (writeLag < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(usecs_to_interval(writeLag));
+
+			if (flushLag < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(usecs_to_interval(flushLag));
+
+			if (applyLag < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(usecs_to_interval(applyLag));
+
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2811,12 +2917,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -2884,3 +2990,140 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 			WalSndShutdown();
 	}
 }
+
+/*
+ * Record the end of the WAL and the time it was flushed locally, so that
+ * LagTrackerRead can compute the elapsed time (lag) when this WAL position is
+ * eventually reported to have been written, flushed and applied by the
+ * standby in a reply message.
+ */
+static void
+LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
+{
+	bool buffer_full;
+	int new_write_head;
+	int i;
+
+	/*
+	 * If the lsn hasn't advanced since last time, then do nothing.  This way
+	 * we only record a new sample when new WAL has been written.
+	 */
+	if (LagTracker.last_lsn == lsn)
+		return;
+	LagTracker.last_lsn = lsn;
+
+	/*
+	 * If advancing the write head of the circular buffer would crash into any
+	 * of the read heads, then the buffer is full.  In other words, the
+	 * slowest reader (presumably apply) is the one that controls the release
+	 * of space.
+	 */
+	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	buffer_full = false;
+	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	{
+		if (new_write_head == LagTracker.read_heads[i])
+			buffer_full = true;
+	}
+
+	/*
+	 * If the buffer is full, for now we just rewind by one slot and overwrite
+	 * the last sample, as a simple (if somewhat uneven) way to lower the
+	 * sampling rate.  There may be better adaptive compaction algorithms.
+	 */
+	if (buffer_full)
+	{
+		new_write_head = LagTracker.write_head;
+		if (LagTracker.write_head > 0)
+			LagTracker.write_head--;
+		else
+			LagTracker.write_head = LAG_TRACKER_BUFFER_SIZE - 1;
+	}
+
+	/* Store a sample at the current write head position. */
+	LagTracker.buffer[LagTracker.write_head].lsn = lsn;
+	LagTracker.buffer[LagTracker.write_head].time = local_flush_time;
+	LagTracker.write_head = new_write_head;
+}
+
+/*
+ * Find out how much time has elapsed between the moment WAL position 'lsn'
+ * (or the highest known earlier LSN) was flushed locally and the time 'now'.
+ * We have a separate read head for each of the reported LSN locations we
+ * receive in replies from standby; 'head' controls which read head is
+ * used.  Whenever a read head crosses an LSN which was written into the
+ * lag buffer with LagTrackerWrite, we can use the associated timestamp to
+ * find out the time this LSN (or an earlier one) was flushed locally, and
+ * therefore compute the lag.
+ *
+ * Return -1 if no new sample data is available, and otherwise the elapsed
+ * time in microseconds.
+ */
+static int64
+LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
+{
+	TimestampTz time = 0;
+
+	/* Read all unread samples up to this LSN or end of buffer. */
+	while (LagTracker.read_heads[head] != LagTracker.write_head &&
+		   LagTracker.buffer[LagTracker.read_heads[head]].lsn <= lsn)
+	{
+		time = LagTracker.buffer[LagTracker.read_heads[head]].time;
+		LagTracker.last_read[head] =
+			LagTracker.buffer[LagTracker.read_heads[head]];
+		LagTracker.read_heads[head] =
+			(LagTracker.read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	if (time > now)
+	{
+		/* If the clock somehow went backwards, treat as not found. */
+		return -1;
+	}
+	else if (time == 0)
+	{
+		/*
+		 * We didn't cross a time.  If there is a future sample that we
+		 * haven't reached yet, and we've already reached at least one sample,
+		 * let's interpolate the local flushed time.  This is mainly useful for
+		 * reporting a completely stuck apply position as having increasing
+		 * lag, since otherwise we'd have to wait for it to eventually start
+		 * moving again and cross one of our samples before we can show the
+		 * lag increasing.
+		 */
+		if (LagTracker.read_heads[head] != LagTracker.write_head &&
+			LagTracker.last_read[head].time != 0)
+		{
+			double fraction;
+			WalTimeSample prev = LagTracker.last_read[head];
+			WalTimeSample next = LagTracker.buffer[LagTracker.read_heads[head]];
+
+			Assert(lsn >= prev.lsn);
+			Assert(prev.lsn < next.lsn);
+
+			if (prev.time > next.time)
+			{
+				/* If the clock somehow went backwards, treat as not found. */
+				return -1;
+			}
+
+			/* See how far we are between the previous and next samples. */
+			fraction =
+				(double) (lsn - prev.lsn) / (double) (next.lsn - prev.lsn);
+
+			/* Scale the local flush time proportionally. */
+			time = (TimestampTz)
+				((double) prev.time + (next.time - prev.time) * fraction);
+		}
+		else
+		{
+			/* Couldn't interpolate due to lack of data. */
+			return -1;
+		}
+	}
+
+	/* Return the elapsed time since local flush time in microseconds. */
+	Assert(time != 0);
+	return TimestampTzToIntegerTimestamp(now) -
+		TimestampTzToIntegerTimestamp(time);
+}
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 997a551..51a978d 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1776,6 +1776,20 @@ GetSQLLocalTimestamp(int32 typmod)
 }
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native timestamp to int64 format
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * 1000000;
+}
+#endif
+
+/*
  * TimestampDifference -- convert the difference between two timestamps
  *		into integer seconds and microseconds
  *
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index bb7053a..c1e9d4e 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2772,7 +2772,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5e6ccfc..3ec7dfb 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -47,6 +47,11 @@ typedef struct WalSnd
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/* Lag times in microseconds. */
+	int64		writeLag;
+	int64		flushLag;
+	int64		applyLag;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 21651b1..765fa81 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -108,9 +108,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);
burst-4.png (image/png)
burst-8192.png (image/png)
burst-4-spike.png (image/png)
zY�I@���
��,� �+��*A�M��]o�7��$���@@�D��y��	�)%M�d��B�sA@/  G"��i�]�&SI�C@�d)�:?T@�fA@�D����	G�@@  y�t�g��^��L�U�rA���,���� �3�Atsb������$z��<���� �3�Q�{
��s�0Q��( S���&�)
$��,�����M�0Q� �;qS�����>d��I�!t�5�#U����� `(��k�aH���x���\�d�z���
��7L�sC@  yn���L�Q�L���	�����0.�a<UGA�P  y�G	������/���)�DO��=��	����@@��)�r������g��	@@/  5O��9������a@@���H�z�T@��|���sC@/  un���
�H) {c  5�����@�0  u�D����dy���=S��3<���	��^����~�rX2���������>�����}�*�<.~����KB�[�x�j~���v5R��P@%T+E{���[�����������j/m��=pS��4�^����b�����}��C%j��u���tG�Wm�+�m@\���\[�S��c��f!�E�X���l�Z�H��������Q-���/8�F:[av��>�?����j�u���p��0}y��Y_�����~��J@A�����_�\4���>e^�7g��G'��{_�����aE@�����#��Vu�z��w�����/*��|9���>�:����vU�/����R�o���x.#.��Z�����#���fs��q%'���J�P����'�|})�7�W����V��p
���'!�����2���
@2�n����AY�Z����jB�F�{E�������`�?���J�{#M�Md��R�C������J����>���K�����c=t5Ox-����m0�Z7S��#+1X2���?���J��1R����Y����G����#k!���>,]�)�WW
Q���Y�����'}h-�c@��fl�#�����dK
=������yx���#�2������Em.����Bs	��X2�ph��%��p$/o* WW����
Kx{�\!t��=�Yt#Yd{jm:�a},��?	���Cp���v�TG�Sh�NBC�a��9*�uR��u�� ���\�mt���|�8��y����%sX��Ch��d��h7���y.D��k�]�M�G�l�'�y�������7����:��Y,7�$��8��+J�>��H7��m����G�u���|	!6�y9�yg!���J�W�%`�\
��q����ra�r�f�L���o)�t��F���C��P�x��E����0�6�������[$�Qx-�����lF�����G���������5�YN���im'�[@#`����vQ�0l�6a���[7Nx`C��C�i���G���v�����@U�2�0;���-��"�Z@;��8��\�b�ja�����1;_\oJ�����?�y6
�F����g0{���zYj���[h��� �\�-�a���4����q��"���l��V��O�Fj�T��E�f�K
3��s���� �6� �8u���c�q���T��y�&;������_��+L��\^7�O��8l�E*l��|�0t�9��Gd�4:}����1`�'����Pw�L�0?����?	�p�d��+�D�=��pN��	����T�����,��5o��5X�w��I{�g���S�y�?����1���"��9����F����w���z�y�
�-[�����Y]��-��`^�_��m��od�N�!0�^0����
4fA�h���ew�#y��IEND�B`�
burst-8192-spike.pngimage/png; name=burst-8192-spike.pngDownload
�PNG


IHDR��,�;PLTE���������������@��Ai��� �@���0`��@�������**��@��333MMMfff�������������������22�������U����������������d�"�".�W��p��������������P����E��r��z�����k������� �����������P@Uk/���@�@��`��`�����@��@��`��p������������������������|�@�� ������???������___'''�sV��^:7	pHYs���+�IDATx�����*Ee���!������'��������$���� bU����&�����6�"�7���K��u�EoFc�`R"E�%�f�Xp�[{`�m"�c%?%��h�~������&��:|������&��j�~����������I���!���l�\���A��J�����^�j��a�T��w�Z����US�7R�nu�^W��0Mc��P����&.�o���k�VWvCX�:�{C@�����Z����:�F�G��[��F=����h�lz��0�S��a�4V���J�ZU+Y�w��`E
��G��]����y�c�(����~T�I�2����D  X#��V�57���G�H~I��@@��AR  ����S&3��>  ��� )�s�J�_����T�������9�\/b38��
�8���~7t@���w��^n[E?33�[�I����v���Dg����f���������A@0��TJ:���@@0�-`��u�cn�����F��l8��n�~?���/c��$��|��]���/+[�4Rh�����n��������u������������L@@��AR  H
I�� )��tG���I�� )V9;�P��b|��/�*M�+��n���	!;{c�}�/d5��&'�R����������������	�K����mcM�F	���i�f���]pS�Z���L3���^��y[���q��:�9tV1-���N@3l���uK]�����Q(� �z���7������$��)�-��P����������#��c@�[�����S
R�[�~�%�C��e;B����z,\�N���95���.��1�c��Cj����hu������94Dk���n=K�=������a�huD�^�����,0�<G���q]��(4�<�s�WM)H��@c���tq:R
Ro<�>���"�AR  H
I�� )$��@@��AR  H
I�� )$��@@��AR  H
)��
��@@��AR  H
I�� )$��@@��AR  H
I�� )$��@@��AR  H
I�� )��W7"R���{�f�� ��z�6|X��!�C(@�Vo|Bj���|�����j�2>�t���|�h�����H! �5�����[�a����b �Cx���V���l���6 `�*�j�|p�7�$\��� %��N�hd�[i�'�|t�:e.`#��;�����F�Xy��F�zh�kj�s�q?��x�r�k�z����t���iN>�'!/�M����0p�-~�Lt�	�1�^���4��u[U�m6�=S�6����9�������u7h<�����V�Nt���'��g�|�E!jiT���~�����?b��I��^�Zw�"�L�"`����'p�m�<����d�!�p�P/��l�!`��Z�V�J����n���tW�'�� �`������}s0���t\o�c��It�N���j8�5[C`Q�Vp:3�w��n����
�r����Z7���! L�=�2������!`R�0�|pDQJ4D?���%`+t�`��%�o_|�|@�]�2�9�S�������%��t0�jn����b�)PS�����]���3a����d&�t;`�C@�����4���+�)�&�8*��#1)�-��%�X�O�h)N@��~��%�D��'Q����(
���I��~$�d&�����(��>p�
p<np�h��K�k��j����)(Q�K����,&��������I;��m���	(R�+��Jb�T�d&`�����'�0=d��&`o���������&3k%Du�!ux�Rp>��Tk���:���Z����	�`s���,��m���R�Q3����>L�|1��/�-�X�����������K>LI�J!�8��
B-L^��e�;V�. ���r_r��x�Q����������s@�h.�p��i1Ft��������L�����r�t�x�
fJ�jil#_�]p�0����F�~/\at�L)^�^��O>��,	(�Q���.`�������q?��p+����ih��^{aWr�W�w�?���#�p�A�xLC<B��<X<;K.��� `0�-@@p�G������03�����d|Y�V���C���S4��%�d��#��v��2x7��G����#d'S��<v���W>�����M�Mv}l�U�^�����T��3���)��q��h?��e%3�Guy��}9Xs/p�sh�����}��d_��cp"mQ�-���5Y�;����m����������P��v�o! /ex}N���|xX!3p���p����{���Yq�nZlY)C@�}�][t��(�xFH��(B@��.�
��oY ������~�����@@/  5��"��
f��p�� �?��"���R�1z%����N$���������u��������FF������	C@NW���)`d��<<�p�p���.�����)O? %+?##�#o���x��7�FFh��1��f�x��
{������p��48�����Za�	�)GS���Q�����F��qaZ|iG�l����{F �!��+�g�.H�����}���8��grS@^���m-�C�z��$X�
��]]o��%N�a����52���#��'���[�[@B ���Bq���F�W���
����/�{�;����}#?[��w��3�&`��\dR��OL��l3x�Wg��n�g����p�Pj-����g��m���4��m�r�|��������
�#3C�� �y���_�[��Hw��.8`�7	��kO���$���&����o�9�����\�[<���������4�w����V�G�L>���a���`�����4��8�L�n�U  ����<!����y�$`-�i�a��[�'t�/��#`P����H�!n����a��!Q��n�u  ���c���?&��<+=,���Kc���7�z����	�?��}�d�z��K�L�%�8Cr0pY�����S��$�C�	��h�������?�;x�d����}�&QC4Ww���I��x��	�#�g�A�����F�po��]�������P���}X��g<����������'�����/><��>4���$.~��G4���eR~?���A��A!��M��p����;@��9�0xI()����V� ������&�&�A�7��X��o0� ����7	�d���j�4�P�n��@�$�A�!�q�o0� �{�z���4oy�X��)�@@�V�n��hq�$��{~��V�\���&�%�*�p>�+O�;�A9@%��2���-Jhq7'��p�J�;J�������	��=�$�i_,�'�)��F��Z��Xr�P�Gxk�!��lX�����[1�����3�.��c��jA@�(���L����Z_)a0�������{��w�
R�^-.|�B���w�.x&nR8���#��T\�4
��_h�I� ��.�����C�+9pI=6T_3�?����n���/�4��F�?��E���!�.�;zNrH����Y@�6�r�]3��_8��B������!��'<S@^�!!�>�	������50�� ��-�pa! >���U�S@�o�\	j�T�^F�%�,����+��c>��@@���}	�??���K�&��:A@��<{�?�_	��K�9z����b�	HP�{��!�!�n��[,���b�fx�;��V�/��w�% I�n�0\,U@�zA@�Y}
��h
9�i*�rqNHX/�?',T@�J�\��) U�r���\����) Y�B�.�	��$��G@^�$ ]�|����
��J����������q��A@�|�*h@���N>
���g ����O�C@�*�	���R����z��������,��k��Zm.���  A~:�J~�����A@�#E_����o�|�~~:�J~������\�,�A@�|�#%_���?��_��L�%��^o0�`H���'�<���C����b��.�|F6�RH_1Jk  ��R��! !�!��/�tN>� �MS�C��Y!�uhOB�d�1�QrB��N@V�a��k�S0��]@��%
��_Q��������=K�o��b%
��R�H+`�+�%C@:^*�����f�/��,K�!�{��! �<��No�H)`�>K%���_By*��K���3K		��g��dH��/U@�z�0R�%����g�����v���dF���R4����/� ���_%"	�U�T��`�R4�����i��]�b)��F��G�e�P@���N@��w3���8/�r1�,c��~AR4�#��Q�S%�O@��%0�Z�(����eC��|HT6����8J��tD0�JJ$`�����f�,�[�(+i���! �*>~��	�Y���YI��
���D�0��F�/����� �w~B)��C�n4�]@��E0���4��oA�E3��/�
}����SM/�����8�/ o�"
���G^2�.�Z��3?����t�]���#	�\�x���K@��! �
��y@�;�qW,��Q�[
�Q0$�T?�(�8���4���`�X��o! K�y8����c8�I
c�P'���! ��].��Z�&v����9v�;x�A$S�����lUy	��Sj��d"����y��0�H��R'�#�V��oS��?�<�e~��7/�w�2pj�"�h�9��w����-���.�����3O��o�����V� �VB��ga�-6��������������6S�b��U����g����Gt `4W%~���!�u#E���������}�y��?�����c���X�����}-��b����b������'K����s��2�
��������q���Uo+������t�_S.�!�����Y���W;~����������&^�N�F���T[����{��u���|��+>X����[����+�r��[�x�s���-�DQ-�^l��;(���a%�	h���^��'�^�W~��)�}�v��~ �-`��8.����)�'y�k��i������#����J���j��E_��~+�u�u�_G_��X�����@�m%q����(��;+�u���U��,<"zp��!dn��%��fb��
@4�.�J��eP*�j�x���!�h���G����F����������V�&3�Ka��6���������'%��1�0��]�;���2tzhO�N@�h���zZ`��Z���t�c4����#O������+��i�'!���>�.��Z�{]Ti��%������6b<����`��8�i*g�K�?E0���U����<��%�X����g�[!��\N��U9Ds	��X2�ph��%��p$/* ���lw�'�Q�%��,�����,V#Y�;��+���X
H�?	��*�p���fh�����m

�IH�=l7�Bs��T��l�r������\�.�>Z��0����=X2�R������y�m��y��	O�k����%[C��6�����RO���]����:��Y,�$��8��*r���v�v����3���z���%����tr�B@��H�W�%��.�a<l��\��E��Y�Cmh����$�Q<��t�g�����$���t3�o�t����+�]0�R��TE#����m��s��	!7�A����YNS����V,��V��7R��4�i-���'F�7����.��00����H��|��v����3�
�a��j�$j-�kkpk����j;1����y����M��v
����G�pn4u��|���{*%�r�a��v-�;��i�$���:����<0�t}�\�8K�'���t����I��q[��������W�T��&�X�qZ7� �o� �8t���c�,��S�T�����Su5��Q@�+�~��.m��&���W��-�8K�-`��Ga����x��lF��d������s\gQ���U��1�(�W���te��qg�B
���8��;���r'�r:�g�?~��8�7c��=���Q���y����"g
�#�H;|�h�6�Z17��q�v��������i�yP���U;`5nQ��Y��B|.��p�B���B������A�a��f1����3��(��IEND�B`�
#31Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#30)
Re: Measuring replay lag

On Thu, Feb 23, 2017 at 11:52 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

The overall graph looks pretty similar, but it is more likely to short
hiccups caused by occasional slow WAL fsyncs in walreceiver. See the

I meant to write "more likely to *miss* short hiccups".

--
Thomas Munro
http://www.enterprisedb.com


#32Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#29)
Re: Measuring replay lag

On 21 February 2017 at 21:38, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Tue, Feb 21, 2017 at 6:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

And happier again, leading me to move to the next stage of review,
focusing on the behaviour emerging from the design.

So my current understanding is that this doesn't rely upon LSN
arithmetic to measure lag, which is good. That means logical
replication should "just work" and future mechanisms to filter
physical WAL will also just work. This is important, so please comment
if you see that isn't the case.

Yes, my understanding (based on
/messages/by-id/f453caad-0396-1bdd-c5c1-5094371f4776@2ndquadrant.com
) is that this should in principle work for logical replication; it
just might show the same number in 2 or 3 of the lag columns because
of the way it reports LSNs.

However, I think a call like LagTrackerWrite(SendRqstPtr,
GetCurrentTimestamp()) needs to go into XLogSendLogical, to mirror
what happens in XLogSendPhysical. I'm not sure about that.

Me neither, but I think we need this for both physical and logical.

Same use cases graphs for both, I think. There might be issues with
the way LSNs work for logical.

I think what we need to show some test results with the graph of lag
over time for these cases:
1. steady state - pgbench on master, so we can see how that responds
2. blocked apply on standby - so we can see how the lag increases but
also how the accuracy goes down as the lag increases and whether the
reported value changes (depending upon algo)
3. burst mode - where we go from not moving to moving at high speed
and then stop again quickly
+other use cases you or others add

Good idea. Here are some graphs. This is from a primary/standby pair
running on my local development machine, so the times are low in the
good cases. For 1 and 2 I used pgbench TPCB-sort-of. For 3 I used a
loop that repeatedly dropped and created a huge table, sleeping in
between.

Thanks, very nice

Does the proposed algo work for these cases? What goes wrong with it?
The examination of these downsides, if any, is what we need to
investigate now to allow this to get committed.

The main problem I discovered was with 2. If replay is paused, then
the reported LSN completely stops advancing, so replay_lag plateaus.
When you resume replay, it starts reporting LSNs advancing again and
suddenly discovers and reports a huge lag because it advances past the
next sample in the buffer.

I realised that you had suggested the solution to this problem
already: interpolation. I have added simple linear interpolation that
checks if there is a future LSN in the buffer, and if so it
interpolates linearly to synthesise the local flush time of the
reported LSN, which is somewhere between the last and next sample's
recorded local flush time. This seems to work well for the
apply-totally-stopped case.
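
(To illustrate with made-up numbers: if the buffer holds samples
(lsn=1000, 12:00:00) and (lsn=2000, 12:00:10), and the standby reports
lsn=1800, we synthesise a local flush time of 12:00:08 and report the
lag as the current time minus 12:00:08.)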

Good

I added a fourth case 'overwhelm.png' which you might find
interesting. It's essentially like one 'burst' followed by a 100% idle
primary. The primary stops sending new WAL around 50 seconds in and
then there is no autovacuum, nothing happening at all. The standby is
still replaying its backlog of WAL, but is sending back replies only
every 10 seconds (because no WAL is arriving, there is no reason to
send replies other than the status message timeout, which could be
lowered). So we see some big steps, and then we finally see it
flat-line around 60 seconds because there is still no new WAL, so we
keep showing the last measured lag. If new WAL is flushed it will pop
back to 0ish, but until then its last known measurement is ~14
seconds, which I don't think is technically wrong.

If I understand what you're saying, "14 secs" would not be seen as the
correct answer by our users when the delay is now zero.

Solving that is where the keepalives need to come into play. If no new
WAL, send a keepalive and track the lag on that.

Some minor points on code...
Why are things defined in walsender.c and not in .h?

Because they are module-private.

;-) It wasn't a C question.

So looks like we're almost there.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#33Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#32)
1 attachment(s)
Re: Measuring replay lag

On Fri, Feb 24, 2017 at 9:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 21 February 2017 at 21:38, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

However, I think a call like LagTrackerWrite(SendRqstPtr,
GetCurrentTimestamp()) needs to go into XLogSendLogical, to mirror
what happens in XLogSendPhysical. I'm not sure about that.

Me neither, but I think we need this for both physical and logical.

Same use cases graphs for both, I think. There might be issues with
the way LSNs work for logical.

This seems to be problematic. Logical peers report LSN changes for
all three operations (write, flush, commit) only on commit. I suppose
that might work OK for synchronous replication, but it makes it a bit
difficult to get lag measurements that don't look really strange and
sawtoothy when you have long transactions, and overlapping
transactions might interfere with the measurements in odd ways. I
wonder if the way LSNs are reported by logical rep would need to be
changed first. I need to study this some more and would be grateful
for ideas from any of the logical rep people.
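
For concreteness, the kind of call I had in mind (a hypothetical,
untested placement, mirroring what XLogSendPhysical does) would go
where XLogSendLogical updates sentPtr after decoding a record:

	if (record != NULL)
	{
		LogicalDecodingProcessRecord(logical_decoding_ctx,
									 logical_decoding_ctx->reader);

		sentPtr = logical_decoding_ctx->reader->EndRecPtr;

		/* feed the lag tracker the position we just decoded up to */
		LagTrackerWrite(sentPtr, GetCurrentTimestamp());
	}

Whether a decoded-but-not-yet-sent LSN is the right thing to feed the
tracker is part of what I'd like input on.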

I added a fourth case 'overwhelm.png' which you might find
interesting. It's essentially like one 'burst' followed by a 100% idle
primary. The primary stops sending new WAL around 50 seconds in and
then there is no autovacuum, nothing happening at all. The standby is
still replaying its backlog of WAL, but is sending back replies only
every 10 seconds (because no WAL is arriving, there is no reason to
send replies other than the status message timeout, which could be
lowered). So we see some big steps, and then we finally see it
flat-line around 60 seconds because there is still no new WAL, so we
keep showing the last measured lag. If new WAL is flushed it will pop
back to 0ish, but until then its last known measurement is ~14
seconds, which I don't think is technically wrong.

If I understand what you're saying, "14 secs" would not be seen as the
correct answer by our users when the delay is now zero.

Solving that is where the keepalives need to come into play. If no new
WAL, send a keepalive and track the lag on that.

Hmm. Currently it works strictly with measurements of real WAL write,
flush and apply times. I rather like the simplicity of that
definition of the lag numbers, and the fact that they move only as a
result of genuine measured WAL activity. A keepalive message is never
written, flushed or applied, so if we had special cases here to show
constant 0 or measure keepalive round-trip time when we hit the end of
known WAL or something like that, the reported lag times for those
three operations wouldn't be true. In any real database cluster there
is real WAL being generated all the time, so after a big backlog is
finally processed by a standby the "14 secs" won't linger for very
long, and during the time when you see that, it really is the last
true measured lag time.

I do see why a new user trying this feature for the first time might
expect it to show a lag of 0 just as soon as sent LSN =
write/flush/apply LSN or something like that, but after some
reflection I suspect that it isn't useful information, and it would be
smoke and mirrors rather than real data.

Perhaps you are thinking about the implications for alarm/monitoring
systems. If you were worried about this phenomenon you could set your
alarm condition to sent_location != replay_location AND replay_lag >
INTERVAL 'x seconds', but I'm not actually convinced that's necessary:
the worst it could do is prolong an alarm that had been correctly
triggered until some new WAL is observed being processed fast enough.
There is an argument that until you've actually made such an
observation, you don't actually know that the alarm deserves to be
shut off yet: perhaps this way avoids some flip-flopping.
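
Spelled out as a query, that condition might look something like this
(illustrative only; substitute whatever threshold makes sense for your
environment):

	SELECT application_name
	FROM pg_stat_replication
	WHERE sent_location != replay_location
	AND replay_lag > INTERVAL '10 seconds';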

So looks like we're almost there.

Thanks for the review and ideas!

Here is a new version that is rebased on top of the recent changes
ripping out floating point timestamps. Reading those commits made it
clear to me that I should be using TimeOffset for elapsed times, not
int64, so I changed that.
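
(For reference, with float timestamps gone, both types are just 64 bit
microsecond counts, roughly:

	typedef int64 TimestampTz;	/* microseconds since 2000-01-01 */
	typedef int64 TimeOffset;	/* elapsed time in microseconds */

so the switch is about declaring intent rather than changing the
representation.)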

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replication-lag-v5.patch (application/octet-stream):
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index fad5cb0..28984d0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1417,6 +1417,36 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written it (but not yet
+      flushed it or applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_write</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written and flushed it
+      (but not yet applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_flush</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written, flushed and
+      applied it.  This can be used to gauge the expected delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_apply</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8973583..40b7b60 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11499,6 +11499,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
+	bool		streaming_reply_sent = false;
 
 	/*-------
 	 * Standby mode is implemented by a state machine:
@@ -11822,6 +11823,19 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					}
 
 					/*
+					 * Since we have replayed everything we have received so
+					 * far and are about to start waiting for more WAL, let's
+					 * tell the upstream server our replay location now so
+					 * that pg_stat_replication doesn't show stale
+					 * information.
+					 */
+					if (!streaming_reply_sent)
+					{
+						WalRcvForceReply();
+						streaming_reply_sent = true;
+					}
+
+					/*
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly.
 					 */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 38be9cf..60047d7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -705,6 +705,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9cf9eb0..0805861 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,6 +188,26 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/* A sample associating a log position with the time it was written. */
+typedef struct
+{
+	XLogRecPtr lsn;
+	TimestampTz time;
+} WalTimeSample;
+
+/* The size of our buffer of time samples. */
+#define LAG_TRACKER_BUFFER_SIZE 8192
+
+/* A mechanism for tracking replication lag. */
+static struct
+{
+	XLogRecPtr last_lsn;
+	WalTimeSample buffer[LAG_TRACKER_BUFFER_SIZE];
+	int write_head;
+	int read_heads[NUM_SYNC_REP_WAIT_MODE];
+	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+} LagTracker;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -219,6 +239,8 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
+static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -244,6 +266,9 @@ InitWalSender(void)
 	 */
 	MarkPostmasterChildWalSender();
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+
+	/* Initialize empty timestamp buffer for lag tracking. */
+	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
 /*
@@ -1496,6 +1521,10 @@ ProcessStandbyReplyMessage(void)
 				flushPtr,
 				applyPtr;
 	bool		replyRequested;
+	TimeOffset	writeLag,
+				flushLag,
+				applyLag;
+	TimestampTz now;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1510,6 +1539,12 @@ ProcessStandbyReplyMessage(void)
 		 (uint32) (applyPtr >> 32), (uint32) applyPtr,
 		 replyRequested ? " (reply requested)" : "");
 
+	/* See if we can compute the round-trip lag for these positions. */
+	now = GetCurrentTimestamp();
+	writeLag = LagTrackerRead(SYNC_REP_WAIT_WRITE, writePtr, now);
+	flushLag = LagTrackerRead(SYNC_REP_WAIT_FLUSH, flushPtr, now);
+	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false);
@@ -1525,6 +1560,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLag != -1)
+			walsnd->writeLag = writeLag;
+		if (flushLag != -1)
+			walsnd->flushLag = flushLag;
+		if (applyLag != -1)
+			walsnd->applyLag = applyLag;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1913,6 +1954,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLag = -1;
+			walsnd->flushLag = -1;
+			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2238,6 +2282,32 @@ XLogSendPhysical(void)
 	}
 
 	/*
+	 * Record the current system time as an approximation of the time at which
+	 * this WAL position was written for the purposes of lag tracking.
+	 *
+	 * In theory we could make XLogFlush() record a time in shmem whenever WAL
+	 * is flushed and we could get that time as well as the LSN when we call
+	 * GetFlushRecPtr() above (and likewise for the cascading standby
+	 * equivalent), but rather than putting any new code into the hot WAL path
+	 * it seems good enough to capture the time here.  We should reach this
+	 * after XLogFlush() runs WalSndWakeupProcessRequests(), and although that
+	 * may take some time, we read the WAL flush pointer and take the time
+	 * very close together here so that we'll get a later position if it
+	 * is still moving.
+	 *
+	 * Because LagTrackerWrite ignores samples when the LSN hasn't advanced,
+	 * this gives us a cheap approximation for the WAL flush time for this
+	 * LSN.
+	 *
+	 * Note that the LSN is not necessarily the LSN for the data contained in
+	 * the present message; it's the end of the WAL, which might be
+	 * further ahead.  All the lag tracking machinery cares about is finding
+	 * out when that arbitrary LSN is eventually reported as written, flushed
+	 * and applied, so that it can measure the elapsed time.
+	 */
+	LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());
+
+	/*
 	 * If this is a historic timeline and we've reached the point where we
 	 * forked to the next timeline, stop streaming.
 	 *
@@ -2687,6 +2757,17 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+offset_to_interval(TimeOffset offset)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+	result->time = offset;
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2695,7 +2776,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2743,6 +2824,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		TimeOffset	writeLag;
+		TimeOffset	flushLag;
+		TimeOffset	applyLag;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2757,6 +2841,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLag = walsnd->writeLag;
+		flushLag = walsnd->flushLag;
+		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2798,7 +2885,22 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			if (writeLag < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(offset_to_interval(writeLag));
+
+			if (flushLag < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(offset_to_interval(flushLag));
+
+			if (applyLag < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(offset_to_interval(applyLag));
+
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2812,12 +2914,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -2885,3 +2987,139 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 			WalSndShutdown();
 	}
 }
+
+/*
+ * Record the end of the WAL and the time it was flushed locally, so that
+ * LagTrackerRead can compute the elapsed time (lag) when this WAL position is
+ * eventually reported to have been written, flushed and applied by the
+ * standby in a reply message.
+ */
+static void
+LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
+{
+	bool buffer_full;
+	int new_write_head;
+	int i;
+
+	/*
+	 * If the lsn hasn't advanced since last time, then do nothing.  This way
+	 * we only record a new sample when new WAL has been written.
+	 */
+	if (LagTracker.last_lsn == lsn)
+		return;
+	LagTracker.last_lsn = lsn;
+
+	/*
+	 * If advancing the write head of the circular buffer would crash into any
+	 * of the read heads, then the buffer is full.  In other words, the
+	 * slowest reader (presumably apply) is the one that controls the release
+	 * of space.
+	 */
+	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	buffer_full = false;
+	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	{
+		if (new_write_head == LagTracker.read_heads[i])
+			buffer_full = true;
+	}
+
+	/*
+	 * If the buffer is full, for now we just rewind by one slot and overwrite
+	 * the last sample, as a simple (if somewhat uneven) way to lower the
+	 * sampling rate.  There may be better adaptive compaction algorithms.
+	 */
+	if (buffer_full)
+	{
+		new_write_head = LagTracker.write_head;
+		if (LagTracker.write_head > 0)
+			LagTracker.write_head--;
+		else
+			LagTracker.write_head = LAG_TRACKER_BUFFER_SIZE - 1;
+	}
+
+	/* Store a sample at the current write head position. */
+	LagTracker.buffer[LagTracker.write_head].lsn = lsn;
+	LagTracker.buffer[LagTracker.write_head].time = local_flush_time;
+	LagTracker.write_head = new_write_head;
+}
+
+/*
+ * Find out how much time has elapsed between the moment WAL position 'lsn'
+ * (or the highest known earlier LSN) was flushed locally and the time 'now'.
+ * We have a separate read head for each of the reported LSN locations we
+ * receive in replies from standby; 'head' controls which read head is
+ * used.  Whenever a read head crosses an LSN which was written into the
+ * lag buffer with LagTrackerWrite, we can use the associated timestamp to
+ * find out the time this LSN (or an earlier one) was flushed locally, and
+ * therefore compute the lag.
+ *
+ * Return -1 if no new sample data is available, and otherwise the elapsed
+ * time in microseconds.
+ */
+static TimeOffset
+LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
+{
+	TimestampTz time = 0;
+
+	/* Read all unread samples up to this LSN or end of buffer. */
+	while (LagTracker.read_heads[head] != LagTracker.write_head &&
+		   LagTracker.buffer[LagTracker.read_heads[head]].lsn <= lsn)
+	{
+		time = LagTracker.buffer[LagTracker.read_heads[head]].time;
+		LagTracker.last_read[head] =
+			LagTracker.buffer[LagTracker.read_heads[head]];
+		LagTracker.read_heads[head] =
+			(LagTracker.read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	if (time > now)
+	{
+		/* If the clock somehow went backwards, treat as not found. */
+		return -1;
+	}
+	else if (time == 0)
+	{
+		/*
+		 * We didn't cross a time.  If there is a future sample that we
+		 * haven't reached yet, and we've already reached at least one sample,
+		 * let's interpolate the local flushed time.  This is mainly useful for
+		 * reporting a completely stuck apply position as having increasing
+		 * lag, since otherwise we'd have to wait for it to eventually start
+		 * moving again and cross one of our samples before we can show the
+		 * lag increasing.
+		 */
+		if (LagTracker.read_heads[head] != LagTracker.write_head &&
+			LagTracker.last_read[head].time != 0)
+		{
+			double fraction;
+			WalTimeSample prev = LagTracker.last_read[head];
+			WalTimeSample next = LagTracker.buffer[LagTracker.read_heads[head]];
+
+			Assert(lsn >= prev.lsn);
+			Assert(prev.lsn < next.lsn);
+
+			if (prev.time > next.time)
+			{
+				/* If the clock somehow went backwards, treat as not found. */
+				return -1;
+			}
+
+			/* See how far we are between the previous and next samples. */
+			fraction =
+				(double) (lsn - prev.lsn) / (double) (next.lsn - prev.lsn);
+
+			/* Scale the local flush time proportionally. */
+			time = (TimestampTz)
+				((double) prev.time + (next.time - prev.time) * fraction);
+		}
+		else
+		{
+			/* Couldn't interpolate due to lack of data. */
+			return -1;
+		}
+	}
+
+	/* Return the elapsed time since local flush time in microseconds. */
+	Assert(time != 0);
+	return now - time;
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index a4cc86d..14ea89c 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2772,7 +2772,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5e6ccfc..1211951 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -47,6 +47,11 @@ typedef struct WalSnd
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/* Measured lag times. */
+	TimeOffset	writeLag;
+	TimeOffset	flushLag;
+	TimeOffset	applyLag;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c661f1d..6ebcabc 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1831,10 +1831,13 @@ pg_stat_replication| SELECT s.pid,
     w.write_location,
     w.flush_location,
     w.replay_location,
+    w.write_lag,
+    w.flush_lag,
+    w.replay_lag,
     w.sync_priority,
     w.sync_state
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
#34Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#33)
Re: Measuring replay lag

On 1 March 2017 at 10:47, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Fri, Feb 24, 2017 at 9:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 21 February 2017 at 21:38, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

However, I think a call like LagTrackerWrite(SendRqstPtr,
GetCurrentTimestamp()) needs to go into XLogSendLogical, to mirror
what happens in XLogSendPhysical. I'm not sure about that.

Me neither, but I think we need this for both physical and logical.

Same use cases graphs for both, I think. There might be issues with
the way LSNs work for logical.

This seems to be problematic. Logical peers report LSN changes for
all three operations (write, flush, commit) only on commit. I suppose
that might work OK for synchronous replication, but it makes it a bit
difficult to get lag measurements that don't look really strange and
sawtoothy when you have long transactions, and overlapping
transactions might interfere with the measurements in odd ways. I
wonder if the way LSNs are reported by logical rep would need to be
changed first. I need to study this some more and would be grateful
for ideas from any of the logical rep people.

I have no doubt there are problems with the nature of logical
replication that affect this. Those things are not the fault of this
patch, but that doesn't put them entirely out of scope either.

What we want from this patch is something that works for both, as much
as that is possible.

With that in mind, this patch should be able to provide sensible lag
measurements from a simple case like logical replication of a standard
pgbench run. If that highlights problems with this patch then we can
fix them here.

Thanks

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#35Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#33)
Re: Measuring replay lag

On 1 March 2017 at 10:47, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

I added a fourth case 'overwhelm.png' which you might find
interesting. It's essentially like one 'burst' followed by a 100% idle
primary. The primary stops sending new WAL around 50 seconds in and
then there is no autovacuum, nothing happening at all. The standby is
still replaying its backlog of WAL, but is sending back replies only
every 10 seconds (because no WAL is arriving, there is no reason to
send replies other than the status message timeout, which could be
lowered). So we see some big steps, and then we finally see it
flat-line around 60 seconds because there is still no new WAL, so we
keep showing the last measured lag. If new WAL is flushed it will pop
back to 0ish, but until then its last known measurement is ~14
seconds, which I don't think is technically wrong.

If I understand what you're saying, "14 secs" would not be seen as the
correct answer by our users when the delay is now zero.

Solving that is where the keepalives need to come into play. If no new
WAL, send a keepalive and track the lag on that.

Hmm. Currently it works strictly with measurements of real WAL write,
flush and apply times. I rather like the simplicity of that
definition of the lag numbers, and the fact that they move only as a
result of genuine measured WAL activity. A keepalive message is never
written, flushed or applied, so if we had special cases here to show
constant 0 or measure keepalive round-trip time when we hit the end of
known WAL or something like that, the reported lag times for those
three operations wouldn't be true. In any real database cluster there
is real WAL being generated all the time, so after a big backlog is
finally processed by a standby the "14 secs" won't linger for very
long, and during the time when you see that, it really is the last
true measured lag time.

I do see why a new user trying this feature for the first time might
expect it to show a lag of 0 just as soon as sent LSN =
write/flush/apply LSN or something like that, but after some
reflection I suspect that it isn't useful information, and it would be
smoke and mirrors rather than real data.

Perhaps I am misunderstanding the way it works.

If the last time WAL was generated the lag was 14 secs, and nothing
occurs for 2 hours afterwards AND all changes have been successfully
applied, then it should not continue to show 14 secs for the next 2
hours.

IMHO the lag time should drop to zero in a reasonable time and stay at
zero for those 2 hours because there is no current lag.

If we want to show historical lag data, I'm supportive of the idea,
but we must report an accurate current value when the system is busy
and when the system is quiet.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#36Craig Ringer
craig@2ndquadrant.com
In reply to: Simon Riggs (#34)
Re: Measuring replay lag

On 5 March 2017 at 15:31, Simon Riggs <simon@2ndquadrant.com> wrote:

On 1 March 2017 at 10:47, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

This seems to be problematic. Logical peers report LSN changes for
all three operations (write, flush, commit) only on commit. I suppose
that might work OK for synchronous replication, but it makes it a bit
difficult to get lag measurements that don't look really strange and
sawtoothy when you have long transactions, and overlapping
transactions might interfere with the measurements in odd ways.

They do, but those sawtoothy measurements are a true reflection of the
aspect of lag that's most important - what the downstream replica has
flushed to disk and made visible.

If we have xacts X1, X2 and X3, which commit in that order, then after
X1 commits the lag is the difference between X1's commit time and
"now". A peer only sends feedback updating flush position for commits.
Once X1 is confirmed replayed by the peer, the flush_location lag is
the difference between X2's later commit time and "now". And so on.

So I'd very much expect a sawtooth lag graph for flush_location,
because that's how logical replication really replays changes. We only
start replaying any xact once it commits on the upstream, and we
replay changes strictly in upstream commit order. It'll rise linearly
then fall vertically in abrupt drops.

sent_location is updated based on the last-decoded WAL position, per
src/backend/replication/walsender.c:2396 or so:

	record = XLogReadRecord(logical_decoding_ctx->reader,
							logical_startptr, &errm);
	logical_startptr = InvalidXLogRecPtr;

	/* xlog record was invalid */
	if (errm != NULL)
		elog(ERROR, "%s", errm);

	if (record != NULL)
	{
		LogicalDecodingProcessRecord(logical_decoding_ctx,
									 logical_decoding_ctx->reader);

		sentPtr = logical_decoding_ctx->reader->EndRecPtr;
	}

so I would expect to see a smoother graph for sent_location based on
the last record processed by the XLogReader.

Though it's also a misleading graph: we haven't really sent that at
all, just decoded it and buffered it in a reorder buffer to send once
we decode a commit. Really, pg_stat_replication isn't quite expressive
enough to cover logical replication due to its reordering behaviour.
We can't really report the actual last LSN sent to the client very
easily, since we call into ReorderBufferCommit() and don't return
until we finish streaming the whole buffered xact; we'd need some kind
of callback to update the walsender with the lsn of the last row we
sent. And if we did this, sent_location would actually go backwards
sometimes, since usually with concurrent xacts the newest row in a
xact committed at time T is newer, with a higher LSN, than the oldest
row in a xact with commit time T+n.

Later I'd like to add support for xact interleaving, where we can
speculatively stream rows from big in-progress xacts to the downstream
before the xact commits, and the downstream has to roll the xact back
if it aborts on the upstream. There are some snapshot management
issues to deal with there (not to mention downstream deadlock
hazards), but maybe we can limit the optimisation to xacts that made
no catalog changes to start with. I'm not at all sure how to report
sent_location then, though, or what a reasonable measure of lag will
look like.

What we want from this patch is something that works for both, as much
as that is possible.

If it shows a sawtooth pattern for flush lag, that's good, because
it's truthful. We can only flush after we replay commit, therefore lag
is always going to be sawtooth, with tooth size approximating xact
size and the baseline lag trend representing any sustained increase or
decrease in lag over time.

This would be extremely valuable to have.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#37Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Craig Ringer (#36)
1 attachment(s)
Re: Measuring replay lag

Hi,

Please see separate replies to Simon and Craig below.

On Sun, Mar 5, 2017 at 8:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 1 March 2017 at 10:47, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

I do see why a new user trying this feature for the first time might
expect it to show a lag of 0 just as soon as sent LSN =
write/flush/apply LSN or something like that, but after some
reflection I suspect that it isn't useful information, and it would be
smoke and mirrors rather than real data.

Perhaps I am misunderstanding the way it works.

If the last time WAL was generated the lag was 14 secs, then nothing
occurs for 2 hours aftwards AND all changes have been successfully
applied then it should not continue to show 14 secs for the next 2
hours.

IMHO the lag time should drop to zero in a reasonable time and stay at
zero for those 2 hours because there is no current lag.

If we want to show historical lag data, I'm supportive of the idea,
but we must report an accurate current value when the system is busy
and when the system is quiet.

Ok, I thought about this for a bit and have a new idea that I hope
will be more acceptable. Here are the approaches considered:

1. Show the last measured lag times on a completely idle system until
such time as the standby eventually processes more WAL, as I had it in
the v5 patch. You don't like that and I admit that it is not really
satisfying (even though I know that real Postgres systems always
generate more WAL fairly soon even without user sessions, it's not
great that it depends on an unknown future event to clear the old
data).

2. Recognise when the last reported write/flush/apply LSN from the
standby == end of WAL on the sending server, and show lag times of
00:00:00 in all three columns. I consider this entirely bogus: it's
not an actual measurement that was ever made, and on an active system
it would flip-flop between real measurements and the artificial
00:00:00. I do not like this.

3. Recognise the end of WAL as above, but show SQL NULL in the
columns. Now we don't show an entirely bogus number like 00:00:00 but
we still have the flickering/flip-flopping between nothing and a
measured number during typical use (ie during short periods of
idleness between writes). I do not like this.

4. Somehow attempt to measure the round trip time for a keep-alive
message or similar during idle periods. This means that we would be
taking something that reflects one component of the usual lag
measurements, namely network transfer, but I think we would be making
false claims when we show that in columns that report measured write
time, flush time and apply time. I do not like this.

5. The new proposal: Show only true measured write/flush/apply data,
as in 1, but with a time limit. To avoid the scenario where we show
the same times during prolonged periods of idleness, clear the numbers
like in option 3 after a period of idleness. This way we avoid the
dreaded flickering/flip-flopping. A natural time to do that is when
wal_receiver_status_interval (default 10 seconds) expires on an idle
system.

Done using approach 5 in the attached version. Do you think this is a
good compromise? No bogus numbers, only true measured
write/flush/apply times, but a time limit on 'stale' lag information.
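
To make the behaviour concrete: while WAL is being processed, a
monitoring query like

	select application_name, write_lag, flush_lag, replay_lag
	from pg_stat_replication;

shows true measured intervals, and once the system has been idle for
longer than wal_receiver_status_interval it shows NULL in all three
lag columns until new WAL is generated.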

On Mon, Mar 6, 2017 at 3:22 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 March 2017 at 15:31, Simon Riggs <simon@2ndquadrant.com> wrote:

What we want from this patch is something that works for both, as much
as that is possible.

If it shows a sawtooth pattern for flush lag, that's good, because
it's truthful. We can only flush after we replay commit, therefore lag
is always going to be sawtooth, with tooth size approximating xact
size and the baseline lag trend representing any sustained increase or
decrease in lag over time.

This would be extremely valuable to have.

Thanks for your detailed explanation Craig. (I also had a chat with
Craig about this off-list.) Based on your feedback, I've added
support for reporting lag from logical replication, warts and all.

Just a thought: perhaps logical replication could consider
occasionally reporting a 'write' position based on decoded WAL written
to reorder buffers (rather than just reporting the apply LSN as write
LSN at commit time)? I think that would be interesting information in
its own right, but would also provide more opportunities to
interpolate the flush/apply sawtooth for large transactions.

Please find a new version attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replication-lag-v6.patch (application/octet-stream):
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4d03531..9f96be6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1425,6 +1425,36 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written it (but not yet
+      flushed it or applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_write</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written and flushed it
+      (but not yet applied it).  This can be used to gauge the expected
+      delay that <literal>synchronous_commit</literal> level
+      <literal>remote_flush</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written, flushed and
+      applied it.  This can be used to gauge the expected delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_apply</literal> would incur while committing if this
+      server is configured as a synchronous standby.</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0e5362..f7269c3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11523,6 +11523,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
+	bool		streaming_reply_sent = false;
 
 	/*-------
 	 * Standby mode is implemented by a state machine:
@@ -11846,6 +11847,19 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					}
 
 					/*
+					 * Since we have replayed everything we have received so
+					 * far and are about to start waiting for more WAL, let's
+					 * tell the upstream server our replay location now so
+					 * that pg_stat_replication doesn't show stale
+					 * information.
+					 */
+					if (!streaming_reply_sent)
+					{
+						WalRcvForceReply();
+						streaming_reply_sent = true;
+					}
+
+					/*
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly.
 					 */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0bce209..eff3c07 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -705,6 +705,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dd3a936..dfe1ed8 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -188,6 +188,26 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/* A sample associating a log position with the time it was written. */
+typedef struct
+{
+	XLogRecPtr lsn;
+	TimestampTz time;
+} WalTimeSample;
+
+/* The size of our buffer of time samples. */
+#define LAG_TRACKER_BUFFER_SIZE 8192
+
+/* A mechanism for tracking replication lag. */
+static struct
+{
+	XLogRecPtr last_lsn;
+	WalTimeSample buffer[LAG_TRACKER_BUFFER_SIZE];
+	int write_head;
+	int read_heads[NUM_SYNC_REP_WAIT_MODE];
+	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+} LagTracker;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -219,6 +239,8 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
+static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -244,6 +266,9 @@ InitWalSender(void)
 	 */
 	MarkPostmasterChildWalSender();
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+
+	/* Initialize empty timestamp buffer for lag tracking. */
+	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
 /*
@@ -1501,6 +1526,13 @@ ProcessStandbyReplyMessage(void)
 				flushPtr,
 				applyPtr;
 	bool		replyRequested;
+	TimeOffset	writeLag,
+				flushLag,
+				applyLag;
+	bool		clearLagTimes;
+	TimestampTz now;
+
+	static bool	fullyAppliedLastTime = false;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1515,6 +1547,30 @@ ProcessStandbyReplyMessage(void)
 		 (uint32) (applyPtr >> 32), (uint32) applyPtr,
 		 replyRequested ? " (reply requested)" : "");
 
+	/* See if we can compute the round-trip lag for these positions. */
+	now = GetCurrentTimestamp();
+	writeLag = LagTrackerRead(SYNC_REP_WAIT_WRITE, writePtr, now);
+	flushLag = LagTrackerRead(SYNC_REP_WAIT_FLUSH, flushPtr, now);
+	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
+
+	/*
+	 * If the standby reports that it has fully replayed the WAL in two
+	 * consecutive reply messages, then the second such message must result
+	 * from wal_receiver_status_interval expiring on the standby.  This is a
+	 * convenient time to forget the lag times measured when it last
+	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
+	 * until more WAL traffic arrives.
+	 */
+	clearLagTimes = false;
+	if (applyPtr == sentPtr)
+	{
+		if (fullyAppliedLastTime)
+			clearLagTimes = true;
+		fullyAppliedLastTime = true;
+	}
+	else
+		fullyAppliedLastTime = false;
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false);
@@ -1530,6 +1586,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLag != -1 || clearLagTimes)
+			walsnd->writeLag = writeLag;
+		if (flushLag != -1 || clearLagTimes)
+			walsnd->flushLag = flushLag;
+		if (applyLag != -1 || clearLagTimes)
+			walsnd->applyLag = applyLag;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1918,6 +1980,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLag = -1;
+			walsnd->flushLag = -1;
+			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2243,6 +2308,32 @@ XLogSendPhysical(void)
 	}
 
 	/*
+	 * Record the current system time as an approximation of the time at which
+	 * this WAL position was written for the purposes of lag tracking.
+	 *
+	 * In theory we could make XLogFlush() record a time in shmem whenever WAL
+	 * is flushed and we could get that time as well as the LSN when we call
+	 * GetFlushRecPtr() above (and likewise for the cascading standby
+	 * equivalent), but rather than putting any new code into the hot WAL path
+	 * it seems good enough to capture the time here.  We should reach this
+	 * after XLogFlush() runs WalSndWakeupProcessRequests(), and although that
+	 * may take some time, we read the WAL flush pointer and take the time
+	 * very close together here so that we'll get a later position if it
+	 * is still moving.
+	 *
+	 * Because LagTrackerWrite ignores samples when the LSN hasn't advanced,
+	 * this gives us a cheap approximation for the WAL flush time for this
+	 * LSN.
+	 *
+	 * Note that the LSN is not necessarily the LSN for the data contained in
+	 * the present message; it's the end of the WAL, which might be
+	 * further ahead.  All the lag tracking machinery cares about is finding
+	 * out when that arbitrary LSN is eventually reported as written, flushed
+	 * and applied, so that it can measure the elapsed time.
+	 */
+	LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());
+
+	/*
 	 * If this is a historic timeline and we've reached the point where we
 	 * forked to the next timeline, stop streaming.
 	 *
@@ -2396,6 +2487,10 @@ XLogSendLogical(void)
 
 	if (record != NULL)
 	{
+		/* See explanation in XLogSendPhysical. */
+		LagTrackerWrite(logical_decoding_ctx->reader->EndRecPtr,
+						GetCurrentTimestamp());
+
 		LogicalDecodingProcessRecord(logical_decoding_ctx, logical_decoding_ctx->reader);
 
 		sentPtr = logical_decoding_ctx->reader->EndRecPtr;
@@ -2692,6 +2787,17 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+offset_to_interval(TimeOffset offset)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+	result->time = offset;
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2700,7 +2806,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2748,6 +2854,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		TimeOffset	writeLag;
+		TimeOffset	flushLag;
+		TimeOffset	applyLag;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2762,6 +2871,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLag = walsnd->writeLag;
+		flushLag = walsnd->flushLag;
+		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2803,7 +2915,22 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			if (writeLag < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(offset_to_interval(writeLag));
+
+			if (flushLag < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(offset_to_interval(flushLag));
+
+			if (applyLag < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(offset_to_interval(applyLag));
+
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2817,12 +2944,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -2890,3 +3017,139 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 			WalSndShutdown();
 	}
 }
+
+/*
+ * Record the end of the WAL and the time it was flushed locally, so that
+ * LagTrackerRead can compute the elapsed time (lag) when this WAL position is
+ * eventually reported to have been written, flushed and applied by the
+ * standby in a reply message.
+ */
+static void
+LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
+{
+	bool buffer_full;
+	int new_write_head;
+	int i;
+
+	/*
+	 * If the lsn hasn't advanced since last time, then do nothing.  This way
+	 * we only record a new sample when new WAL has been written.
+	 */
+	if (LagTracker.last_lsn == lsn)
+		return;
+	LagTracker.last_lsn = lsn;
+
+	/*
+	 * If advancing the write head of the circular buffer would crash into any
+	 * of the read heads, then the buffer is full.  In other words, the
+	 * slowest reader (presumably apply) is the one that controls the release
+	 * of space.
+	 */
+	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	buffer_full = false;
+	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	{
+		if (new_write_head == LagTracker.read_heads[i])
+			buffer_full = true;
+	}
+
+	/*
+	 * If the buffer is full, for now we just rewind by one slot and overwrite
+	 * the last sample, as a simple (if somewhat uneven) way to lower the
+	 * sampling rate.  There may be better adaptive compaction algorithms.
+	 */
+	if (buffer_full)
+	{
+		new_write_head = LagTracker.write_head;
+		if (LagTracker.write_head > 0)
+			LagTracker.write_head--;
+		else
+			LagTracker.write_head = LAG_TRACKER_BUFFER_SIZE - 1;
+	}
+
+	/* Store a sample at the current write head position. */
+	LagTracker.buffer[LagTracker.write_head].lsn = lsn;
+	LagTracker.buffer[LagTracker.write_head].time = local_flush_time;
+	LagTracker.write_head = new_write_head;
+}
+
+/*
+ * Find out how much time has elapsed between the moment WAL position 'lsn'
+ * (or the highest known earlier LSN) was flushed locally and the time 'now'.
+ * We have a separate read head for each of the reported LSN locations we
+ * receive in replies from standby; 'head' controls which read head is
+ * used.  Whenever a read head crosses an LSN which was written into the
+ * lag buffer with LagTrackerWrite, we can use the associated timestamp to
+ * find out the time this LSN (or an earlier one) was flushed locally, and
+ * therefore compute the lag.
+ *
+ * Return -1 if no new sample data is available, and otherwise the elapsed
+ * time in microseconds.
+ */
+static TimeOffset
+LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
+{
+	TimestampTz time = 0;
+
+	/* Read all unread samples up to this LSN or end of buffer. */
+	while (LagTracker.read_heads[head] != LagTracker.write_head &&
+		   LagTracker.buffer[LagTracker.read_heads[head]].lsn <= lsn)
+	{
+		time = LagTracker.buffer[LagTracker.read_heads[head]].time;
+		LagTracker.last_read[head] =
+			LagTracker.buffer[LagTracker.read_heads[head]];
+		LagTracker.read_heads[head] =
+			(LagTracker.read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	if (time > now)
+	{
+		/* If the clock somehow went backwards, treat as not found. */
+		return -1;
+	}
+	else if (time == 0)
+	{
+		/*
+		 * We didn't cross a time.  If there is a future sample that we
+		 * haven't reached yet, and we've already reached at least one sample,
+		 * let's interpolate the local flushed time.  This is mainly useful for
+		 * reporting a completely stuck apply position as having increasing
+		 * lag, since otherwise we'd have to wait for it to eventually start
+		 * moving again and cross one of our samples before we can show the
+		 * lag increasing.
+		 */
+		if (LagTracker.read_heads[head] != LagTracker.write_head &&
+			LagTracker.last_read[head].time != 0)
+		{
+			double fraction;
+			WalTimeSample prev = LagTracker.last_read[head];
+			WalTimeSample next = LagTracker.buffer[LagTracker.read_heads[head]];
+
+			Assert(lsn >= prev.lsn);
+			Assert(prev.lsn < next.lsn);
+
+			if (prev.time > next.time)
+			{
+				/* If the clock somehow went backwards, treat as not found. */
+				return -1;
+			}
+
+			/* See how far we are between the previous and next samples. */
+			fraction =
+				(double) (lsn - prev.lsn) / (double) (next.lsn - prev.lsn);
+
+			/* Scale the local flush time proportionally. */
+			time = (TimestampTz)
+				((double) prev.time + (next.time - prev.time) * fraction);
+		}
+		else
+		{
+			/* Couldn't interpolate due to lack of data. */
+			return -1;
+		}
+	}
+
+	/* Return the elapsed time since local flush time in microseconds. */
+	Assert(time != 0);
+	return now - time;
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ec4aedb..af5b50d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2772,7 +2772,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5e6ccfc..2c59056 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -47,6 +47,11 @@ typedef struct WalSnd
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/* Measured lag times, or -1 for unknown/none. */
+	TimeOffset	writeLag;
+	TimeOffset	flushLag;
+	TimeOffset	applyLag;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index bd13ae6..55b5ca7 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1831,10 +1831,13 @@ pg_stat_replication| SELECT s.pid,
     w.write_location,
     w.flush_location,
     w.replay_location,
+    w.write_lag,
+    w.flush_lag,
+    w.replay_lag,
     w.sync_priority,
     w.sync_state
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
#38Ian Barwick
ian.barwick@2ndquadrant.com
In reply to: Thomas Munro (#37)
Re: Measuring replay lag

Hi

Just adding a couple of thoughts on this.

On 03/14/2017 08:39 AM, Thomas Munro wrote:

Hi,

Please see separate replies to Simon and Craig below.

On Sun, Mar 5, 2017 at 8:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 1 March 2017 at 10:47, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

I do see why a new user trying this feature for the first time might
expect it to show a lag of 0 just as soon as sent LSN =
write/flush/apply LSN or something like that, but after some
reflection I suspect that it isn't useful information, and it would be
smoke and mirrors rather than real data.

Perhaps I am misunderstanding the way it works.

If the last time WAL was generated the lag was 14 secs, then nothing
occurs for 2 hours afterwards AND all changes have been successfully
applied then it should not continue to show 14 secs for the next 2
hours.

IMHO the lag time should drop to zero in a reasonable time and stay at
zero for those 2 hours because there is no current lag.

If we want to show historical lag data, I'm supportive of the idea,
but we must report an accurate current value when the system is busy
and when the system is quiet.

Ok, I thought about this for a bit and have a new idea that I hope
will be more acceptable. Here are the approaches considered:

(...)

2. Recognise when the last reported write/flush/apply LSN from the
standby == end of WAL on the sending server, and show lag times of
00:00:00 in all three columns. I consider this entirely bogus: it's
not an actual measurement that was ever made, and on an active system
it would flip-flop between real measurements and the artificial
00:00:00. I do not like this.

I agree with this; while initially I was expecting to see 00:00:00,
SQL NULL is definitely correct here. Anyone writing tools etc. which
need to report an actual interval can convert this to 00:00:00 easily
enough, as sketched below.
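
For example (purely illustrative; coalesce maps the NULL back to a
zero interval):

  SELECT application_name,
         coalesce(replay_lag, '00:00:00'::interval) AS replay_lag
  FROM pg_stat_replication;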

(...)

5. The new proposal: Show only true measured write/flush/apply data,
as in 1, but with a time limit. To avoid the scenario where we show
the same times during prolonged periods of idleness, clear the numbers
like in option 3 after a period of idleness. This way we avoid the
dreaded flickering/flip-flopping. A natural time to do that is when
wal_receiver_status_interval, which defaults to 10 seconds, expires on
an idle system.

Done using approach 5 in the attached version. Do you think this is a
good compromise? No bogus numbers, only true measured
write/flush/apply times, but a time limit on 'stale' lag information.

This makes sense to me. I'd also add that while on production servers
it's likely there'll be enough activity to keep the columns updated,
on quiescent test/development systems seeing a stale value looks plain
wrong (and will cause no end of questions from people asking why lag
is still showing when their system isn't doing anything).

I suggest the documentation of these columns needs to be extended to mention
that they will be NULL if no lag was measured recently, and to explain
the circumstances in which the numbers are cleared.

Regards

Ian Barwick

--
Ian Barwick http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#39Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#37)
Re: Measuring replay lag

On 14 March 2017 at 07:39, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Mon, Mar 6, 2017 at 3:22 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

On 5 March 2017 at 15:31, Simon Riggs <simon@2ndquadrant.com> wrote:

What we want from this patch is something that works for both, as much
as that is possible.

If it shows a sawtooth pattern for flush lag, that's good, because
it's truthful. We can only flush after we replay commit, therefore lag
is always going to be sawtooth, with tooth size approximating xact
size and the baseline lag trend representing any sustained increase or
decrease in lag over time.

This would be extremely valuable to have.

Thanks for your detailed explanation Craig. (I also had a chat with
Craig about this off-list.) Based on your feedback, I've added
support for reporting lag from logical replication, warts and all.

Just a thought: perhaps logical replication could consider
occasionally reporting a 'write' position based on decoded WAL written
to reorder buffers (rather than just reporting the apply LSN as write
LSN at commit time)? I think that would be interesting information in
its own right, but would also provide more opportunities to
interpolate the flush/apply sawtooth for large transactions.

Please find a new version attached.

My summary is that with logical replication the values only change at commit time.
With a stream of small transactions there shouldn't be any noticeable
sawtooth.

Please put in a substantive comment, rather than just "See explanation
in XLogSendPhysical" cos that's clearly not enough. Please write docs
so I can commit this.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#40Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#37)
Re: Measuring replay lag

On 14 March 2017 at 07:39, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Hi,

Please see separate replies to Simon and Craig below.

On Sun, Mar 5, 2017 at 8:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 1 March 2017 at 10:47, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

I do see why a new user trying this feature for the first time might
expect it to show a lag of 0 just as soon as sent LSN =
write/flush/apply LSN or something like that, but after some
reflection I suspect that it isn't useful information, and it would be
smoke and mirrors rather than real data.

Perhaps I am misunderstanding the way it works.

If the last time WAL was generated the lag was 14 secs, then nothing
occurs for 2 hours afterwards AND all changes have been successfully
applied then it should not continue to show 14 secs for the next 2
hours.

IMHO the lag time should drop to zero in a reasonable time and stay at
zero for those 2 hours because there is no current lag.

If we want to show historical lag data, I'm supportive of the idea,
but we must report an accurate current value when the system is busy
and when the system is quiet.

Ok, I thought about this for a bit and have a new idea that I hope
will be more acceptable. Here are the approaches considered:

1. Show the last measured lag times on a completely idle system until
such time as the standby eventually processes more lag, as I had it in
the v5 patch. You don't like that and I admit that it is not really
satisfying (even though I know that real Postgres systems always
generate more WAL fairly soon even without user sessions, it's not
great that it depends on an unknown future event to clear the old
data).

2. Recognise when the last reported write/flush/apply LSN from the
standby == end of WAL on the sending server, and show lag times of
00:00:00 in all three columns. I consider this entirely bogus: it's
not an actual measurement that was ever made, and on an active system
it would flip-flop between real measurements and the artificial
00:00:00. I do not like this.

There are two ways of knowing the lag: 1) by measurement/sampling,
which is the main way this patch approaches this, and 2) by directly
observing that the LSNs match. Both are equally valid ways of
establishing knowledge. Strangely, (2) is the only one of those that
is actually precise, and yet you say it is bogus. It is actually the
measurements which are approximations of the actual state.

The reality is that the lag can change discontinuously between zero
and non-zero. I don't think we should hide that from people.

I suspect that your "entirely bogus" feeling comes from the fact that
we actually have 3 states, one of which has unknown lag.

A) "Currently caught-up"
WALSender LSN == WALReceiver LSN (info type (1))
At this point the current lag is known precisely to be zero.

B) "Work outstanding, no reply yet"
Immediately after, where WALSenderLSN > WALReceiverLSN, yet we haven't
yet received a new reply.
We expect to stay in this state for however long it takes to receive a
reply, which could be wal_receiver_status_interval or longer if the
lag is greater. At this point we have no measurement of what the lag
is. We could report NULL since we don't know. We could report the
last measured lag from when we were last in state C, but if the new
reply was delayed for longer than that we'd need to report that the
lag is at least as high as the delay since the last time we left state A.

C) "Continuous flow"
WALSenderLSN > WALReceiverLSN and we have received a reply
(measurement, info type (2))
This is the main case. Easy-ish!

So I think we need to first agree that A and B states exist and how to
report lag in each state.
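
For illustration only (this is a sketch, not part of the patch, and
the B/C split is approximate since the view doesn't directly expose
whether a reply has arrived), the three states could be classified
from the view roughly like this:

  SELECT application_name,
         CASE
           WHEN sent_location = replay_location THEN 'A: caught up'
           WHEN replay_lag IS NULL THEN 'B: no measurement yet'
           ELSE 'C: continuous flow'
         END AS lag_state
  FROM pg_stat_replication;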

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#40)
Re: Measuring replay lag

On Thu, Mar 16, 2017 at 12:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

There are two ways of knowing the lag: 1) by measurement/sampling,
which is the main way this patch approaches this, and 2) by directly
observing that the LSNs match. Both are equally valid ways of
establishing knowledge. Strangely, (2) is the only one of those that
is actually precise, and yet you say it is bogus. It is actually the
measurements which are approximations of the actual state.

The reality is that the lag can change discontinuously between zero
and non-zero. I don't think we should hide that from people.

I suspect that your "entirely bogus" feeling comes from the fact that
we actually have 3 states, one of which has unknown lag.

A) "Currently caught-up"
WALSender LSN == WALReceiver LSN (info type (1))
At this point the current lag is known precisely to be zero.

B) "Work outstanding, no reply yet"
Immediately after, where WALSenderLSN > WALReceiverLSN, yet we haven't
yet received a new reply.
We expect to stay in this state for however long it takes to receive a
reply, which could be wal_receiver_status_interval or longer if the
lag is greater. At this point we have no measurement of what the lag
is. We could report NULL since we don't know. We could report the
last measured lag from when we were last in state C, but if the new
reply was delayed for longer than that we'd need to report that the
lag is at least as high as the delay since the last time we left state A.

C) "Continuous flow"
WALSenderLSN > WALReceiverLSN and we have received a reply
(measurement, info type (2))
This is the main case. Easy-ish!

So I think we need to first agree that A and B states exist and how to
report lag in each state.

I agree that these states exist, but we disagree on what 'lag' really
means, or, rather, which of several plausible definitions would be the
most useful here.

My proposal is that the *_lag columns should always report how long it
took for recently written, flushed and applied WAL to be written,
flushed and applied (and for the primary to know about it). By this
definition, sent LSN = applied LSN is not a special case: we simply
report how long that LSN took to be written, flushed and applied.

Your proposal is that the *_lag columns should report how far in the
past the standby is at each of the three stages with respect to the
current end of WAL. By this definition when sent LSN = applied LSN we
are currently in the 'A' state meaning 'caught up' and should show
00:00:00.

Here are two reasons I prefer my definition:

* you can trivially convert from my definition to yours on the basis
of existing information: CASE WHEN sent_location = replay_location
THEN '00:00:00'::interval ELSE replay_lag END (spelled out as a full
query below), but there is no way to get from your definition to mine

* lag numbers reported using my definition tell you how long each of
the synchronous replication levels takes, but with your definition
they only do that if you catch them during times when they aren't
showing the special case 00:00:00; a fast standby running any workload
other than a benchmark is often going to show all-caught-up 00:00:00,
so the new columns will be useless for that purpose
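
Spelled out as a complete query (illustrative only):

  SELECT application_name,
         CASE WHEN sent_location = replay_location
              THEN '00:00:00'::interval
              ELSE replay_lag
         END AS replay_lag
  FROM pg_stat_replication;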

--
Thomas Munro
http://www.enterprisedb.com


#42Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#41)
Re: Measuring replay lag

On 16 March 2017 at 08:02, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

I agree that these states exist, but we disagree on what 'lag' really
means, or, rather, which of several plausible definitions would be the
most useful here.

My proposal is that the *_lag columns should always report how long it
took for recently written, flushed and applied WAL to be written,
flushed and applied (and for the primary to know about it). By this
definition, sent LSN = applied LSN is not a special case: we simply
report how long that LSN took to be written, flushed and applied.

Your proposal is that the *_lag columns should report how far in the
past the standby is at each of the three stages with respect to the
current end of WAL. By this definition when sent LSN = applied LSN we
are currently in the 'A' state meaning 'caught up' and should show
00:00:00.

I accept your proposal for how we handle these, on condition that you
write up some docs that explain the subtle difference between the two,
so we can just show people the URL. That needs to explain clearly the
difference in an impartial way between "what is the most recent lag
measurement" and "how long until we are caught up" as possible
interpretations of these values. Thanks.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#43David Steele
david@pgmasters.net
In reply to: Simon Riggs (#42)
Re: Measuring replay lag

Hi Thomas,

On 3/15/17 8:38 PM, Simon Riggs wrote:

On 16 March 2017 at 08:02, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

I agree that these states exist, but we disagree on what 'lag' really
means, or, rather, which of several plausible definitions would be the
most useful here.

My proposal is that the *_lag columns should always report how long it
took for recently written, flushed and applied WAL to be written,
flushed and applied (and for the primary to know about it). By this
definition, sent LSN = applied LSN is not a special case: we simply
report how long that LSN took to be written, flushed and applied.

Your proposal is that the *_lag columns should report how far in the
past the standby is at each of the three stages with respect to the
current end of WAL. By this definition when sent LSN = applied LSN we
are currently in the 'A' state meaning 'caught up' and should show
00:00:00.

I accept your proposal for how we handle these, on condition that you
write up some docs that explain the subtle difference between the two,
so we can just show people the URL. That needs to explain clearly the
difference in an impartial way between "what is the most recent lag
measurement" and "how long until we are caught up" as possible
interpretations of these values. Thanks.

This thread has been idle for six days. Please respond and/or post a
new patch by 2017-03-24 00:00 AoE (UTC-12) or this submission will be
marked "Returned with Feedback".

Thanks,
--
-David
david@pgmasters.net


#44Simon Riggs
simon@2ndquadrant.com
In reply to: David Steele (#43)
Re: Measuring replay lag

On 21 March 2017 at 17:32, David Steele <david@pgmasters.net> wrote:

Hi Thomas,

On 3/15/17 8:38 PM, Simon Riggs wrote:

On 16 March 2017 at 08:02, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

I agree that these states exist, but we disagree on what 'lag' really
means, or, rather, which of several plausible definitions would be the
most useful here.

My proposal is that the *_lag columns should always report how long it
took for recently written, flushed and applied WAL to be written,
flushed and applied (and for the primary to know about it). By this
definition, sent LSN = applied LSN is not a special case: we simply
report how long that LSN took to be written, flushed and applied.

Your proposal is that the *_lag columns should report how far in the
past the standby is at each of the three stages with respect to the
current end of WAL. By this definition when sent LSN = applied LSN we
are currently in the 'A' state meaning 'caught up' and should show
00:00:00.

I accept your proposal for how we handle these, on condition that you
write up some docs that explain the subtle difference between the two,
so we can just show people the URL. That needs to explain clearly the
difference in an impartial way between "what is the most recent lag
measurement" and "how long until we are caught up" as possible
interpretations of these values. Thanks.

This thread has been idle for six days. Please respond and/or post a new
patch by 2017-03-24 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".

Thomas, are you even working on another version? You've not replied to
me or David, so it's difficult to know what's next.

Not sure whether this is a 6 day lag, or whether we should show NULL
because we are up to date.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#45Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#44)
Re: Measuring replay lag

On Wed, Mar 22, 2017 at 11:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I accept your proposal for how we handle these, on condition that you
write up some docs that explain the subtle difference between the two,
so we can just show people the URL. That needs to explain clearly the
difference in an impartial way between "what is the most recent lag
measurement" and "how long until we are caught up" as possible
interpretations of these values. Thanks.

This thread has been idle for six days. Please respond and/or post a new
patch by 2017-03-24 00:00 AoE (UTC-12) or this submission will be marked
"Returned with Feedback".

Thomas, are you even working on another version? You've not replied to
me or David, so it's difficult to know what's next.

Not sure whether this is a 6 day lag, or whether we should show NULL
because we are up to date.

Hah. Apologies for the delay -- I will post a patch with
documentation as requested within 24 hours.

--
Thomas Munro
http://www.enterprisedb.com


#46Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#45)
Re: Measuring replay lag

On 22 March 2017 at 11:03, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Hah. Apologies for the delay -- I will post a patch with
documentation as requested within 24 hours.

Thanks very much. I'll reserve time to commit it tomorrow, all else being good.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#47Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#44)
Re: Measuring replay lag

On Wed, Mar 22, 2017 at 6:57 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Not sure whether this is a 6 day lag, or whether we should show NULL
because we are up to date.

OK, that made me laugh.

Thanks for putting in the effort on this patch, BTW.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#48Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#46)
1 attachment(s)
Re: Measuring replay lag

On Thu, Mar 23, 2017 at 12:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 22 March 2017 at 11:03, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Hah. Apologies for the delay -- I will post a patch with
documentation as requested within 24 hours.

Thanks very much. I'll reserve time to commit it tomorrow, all else being good.

Thanks! Please find attached v7, which includes a note we can point
at when someone asks why it doesn't show 00:00:00, as requested.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

replication-lag-v7.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dcb2d33..d425037 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1696,6 +1696,36 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written it (but not yet
+      flushed it or applied it).  This can be used to gauge the delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_write</literal> incurred while committing if this
+      server was configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written and flushed it
+      (but not yet applied it).  This can be used to gauge the delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_flush</literal> incurred while committing if this
+      server was configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written, flushed and
+      applied it.  This can be used to gauge the delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_apply</literal> incurred while committing if this
+      server was configured as a synchronous standby.</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
@@ -1745,6 +1775,38 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    listed; no information is available about downstream standby servers.
   </para>
 
+  <para>
+   The 'lag' times reported in the <structname>pg_stat_replication</structname>
+   view are measurements of the time taken for recent WAL to be written,
+   flushed and replayed and for the sender to know about it.  These times
+   represent the commit delay that was (or would have been) introduced by each
+   synchronous commit level, if the remote server was configured as a
+   synchronous standby.  For an asynchronous standby, the
+   <structfield>replay_lag</structfield> column approximates the delay
+   before recent transactions became visible to queries.  If the standby
+   server has entirely caught up with the sending server and there is no more
+   WAL activity, the most recently measured lag times will continue to be
+   displayed for a short time and then show NULL.
+  </para>
+
+  <note>
+   <para>
+    The reported lag times are not predictions of how long it will take for
+    the standby to catch up with the sending server assuming the current
+    rate of replay.  Such a system would show similar times while new WAL is
+    being generated, but would differ when the sender becomes idle.  In
+    particular, when the standby has caught up completely, 
+    <structname>pg_stat_replication</structname> shows the time taken to
+    write, flush and replay the most recent reported WAL position rather than
+    zero as some users might expect.  This is consistent with the goal of
+    measuring synchronous commit and transaction visibility delays for
+    recent write transactions.
+    To reduce confusion for users expecting a different model of lag, the
+    lag columns revert to NULL after a short time on a fully replayed idle
+    system.
+   </para>
+  </note>
+
   <table id="pg-stat-wal-receiver-view" xreflabel="pg_stat_wal_receiver">
    <title><structname>pg_stat_wal_receiver</structname> View</title>
    <tgroup cols="3">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9480377..92b2972 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11554,6 +11554,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
+	bool		streaming_reply_sent = false;
 
 	/*-------
 	 * Standby mode is implemented by a state machine:
@@ -11877,6 +11878,19 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					}
 
 					/*
+					 * Since we have replayed everything we have received so
+					 * far and are about to start waiting for more WAL, let's
+					 * tell the upstream server our replay location now so
+					 * that pg_stat_replication doesn't show stale
+					 * information.
+					 */
+					if (!streaming_reply_sent)
+					{
+						WalRcvForceReply();
+						streaming_reply_sent = true;
+					}
+
+					/*
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly.
 					 */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b6552da..c109ae8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -705,6 +705,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7561770..e7f19bc 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -190,6 +190,26 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/* A sample associating a log position with the time it was written. */
+typedef struct
+{
+	XLogRecPtr lsn;
+	TimestampTz time;
+} WalTimeSample;
+
+/* The size of our buffer of time samples. */
+#define LAG_TRACKER_BUFFER_SIZE 8192
+
+/* A mechanism for tracking replication lag. */
+static struct
+{
+	XLogRecPtr last_lsn;
+	WalTimeSample buffer[LAG_TRACKER_BUFFER_SIZE];
+	int write_head;
+	int read_heads[NUM_SYNC_REP_WAIT_MODE];
+	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+} LagTracker;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -221,6 +241,8 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
+static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -246,6 +268,9 @@ InitWalSender(void)
 	 */
 	MarkPostmasterChildWalSender();
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+
+	/* Initialize empty timestamp buffer for lag tracking. */
+	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
 /*
@@ -1556,6 +1581,13 @@ ProcessStandbyReplyMessage(void)
 				flushPtr,
 				applyPtr;
 	bool		replyRequested;
+	TimeOffset	writeLag,
+				flushLag,
+				applyLag;
+	bool		clearLagTimes;
+	TimestampTz now;
+
+	static bool	fullyAppliedLastTime = false;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1570,6 +1602,30 @@ ProcessStandbyReplyMessage(void)
 		 (uint32) (applyPtr >> 32), (uint32) applyPtr,
 		 replyRequested ? " (reply requested)" : "");
 
+	/* See if we can compute the round-trip lag for these positions. */
+	now = GetCurrentTimestamp();
+	writeLag = LagTrackerRead(SYNC_REP_WAIT_WRITE, writePtr, now);
+	flushLag = LagTrackerRead(SYNC_REP_WAIT_FLUSH, flushPtr, now);
+	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
+
+	/*
+	 * If the standby reports that it has fully replayed the WAL in two
+	 * consecutive reply messages, then the second such message must result
+	 * from wal_receiver_status_interval expiring on the standby.  This is a
+	 * convenient time to forget the lag times measured when it last
+	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
+	 * until more WAL traffic arrives.
+	 */
+	clearLagTimes = false;
+	if (applyPtr == sentPtr)
+	{
+		if (fullyAppliedLastTime)
+			clearLagTimes = true;
+		fullyAppliedLastTime = true;
+	}
+	else
+		fullyAppliedLastTime = false;
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false);
@@ -1585,6 +1641,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLag != -1 || clearLagTimes)
+			walsnd->writeLag = writeLag;
+		if (flushLag != -1 || clearLagTimes)
+			walsnd->flushLag = flushLag;
+		if (applyLag != -1 || clearLagTimes)
+			walsnd->applyLag = applyLag;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1973,6 +2035,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLag = -1;
+			walsnd->flushLag = -1;
+			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2300,6 +2365,32 @@ XLogSendPhysical(void)
 	}
 
 	/*
+	 * Record the current system time as an approximation of the time at which
+	 * this WAL position was written for the purposes of lag tracking.
+	 *
+	 * In theory we could make XLogFlush() record a time in shmem whenever WAL
+	 * is flushed and we could get that time as well as the LSN when we call
+	 * GetFlushRecPtr() above (and likewise for the cascading standby
+	 * equivalent), but rather than putting any new code into the hot WAL path
+	 * it seems good enough to capture the time here.  We should reach this
+	 * after XLogFlush() runs WalSndWakeupProcessRequests(), and although that
+	 * may take some time, we read the WAL flush pointer and take the time
+	 * very close together here so that we'll get a later position if it
+	 * is still moving.
+	 *
+	 * Because LagTrackerWrite ignores samples when the LSN hasn't advanced,
+	 * this gives us a cheap approximation for the WAL flush time for this
+	 * LSN.
+	 *
+	 * Note that the LSN is not necessarily the LSN for the data contained in
+	 * the present message; it's the end of the WAL, which might be
+	 * further ahead.  All the lag tracking machinery cares about is finding
+	 * out when that arbitrary LSN is eventually reported as written, flushed
+	 * and applied, so that it can measure the elapsed time.
+	 */
+	LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());
+
+	/*
 	 * If this is a historic timeline and we've reached the point where we
 	 * forked to the next timeline, stop streaming.
 	 *
@@ -2453,6 +2544,10 @@ XLogSendLogical(void)
 
 	if (record != NULL)
 	{
+		/* See explanation in XLogSendPhysical. */
+		LagTrackerWrite(logical_decoding_ctx->reader->EndRecPtr,
+						GetCurrentTimestamp());
+
 		LogicalDecodingProcessRecord(logical_decoding_ctx, logical_decoding_ctx->reader);
 
 		sentPtr = logical_decoding_ctx->reader->EndRecPtr;
@@ -2749,6 +2844,17 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+offset_to_interval(TimeOffset offset)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+	result->time = offset;
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2757,7 +2863,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2805,6 +2911,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		TimeOffset	writeLag;
+		TimeOffset	flushLag;
+		TimeOffset	applyLag;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2819,6 +2928,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLag = walsnd->writeLag;
+		flushLag = walsnd->flushLag;
+		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2860,7 +2972,22 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			if (writeLag < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(offset_to_interval(writeLag));
+
+			if (flushLag < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(offset_to_interval(flushLag));
+
+			if (applyLag < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(offset_to_interval(applyLag));
+
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2874,12 +3001,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -2947,3 +3074,139 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 			WalSndShutdown();
 	}
 }
+
+/*
+ * Record the end of the WAL and the time it was flushed locally, so that
+ * LagTrackerRead can compute the elapsed time (lag) when this WAL position is
+ * eventually reported to have been written, flushed and applied by the
+ * standby in a reply message.
+ */
+static void
+LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
+{
+	bool buffer_full;
+	int new_write_head;
+	int i;
+
+	/*
+	 * If the lsn hasn't advanced since last time, then do nothing.  This way
+	 * we only record a new sample when new WAL has been written.
+	 */
+	if (LagTracker.last_lsn == lsn)
+		return;
+	LagTracker.last_lsn = lsn;
+
+	/*
+	 * If advancing the write head of the circular buffer would crash into any
+	 * of the read heads, then the buffer is full.  In other words, the
+	 * slowest reader (presumably apply) is the one that controls the release
+	 * of space.
+	 */
+	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	buffer_full = false;
+	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	{
+		if (new_write_head == LagTracker.read_heads[i])
+			buffer_full = true;
+	}
+
+	/*
+	 * If the buffer is full, for now we just rewind by one slot and overwrite
+	 * the last sample, as a simple (if somewhat uneven) way to lower the
+	 * sampling rate.  There may be better adaptive compaction algorithms.
+	 */
+	if (buffer_full)
+	{
+		new_write_head = LagTracker.write_head;
+		if (LagTracker.write_head > 0)
+			LagTracker.write_head--;
+		else
+			LagTracker.write_head = LAG_TRACKER_BUFFER_SIZE - 1;
+	}
+
+	/* Store a sample at the current write head position. */
+	LagTracker.buffer[LagTracker.write_head].lsn = lsn;
+	LagTracker.buffer[LagTracker.write_head].time = local_flush_time;
+	LagTracker.write_head = new_write_head;
+}
+
+/*
+ * Find out how much time has elapsed between the moment WAL position 'lsn'
+ * (or the highest known earlier LSN) was flushed locally and the time 'now'.
+ * We have a separate read head for each of the reported LSN locations we
+ * receive in replies from standby; 'head' controls which read head is
+ * used.  Whenever a read head crosses an LSN which was written into the
+ * lag buffer with LagTrackerWrite, we can use the associated timestamp to
+ * find out the time this LSN (or an earlier one) was flushed locally, and
+ * therefore compute the lag.
+ *
+ * Return -1 if no new sample data is available, and otherwise the elapsed
+ * time in microseconds.
+ */
+static TimeOffset
+LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
+{
+	TimestampTz time = 0;
+
+	/* Read all unread samples up to this LSN or end of buffer. */
+	while (LagTracker.read_heads[head] != LagTracker.write_head &&
+		   LagTracker.buffer[LagTracker.read_heads[head]].lsn <= lsn)
+	{
+		time = LagTracker.buffer[LagTracker.read_heads[head]].time;
+		LagTracker.last_read[head] =
+			LagTracker.buffer[LagTracker.read_heads[head]];
+		LagTracker.read_heads[head] =
+			(LagTracker.read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	if (time > now)
+	{
+		/* If the clock somehow went backwards, treat as not found. */
+		return -1;
+	}
+	else if (time == 0)
+	{
+		/*
+		 * We didn't cross a time.  If there is a future sample that we
+		 * haven't reached yet, and we've already reached at least one sample,
+		 * let's interpolate the local flush time.  This is mainly useful for
+		 * reporting a completely stuck apply position as having increasing
+		 * lag, since otherwise we'd have to wait for it to eventually start
+		 * moving again and cross one of our samples before we can show the
+		 * lag increasing.
+		 */
+		if (LagTracker.read_heads[head] != LagTracker.write_head &&
+			LagTracker.last_read[head].time != 0)
+		{
+			double fraction;
+			WalTimeSample prev = LagTracker.last_read[head];
+			WalTimeSample next = LagTracker.buffer[LagTracker.read_heads[head]];
+
+			Assert(lsn >= prev.lsn);
+			Assert(prev.lsn < next.lsn);
+
+			if (prev.time > next.time)
+			{
+				/* If the clock somehow went backwards, treat as not found. */
+				return -1;
+			}
+
+			/* See how far we are between the previous and next samples. */
+			fraction =
+				(double) (lsn - prev.lsn) / (double) (next.lsn - prev.lsn);
+
+			/* Scale the local flush time proportionally. */
+			time = (TimestampTz)
+				((double) prev.time + (next.time - prev.time) * fraction);
+		}
+		else
+		{
+			/* Couldn't interpolate due to lack of data. */
+			return -1;
+		}
+	}
+
+	/* Return the elapsed time since local flush time in microseconds. */
+	Assert(time != 0);
+	return now - time;
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 836d6ff..2b9a3c6 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2801,7 +2801,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5e6ccfc..2c59056 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -47,6 +47,11 @@ typedef struct WalSnd
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/* Measured lag times, or -1 for unknown/none. */
+	TimeOffset	writeLag;
+	TimeOffset	flushLag;
+	TimeOffset	applyLag;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index bd13ae6..55b5ca7 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1831,10 +1831,13 @@ pg_stat_replication| SELECT s.pid,
     w.write_location,
     w.flush_location,
     w.replay_location,
+    w.write_lag,
+    w.flush_lag,
+    w.replay_lag,
     w.sync_priority,
     w.sync_state
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
#49Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Ian Barwick (#38)
Re: Measuring replay lag

On Wed, Mar 15, 2017 at 8:15 PM, Ian Barwick
<ian.barwick@2ndquadrant.com> wrote:

2. Recognise when the last reported write/flush/apply LSN from the
standby == end of WAL on the sending server, and show lag times of
00:00:00 in all three columns. I consider this entirely bogus: it's
not an actual measurement that was ever made, and on an active system
it would flip-flop between real measurements and the artificial
00:00:00. I do not like this.

I agree with this; while initially I was expecting to see 00:00:00,
SQL NULL is definitely correct here. Anyone writing tools etc. that
need to report an actual interval can convert the NULL to 00:00:00
easily enough.

Right.

Another point here is that if someone really wants to see "estimated
time until caught up to current end of WAL" where 00:00:00 makes sense
when fully replayed, then it is already possible to compute that using
information that is published in pg_stat_replication in 9.6.

An external tool or a plpgsql function could do something like:
observe replay_location twice with a sleep in between to estimate the
current rate of replay, then divide the current distance between
replay location and end-of-WAL by the replay rate to estimate the time
of arrival. I think the result would behave a bit like the infamous
Windows file transfer dialogue ("3 seconds, no, 7 months, no, 4
seconds, no, INFINITY, oh wait, 0 seconds, you have arrived at your
destination!") due to the lumpiness of replay, though perhaps that
could be corrected by applying the right kind of smoothing to the
rate. I thought about something like that, but figured it would be
better to stick to measuring facts about the past rather than making
predictions about the future. On top of that, there is a more serious
problem for the syncrep delay measurement use case, where I started
this journey: many systems would show zero whenever you query them,
because they often catch up in between writes, even though sync rep
is not free.
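
For the record, here is a minimal sketch of that estimation approach
as a small libpq program, assuming a 9.6 primary; the connection
string and the standby name 'replica1' are made up, exactly one
matching row in pg_stat_replication is assumed, and no smoothing is
attempted:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libpq-fe.h>

#define SAMPLE_GAP_SECS 10

/* Fetch replayed bytes and bytes remaining for the chosen standby. */
static int
sample_standby(PGconn *conn, double *replayed, double *remaining)
{
	PGresult *res = PQexec(conn,
		"SELECT pg_xlog_location_diff(replay_location, '0/0'), "
		"pg_xlog_location_diff(pg_current_xlog_location(), replay_location) "
		"FROM pg_stat_replication WHERE application_name = 'replica1'");

	if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) != 1)
	{
		PQclear(res);
		return -1;
	}
	*replayed = atof(PQgetvalue(res, 0, 0));
	*remaining = atof(PQgetvalue(res, 0, 1));
	PQclear(res);
	return 0;
}

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	double		r1, r2, remaining, rate;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}
	if (sample_standby(conn, &r1, &remaining) != 0)
		return 1;
	sleep(SAMPLE_GAP_SECS);
	if (sample_standby(conn, &r2, &remaining) != 0)
		return 1;
	PQfinish(conn);

	/* Replay rate over the interval, then distance left divided by rate. */
	rate = (r2 - r1) / SAMPLE_GAP_SECS;
	if (rate > 0)
		printf("estimated time of arrival: %.0f seconds\n", remaining / rate);
	else
		printf("no replay progress observed, ETA unknown\n");
	return 0;
}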

5. The new proposal: Show only true measured write/flush/apply data,
as in 1, but with a time limit. To avoid showing the same times during
prolonged periods of idleness, clear the numbers as in option 3. This
way we avoid the dreaded flickering/flip-flopping. A natural time to
do that is when wal_receiver_status_interval (which defaults to 10
seconds) expires on an idle system.

Done using approach 5 in the attached version. Do you think this is a
good compromise? No bogus numbers, only true measured
write/flush/apply times, but a time limit on 'stale' lag information.

This makes sense to me. I'd also add that while on production servers
it's likely there'll be enough activity to keep the columns updated,
on a quiescent test/development system seeing a stale value looks plain
wrong (and will cause no end of questions from people asking why lag
is still showing when their system isn't doing anything).

Cool, and now Simon has agreed to this too.

I suggest the documentation of these columns needs to be extended to mention
that they will be NULL if no lag was measured recently, and to explain
the circumstances in which the numbers are cleared.

Done in v7. Thanks for the feedback!

--
Thomas Munro
http://www.enterprisedb.com


#50Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#48)
Re: Measuring replay lag

On 23 March 2017 at 01:02, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Thanks! Please find attached v7, which includes a note we can point
at when someone asks why it doesn't show 00:00:00, as requested.

Thanks.

Now that I look harder, the handling of logical lag seems like it
would be problematic in many cases. It's up to the plugin whether it sends
anything at all, so we should make a LagTrackerWrite() call only if
the plugin sends something. Otherwise the lag tracker will just slow
down logical replication.

What I think we should do is add an LSN onto LogicalDecodingContext to
represent the last LSN sent by the plugin, if any.

If that advances after the call to LogicalDecodingProcessRecord() then
we know we just sent a message and we can track that with
LagTrackerWrite().

So we make it the plugin's responsibility to maintain this LSN
correctly, if at all. (It may decide not to)

In English, that means the plugin will update the LSN after each
commit, and since we reply only on commit this will provide a series
of measurements we can use. It will still give a saw-tooth, but it's
better than flooding the LagTracker with false measurements.
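
To make that concrete, XLogSendLogical() could do something like the
following; this is only a sketch, and the last_sent_lsn field is
hypothetical, since nothing maintains it yet:

XLogRecPtr	before = logical_decoding_ctx->last_sent_lsn;	/* hypothetical */

LogicalDecodingProcessRecord(logical_decoding_ctx,
							 logical_decoding_ctx->reader);

/* Only feed the lag tracker when the plugin reports it sent something. */
if (logical_decoding_ctx->last_sent_lsn != before)
	LagTrackerWrite(logical_decoding_ctx->last_sent_lsn,
					GetCurrentTimestamp());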

I think it would be easier to add that as a minor cleanup/open item
after this commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#51Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#50)
Re: Measuring replay lag

On 23 March 2017 at 06:42, Simon Riggs <simon@2ndquadrant.com> wrote:

On 23 March 2017 at 01:02, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Thanks! Please find attached v7, which includes a note we can point
at when someone asks why it doesn't show 00:00:00, as requested.

Thanks.

Now that I look harder, the handling of logical lag seems like it
would be problematic in many cases. It's up to the plugin whether it sends
anything at all, so we should make a LagTrackerWrite() call only if
the plugin sends something. Otherwise the lag tracker will just slow
down logical replication.

What I think we should do is add an LSN onto LogicalDecodingContext to
represent the last LSN sent by the plugin, if any.

If that advances after the call to LogicalDecodingProcessRecord() then
we know we just sent a message and we can track that with
LagTrackerWrite().

So we make it the plugin's responsibility to maintain this LSN
correctly, if at all. (It may decide not to)

In English, that means the plugin will update the LSN after each
commit, and since we reply only on commit this will provide a series
of measurements we can use. It will still give a saw-tooth, but it's
better than flooding the LagTracker with false measurements.

I think it would be easier to add that as a minor cleanup/open item
after this commit.

Second thoughts... I'll just make LagTrackerWrite externally
available, so a plugin can send anything it wants to the tracker.
Which means I'm explicitly removing the "logical replication support"
from this patch.
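
For illustration, a plugin could then feed the tracker from its commit
callback, something like the sketch below; whether the commit LSN and
the current timestamp are the right pair to record is the plugin's
choice:

static void
pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
					 XLogRecPtr commit_lsn)
{
	OutputPluginPrepareWrite(ctx, true);
	/* ... emit this plugin's commit message here ... */
	OutputPluginWrite(ctx, true);

	/* One lag sample per commit, using the exported function. */
	LagTrackerWrite(commit_lsn, GetCurrentTimestamp());
}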

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#52Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#51)
1 attachment(s)
Re: Measuring replay lag

Second thoughts... I'll just make LagTrackerWrite externally
available, so a plugin can send anything it wants to the tracker.
Which means I'm explicitly removing the "logical replication support"
from this patch.

Done.

Here's the patch I'm looking to commit, with some docs and minor code
changes as discussed.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

replication-lag-v7sr1.patchapplication/octet-stream; name=replication-lag-v7sr1.patchDownload
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dcb2d33..eda14cf 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1696,6 +1696,36 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       standby server</entry>
     </row>
     <row>
+     <entry><structfield>write_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written it (but not yet
+      flushed it or applied it).  This can be used to gauge the delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_write</literal> incurred while committing if this
+      server was configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>flush_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written and flushed it
+      (but not yet applied it).  This can be used to gauge the delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_flush</literal> incurred while committing if this
+      server was configured as a synchronous standby.</entry>
+    </row>
+    <row>
+     <entry><structfield>replay_lag</></entry>
+     <entry><type>interval</></entry>
+     <entry>Time elapsed between flushing recent WAL locally and receiving
+      notification that this standby server has written, flushed and
+      applied it.  This can be used to gauge the delay that
+      <literal>synchronous_commit</literal> level
+      <literal>remote_apply</literal> incurred while committing if this
+      server was configured as a synchronous standby.</entry>
+    </row>
+    <row>
      <entry><structfield>sync_priority</></entry>
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
@@ -1745,6 +1775,45 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    listed; no information is available about downstream standby servers.
   </para>
 
+  <para>
+   The lag times reported in the <structname>pg_stat_replication</structname>
+   view are measurements of the time taken for recent WAL to be written,
+   flushed and replayed and for the sender to know about it.  These times
+   represent the commit delay that was (or would have been) introduced by each
+   synchronous commit level, if the remote server was configured as a
+   synchronous standby.  For an asynchronous standby, the
+   <structfield>replay_lag</structfield> column approximates the delay
+   before recent transactions became visible to queries.  If the standby
+   server has entirely caught up with the sending server and there is no more
+   WAL activity, the most recently measured lag times will continue to be
+   displayed for a short time and then show NULL.
+  </para>
+
+  <para>
+   Lag times work automatically for physical replication. Logical decoding
+   plugins may optionally emit tracking messages; if they do not, the tracking
+   mechanism will simply display NULL lag.
+  </para>
+
+  <note>
+   <para>
+    The reported lag times are not predictions of how long it will take for
+    the standby to catch up with the sending server assuming the current
+    rate of replay.  Such a system would show similar times while new WAL is
+    being generated, but would differ when the sender becomes idle.  In
+    particular, when the standby has caught up completely, 
+    <structname>pg_stat_replication</structname> shows the time taken to
+    write, flush and replay the most recent reported WAL position rather than
+    zero as some users might expect.  This is consistent with the goal of
+    measuring synchronous commit and transaction visibility delays for
+    recent write transactions.
+    To reduce confusion for users expecting a different model of lag, the
+    lag columns revert to NULL after a short time on a fully replayed idle
+    system. Monitoring systems should choose whether the represent this
+    as missing data, zero or continue to display the last known value.
+   </para>
+  </note>
+
   <table id="pg-stat-wal-receiver-view" xreflabel="pg_stat_wal_receiver">
    <title><structname>pg_stat_wal_receiver</structname> View</title>
    <tgroup cols="3">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff4cf3a..eca3c04 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11554,6 +11554,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
+	bool		streaming_reply_sent = false;
 
 	/*-------
 	 * Standby mode is implemented by a state machine:
@@ -11877,6 +11878,19 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					}
 
 					/*
+					 * Since we have replayed everything we have received so
+					 * far and are about to start waiting for more WAL, let's
+					 * tell the upstream server our replay location now so
+					 * that pg_stat_replication doesn't show stale
+					 * information.
+					 */
+					if (!streaming_reply_sent)
+					{
+						WalRcvForceReply();
+						streaming_reply_sent = true;
+					}
+
+					/*
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly.
 					 */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b6552da..c109ae8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -705,6 +705,9 @@ CREATE VIEW pg_stat_replication AS
             W.write_location,
             W.flush_location,
             W.replay_location,
+            W.write_lag,
+            W.flush_lag,
+            W.replay_lag,
             W.sync_priority,
             W.sync_state
     FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7561770..29d1319 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -190,6 +190,26 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/* A sample associating a log position with the time it was written. */
+typedef struct
+{
+	XLogRecPtr lsn;
+	TimestampTz time;
+} WalTimeSample;
+
+/* The size of our buffer of time samples. */
+#define LAG_TRACKER_BUFFER_SIZE 8192
+
+/* A mechanism for tracking replication lag. */
+static struct
+{
+	XLogRecPtr last_lsn;
+	WalTimeSample buffer[LAG_TRACKER_BUFFER_SIZE];
+	int write_head;
+	int read_heads[NUM_SYNC_REP_WAIT_MODE];
+	WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];
+} LagTracker;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -221,6 +241,7 @@ static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
+static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
 
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
 
@@ -246,6 +267,9 @@ InitWalSender(void)
 	 */
 	MarkPostmasterChildWalSender();
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+
+	/* Initialize empty timestamp buffer for lag tracking. */
+	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
 /*
@@ -1556,6 +1580,13 @@ ProcessStandbyReplyMessage(void)
 				flushPtr,
 				applyPtr;
 	bool		replyRequested;
+	TimeOffset	writeLag,
+				flushLag,
+				applyLag;
+	bool		clearLagTimes;
+	TimestampTz now;
+
+	static bool	fullyAppliedLastTime = false;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1570,6 +1601,30 @@ ProcessStandbyReplyMessage(void)
 		 (uint32) (applyPtr >> 32), (uint32) applyPtr,
 		 replyRequested ? " (reply requested)" : "");
 
+	/* See if we can compute the round-trip lag for these positions. */
+	now = GetCurrentTimestamp();
+	writeLag = LagTrackerRead(SYNC_REP_WAIT_WRITE, writePtr, now);
+	flushLag = LagTrackerRead(SYNC_REP_WAIT_FLUSH, flushPtr, now);
+	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
+
+	/*
+	 * If the standby reports that it has fully replayed the WAL in two
+	 * consecutive reply messages, then the second such message must result
+	 * from wal_receiver_status_interval expiring on the standby.  This is a
+	 * convenient time to forget the lag times measured when it last
+	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
+	 * until more WAL traffic arrives.
+	 */
+	clearLagTimes = false;
+	if (applyPtr == sentPtr)
+	{
+		if (fullyAppliedLastTime)
+			clearLagTimes = true;
+		fullyAppliedLastTime = true;
+	}
+	else
+		fullyAppliedLastTime = false;
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false);
@@ -1585,6 +1640,12 @@ ProcessStandbyReplyMessage(void)
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
 		walsnd->apply = applyPtr;
+		if (writeLag != -1 || clearLagTimes)
+			walsnd->writeLag = writeLag;
+		if (flushLag != -1 || clearLagTimes)
+			walsnd->flushLag = flushLag;
+		if (applyLag != -1 || clearLagTimes)
+			walsnd->applyLag = applyLag;
 		SpinLockRelease(&walsnd->mutex);
 	}
 
@@ -1973,6 +2034,9 @@ InitWalSenderSlot(void)
 			walsnd->write = InvalidXLogRecPtr;
 			walsnd->flush = InvalidXLogRecPtr;
 			walsnd->apply = InvalidXLogRecPtr;
+			walsnd->writeLag = -1;
+			walsnd->flushLag = -1;
+			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
@@ -2300,6 +2364,32 @@ XLogSendPhysical(void)
 	}
 
 	/*
+	 * Record the current system time as an approximation of the time at which
+	 * this WAL position was written for the purposes of lag tracking.
+	 *
+	 * In theory we could make XLogFlush() record a time in shmem whenever WAL
+	 * is flushed and we could get that time as well as the LSN when we call
+	 * GetFlushRecPtr() above (and likewise for the cascading standby
+	 * equivalent), but rather than putting any new code into the hot WAL path
+	 * it seems good enough to capture the time here.  We should reach this
+	 * after XLogFlush() runs WalSndWakeupProcessRequests(), and although that
+	 * may take some time, we read the WAL flush pointer and take the time
+	 * very close to together here so that we'll get a later position if it
+	 * very close together here so that we'll get a later position if it
+	 *
+	 * Because LagTrackerWriter ignores samples when the LSN hasn't advanced,
+	 * Because LagTrackerWrite ignores samples when the LSN hasn't advanced,
+	 * LSN.
+	 *
+	 * Note that the LSN is not necessarily the LSN for the data contained in
+	 * the present message; it's the end of the the WAL, which might be
+	 * the present message; it's the end of the WAL, which might be
+	 * out when that arbitrary LSN is eventually reported as written, flushed
+	 * and applied, so that it can measure the elapsed time.
+	 */
+	LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());
+
+	/*
 	 * If this is a historic timeline and we've reached the point where we
 	 * forked to the next timeline, stop streaming.
 	 *
@@ -2453,6 +2543,11 @@ XLogSendLogical(void)
 
 	if (record != NULL)
 	{
+		/*
+		 * Note the lack of any call to LagTrackerWrite() here; that is the
+		 * responsibility of the logical decoding plugin.  Response messages are
+		 * handled normally, so this responsibility does not extend to LagTrackerRead().
+		 */
 		LogicalDecodingProcessRecord(logical_decoding_ctx, logical_decoding_ctx->reader);
 
 		sentPtr = logical_decoding_ctx->reader->EndRecPtr;
@@ -2749,6 +2844,17 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+static Interval *
+offset_to_interval(TimeOffset offset)
+{
+	Interval *result = palloc(sizeof(Interval));
+
+	result->month = 0;
+	result->day = 0;
+	result->time = offset;
+
+	return result;
+}
 
 /*
  * Returns activity of walsenders, including pids and xlog locations sent to
@@ -2757,7 +2863,7 @@ WalSndGetStateString(WalSndState state)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	8
+#define PG_STAT_GET_WAL_SENDERS_COLS	11
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -2805,6 +2911,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
+		TimeOffset	writeLag;
+		TimeOffset	flushLag;
+		TimeOffset	applyLag;
 		int			priority;
 		WalSndState state;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2819,6 +2928,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
+		writeLag = walsnd->writeLag;
+		flushLag = walsnd->flushLag;
+		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		SpinLockRelease(&walsnd->mutex);
 
@@ -2860,7 +2972,22 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			priority = XLogRecPtrIsInvalid(walsnd->flush) ? 0 : priority;
 
-			values[6] = Int32GetDatum(priority);
+			if (writeLag < 0)
+				nulls[6] = true;
+			else
+				values[6] = IntervalPGetDatum(offset_to_interval(writeLag));
+
+			if (flushLag < 0)
+				nulls[7] = true;
+			else
+				values[7] = IntervalPGetDatum(offset_to_interval(flushLag));
+
+			if (applyLag < 0)
+				nulls[8] = true;
+			else
+				values[8] = IntervalPGetDatum(offset_to_interval(applyLag));
+
+			values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
@@ -2874,12 +3001,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 * states. We report just "quorum" for them.
 			 */
 			if (priority == 0)
-				values[7] = CStringGetTextDatum("async");
+				values[10] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
+				values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
-				values[7] = CStringGetTextDatum("potential");
+				values[10] = CStringGetTextDatum("potential");
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -2947,3 +3074,143 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 			WalSndShutdown();
 	}
 }
+
+/*
+ * Record the end of the WAL and the time it was flushed locally, so that
+ * LagTrackerRead can compute the elapsed time (lag) when this WAL position is
+ * eventually reported to have been written, flushed and applied by the
+ * standby in a reply message.
+ * Exported to allow logical decoding plugins to call this when they choose.
+ */
+void
+LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
+{
+	bool buffer_full;
+	int new_write_head;
+	int i;
+
+	if (!am_walsender)
+		return;
+
+	/*
+	 * If the lsn hasn't advanced since last time, then do nothing.  This way
+	 * we only record a new sample when new WAL has been written.
+	 */
+	if (LagTracker.last_lsn == lsn)
+		return;
+	LagTracker.last_lsn = lsn;
+
+	/*
+	 * If advancing the write head of the circular buffer would crash into any
+	 * of the read heads, then the buffer is full.  In other words, the
+	 * slowest reader (presumably apply) is the one that controls the release
+	 * of space.
+	 */
+	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
+	buffer_full = false;
+	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	{
+		if (new_write_head == LagTracker.read_heads[i])
+			buffer_full = true;
+	}
+
+	/*
+	 * If the buffer is full, for now we just rewind by one slot and overwrite
+	 * the last sample, as a simple (if somewhat uneven) way to lower the
+	 * sampling rate.  There may be better adaptive compaction algorithms.
+	 */
+	if (buffer_full)
+	{
+		new_write_head = LagTracker.write_head;
+		if (LagTracker.write_head > 0)
+			LagTracker.write_head--;
+		else
+			LagTracker.write_head = LAG_TRACKER_BUFFER_SIZE - 1;
+	}
+
+	/* Store a sample at the current write head position. */
+	LagTracker.buffer[LagTracker.write_head].lsn = lsn;
+	LagTracker.buffer[LagTracker.write_head].time = local_flush_time;
+	LagTracker.write_head = new_write_head;
+}
+
+/*
+ * Find out how much time has elapsed between the moment WAL position 'lsn'
+ * (or the highest known earlier LSN) was flushed locally and the time 'now'.
+ * We have a separate read head for each of the reported LSN locations we
+ * receive in replies from standby; 'head' controls which read head is
+ * used.  Whenever a read head crosses an LSN which was written into the
+ * lag buffer with LagTrackerWrite, we can use the associated timestamp to
+ * find out the time this LSN (or an earlier one) was flushed locally, and
+ * therefore compute the lag.
+ *
+ * Return -1 if no new sample data is available, and otherwise the elapsed
+ * time in microseconds.
+ */
+static TimeOffset
+LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
+{
+	TimestampTz time = 0;
+
+	/* Read all unread samples up to this LSN or end of buffer. */
+	while (LagTracker.read_heads[head] != LagTracker.write_head &&
+		   LagTracker.buffer[LagTracker.read_heads[head]].lsn <= lsn)
+	{
+		time = LagTracker.buffer[LagTracker.read_heads[head]].time;
+		LagTracker.last_read[head] =
+			LagTracker.buffer[LagTracker.read_heads[head]];
+		LagTracker.read_heads[head] =
+			(LagTracker.read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
+	}
+
+	if (time > now)
+	{
+		/* If the clock somehow went backwards, treat as not found. */
+		return -1;
+	}
+	else if (time == 0)
+	{
+		/*
+		 * We didn't cross a time.  If there is a future sample that we
+		 * haven't reached yet, and we've already reached at least one sample,
+		 * let's interpolate the local flush time.  This is mainly useful for
+		 * reporting a completely stuck apply position as having increasing
+		 * lag, since otherwise we'd have to wait for it to eventually start
+		 * moving again and cross one of our samples before we can show the
+		 * lag increasing.
+		 */
+		if (LagTracker.read_heads[head] != LagTracker.write_head &&
+			LagTracker.last_read[head].time != 0)
+		{
+			double fraction;
+			WalTimeSample prev = LagTracker.last_read[head];
+			WalTimeSample next = LagTracker.buffer[LagTracker.read_heads[head]];
+
+			Assert(lsn >= prev.lsn);
+			Assert(prev.lsn < next.lsn);
+
+			if (prev.time > next.time)
+			{
+				/* If the clock somehow went backwards, treat as not found. */
+				return -1;
+			}
+
+			/* See how far we are between the previous and next samples. */
+			fraction =
+				(double) (lsn - prev.lsn) / (double) (next.lsn - prev.lsn);
+
+			/* Scale the local flush time proportionally. */
+			time = (TimestampTz)
+				((double) prev.time + (next.time - prev.time) * fraction);
+		}
+		else
+		{
+			/* Couldn't interpolate due to lack of data. */
+			return -1;
+		}
+	}
+
+	/* Return the elapsed time since local flush time in microseconds. */
+	Assert(time != 0);
+	return now - time;
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 836d6ff..2b9a3c6 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2801,7 +2801,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,23,25}" "{o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index fd34964..44da2e9 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -97,6 +97,8 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 									  XLogRecPtr restart_lsn);
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
+extern void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
+
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
 
 #endif
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5e6ccfc..2c59056 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -47,6 +47,11 @@ typedef struct WalSnd
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/* Measured lag times, or -1 for unknown/none. */
+	TimeOffset	writeLag;
+	TimeOffset	flushLag;
+	TimeOffset	applyLag;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index bd13ae6..55b5ca7 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1831,10 +1831,13 @@ pg_stat_replication| SELECT s.pid,
     w.write_location,
     w.flush_location,
     w.replay_location,
+    w.write_lag,
+    w.flush_lag,
+    w.replay_lag,
     w.sync_priority,
     w.sync_state
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_location, write_location, flush_location, replay_location, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
#53Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#52)
Re: Measuring replay lag

On Thu, Mar 23, 2017 at 10:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Second thoughts... I'll just make LagTrackerWrite externally
available, so a plugin can send anything it wants to the tracker.
Which means I'm explicitly removing the "logical replication support"
from this patch.

Done.

Here's the patch I'm looking to commit, with some docs and minor code
changes as discussed.

Giving LagTrackerWrite external linkage seems sensible, assuming there
is a reasonable way for logical replication to decide when to call it.

+    system. Monitoring systems should choose whether the represent this
+    as missing data, zero or continue to display the last known value.

s/whether the/whether to/

--
Thomas Munro
http://www.enterprisedb.com


#54Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#52)
Re: Measuring replay lag

On Thu, Mar 23, 2017 at 10:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Second thoughts... I'll just make LagTrackerWrite externally
available, so a plugin can send anything it wants to the tracker.
Which means I'm explicitly removing the "logical replication support"
from this patch.

Done.

Here's the patch I'm looking to commit, with some docs and minor code
changes as discussed.

Thank you for committing this. Time-based replication lag tracking
seems to be a regular topic on mailing lists and IRC, so I hope that
this will provide what people are looking for and not simply replace
that discussion with a new discussion about what lag really means :-)

Many thanks to Simon and Fujii-san for convincing me to move the
buffer to the sender (which now seems so obviously better), to
Fujii-san for the idea of tracking write and flush lag too, and to
Abhijit, Sawada-san, Ian, Craig and Robert for valuable feedback.

--
Thomas Munro
http://www.enterprisedb.com


#55Craig Ringer
craig@2ndquadrant.com
In reply to: Thomas Munro (#54)
Re: Measuring replay lag

On 24 March 2017 at 05:39, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Fujii-san for the idea of tracking write and flush lag too

You mentioned wishing that logical replication would update sent lag
as the decoding position advances.

It appears to do just that already; see the references to restart_lsn
in StartLogicalReplication, and the update of sentPtr in
XLogSendLogical.

It's a bit misleading, since it hasn't *sent* anything; the output is
buffered until commit. But it's useful.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
