Fix checkpoint skip logic on idle systems by tracking LSN progress

Started by Michael Paquier, over 9 years ago, 63 messages

michael.paquier@gmail.com

over 9 years ago

2 attachment(s)

Hi all,

A couple of months back it was reported to pgsql-bugs that WAL
segments were always switched with a low value of archive_timeout even
if a system is completely idle:
/messages/by-id/20151016203031.3019.72930@wrigleys.postgresql.org
In short, a closer look at the problem has shown that the logic in
charge of checking if a checkpoint should be skipped or not is
currently broken, because it completely ignores standby snapshots in
its calculation of the WAL activity. So a checkpoint always occurs
after checkpoint_timeout on an idle system since hot_standby has been
introduced as wal_level. This did not get better from 9.4, since
standby snapshots are logged every 15s by the background writer
process. In 9.6, since wal_level = 'archive' and 'hot_standby'
actually have the same meaning, the skip logic that worked with
wal_level = 'archive' does not do its job anymore.

One solution that has been discussed is to track the progress of WAL
activity when doing record insertion by being able to mark some
records as not updating the progress of WAL. Standby snapshot records
fall into this category, making the checkpoint skip logic more robust.
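
To make the idea concrete, here is a minimal standalone sketch of
progress tracking, not the patch itself. The names Lsn, NUM_SLOTS,
record_insert(), get_progress() and checkpoint_can_be_skipped() are
hypothetical stand-ins, and all locking is left out on purpose: each
insertion slot remembers the start position of the last record that
counts as progress, and a checkpoint is skipped when the newest such
position is still the previous checkpoint record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;		/* hypothetical stand-in for XLogRecPtr */

#define NUM_SLOTS 8			/* hypothetical number of insertion slots */

/* Per-slot progress position; 0 plays the role of an invalid LSN. */
static Lsn	progress_at[NUM_SLOTS];

/*
 * Note an insertion that starts at start_lsn and went through 'slot'.
 * Records flagged as "no progress" (standby snapshots in the proposal)
 * leave the progress position untouched.
 */
static void
record_insert(int slot, Lsn start_lsn, bool counts_as_progress)
{
	assert(slot >= 0 && slot < NUM_SLOTS);
	if (counts_as_progress)
		progress_at[slot] = start_lsn;
}

/* Newest progress position across all slots. */
static Lsn
get_progress(void)
{
	Lsn			res = 0;
	int			i;

	for (i = 0; i < NUM_SLOTS; i++)
	{
		if (res < progress_at[i])
			res = progress_at[i];
	}
	return res;
}

/* Idle test: nothing significant was inserted after the last checkpoint. */
static bool
checkpoint_can_be_skipped(Lsn last_checkpoint_lsn)
{
	return get_progress() == last_checkpoint_lsn;
}
```

With a checkpoint record inserted at position 100, a later snapshot
record passed with counts_as_progress = false keeps the system looking
idle, while any other record makes the next checkpoint necessary.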

Attached is a patch implementing a solution for it, by adding in
WALInsertLock a new field that gets updated for each record to track
the LSN progress. This makes it possible to reliably skip the
generation of standby snapshots in the bgwriter or checkpoints on an
idle system.
Per discussion with Andres at PGcon, we decided that this is an
optimization, only for 9.7~ because this has been broken for a long
time. I have also changed XLogIncludeOrigin() to use a more generic
routine to set status flags for a record being inserted:
XLogSetFlags(). This routine can use two flags:
- XLOG_INCLUDE_ORIGIN to decide if the origin should be logged or not
- XLOG_NO_PROGRESS to decide at insertion if a record should update
the LSN progress or not.
Andres mentioned to me that we'd want to have something similar to
XLogIncludeOrigin, but while hacking I noticed that grouping both
things under the same umbrella made more sense.
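
As an illustration of how the two flags combine, here is a small
standalone sketch. The flag values are the ones defined in the attached
patch; the helpers set_flags(), reset_insertion(), logs_origin() and
updates_progress() are hypothetical names used only for this example.
Like XLogSetFlags() in the patch, set_flags() overwrites the current
value, so a record that wants both behaviors would pass both bits in a
single call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XLOG_INCLUDE_ORIGIN	0x01	/* include the origin */
#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */

/* Status flags of the in-progress insertion. */
static uint8_t status_flags = 0;

/* Overwrites the flags; unknown bits are rejected in assert builds. */
static void
set_flags(uint8_t flags)
{
	assert((flags & ~(XLOG_INCLUDE_ORIGIN | XLOG_NO_PROGRESS)) == 0);
	status_flags = flags;
}

/* Called once a record has been inserted, before the next one starts. */
static void
reset_insertion(void)
{
	status_flags = 0;
}

/* Should the origin be logged with this record? */
static bool
logs_origin(void)
{
	return (status_flags & XLOG_INCLUDE_ORIGIN) != 0;
}

/* Should this record move the progress LSN forward? */
static bool
updates_progress(void)
{
	return (status_flags & XLOG_NO_PROGRESS) == 0;
}
```

By default a record updates the progress LSN and does not log its
origin; each flag flips one of those two defaults independently.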

I am adding that to the commit fest of September.

Regards,
--
Michael

Attachments:

hs-checkpoints-v11.patch (invalid/octet-stream)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2eb04d6..76cb830 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2519,7 +2519,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2858,7 +2858,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3320,7 +3320,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -5852,7 +5852,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7504,7 +7504,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 95690ff..13e5f2f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5236,7 +5236,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b473f19..db8e594 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -439,11 +439,30 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup. This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -881,6 +900,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. As of
+ * now, this controls if the progress LSN positions are updated.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -893,7 +915,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -992,6 +1016,25 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the progress LSN positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is just more
+	 * simple to do that here where WAL record data and type is at hand.
+	 * The progress is set at the start position of the record tracked that
+	 * is being added, making easier checkpoint progress tracking as the
+	 * control file already saves the start LSN position of the last
+	 * checkpoint run. If an exclusive lock is taken for WAL insertion,
+	 * there is actually no need to update all the progression fields, so
+	 * just do it on the first one.
+	 */
+	if ((flags & XLOG_NO_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -4717,6 +4760,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7901,6 +7945,55 @@ GetFlushRecPtr(void)
 }
 
 /*
+ * GetProgressRecPtr -- Returns the newest WAL activity position, aimed
+ * at the last significant WAL activity, or in other words any activity
+ * not referring to standby logging as of now. Finding the last activity
+ * position is done by scanning each WAL insertion lock by taking directly
+ * the light-weight lock associated to it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
  * Get the time of the last xlog segment switch
  */
 pg_time_t
@@ -8160,7 +8253,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8241,34 +8334,30 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c37003a..b7e1e1b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	status_flags = 0;
 	begininsert_called = false;
 }
 
@@ -387,10 +387,10 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
  * Should this record include the replication origin if one is set up?
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+	status_flags = flags;
 }
 
 /*
@@ -450,7 +450,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, status_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +701,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((status_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 00f03d8..79cfd7b 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -310,7 +310,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -319,19 +319,23 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * only log if enough time has passed, that some WAL activity
+			 * has happened since last checkpoint, and that some xlog record
+			 * has been inserted.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index efcc25a..cd33bc4 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 762dfa6..911ea04 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -962,7 +962,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -986,6 +987,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_NO_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1033,6 +1036,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_NO_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0744c3f..b77e780 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -182,6 +182,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the origin */
+#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -209,7 +215,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -260,6 +268,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);

hs-checkpoints-v11-2.patch (invalid/octet-stream)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index db8e594..e25bdcd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8357,8 +8357,12 @@ CreateCheckPoint(int flags)
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
+		elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X",
+			 (uint32) (progress_lsn >> 32), (uint32) progress_lsn,
+			 (uint32) (ControlFile->checkPoint >> 32), (uint32) ControlFile->checkPoint);
 		if (progress_lsn == ControlFile->checkPoint)
 		{
+			elog(LOG, "Checkpoint is skipped");
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -8525,7 +8529,11 @@ CreateCheckPoint(int flags)
 	 * recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		LogStandbySnapshot();
+	{
+		XLogRecPtr lsn = LogStandbySnapshot();
+		elog(LOG, "snapshot taken by checkpoint %X/%X",
+			 (uint32) (lsn >> 32), (uint32) lsn);
+	}
 
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 79cfd7b..082e589 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -333,7 +333,9 @@ BackgroundWriterMain(void)
 				GetLastCheckpointRecPtr() < current_progress_lsn &&
 				last_progress_lsn < current_progress_lsn)
 			{
-				(void) LogStandbySnapshot();
+				XLogRecPtr lsn = LogStandbySnapshot();
+				elog(LOG, "snapshot taken by bgwriter %X/%X",
+					 (uint32) (lsn >> 32), (uint32) lsn);
 				last_snapshot_ts = now;
 				last_progress_lsn = current_progress_lsn;
 			}

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Michael Paquier (#1)

2 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Thu, May 19, 2016 at 6:57 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

> I am adding that to the commit fest of September.

And a lot of activity has happened here since. Attached are refreshed
patches based on da6c4f6. v11 still applies correctly but it's always
better to avoid hunk offsets when applying them.
--
Michael

Attachments:

hs-checkpoints-v12-2.patch (text/x-diff; charset=US-ASCII)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 289d240..0fd2e2b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8448,8 +8448,12 @@ CreateCheckPoint(int flags)
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
+		elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X",
+			 (uint32) (progress_lsn >> 32), (uint32) progress_lsn,
+			 (uint32) (ControlFile->checkPoint >> 32), (uint32) ControlFile->checkPoint);
 		if (progress_lsn == ControlFile->checkPoint)
 		{
+			elog(LOG, "Checkpoint is skipped");
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -8616,7 +8620,11 @@ CreateCheckPoint(int flags)
 	 * recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		LogStandbySnapshot();
+	{
+		XLogRecPtr lsn = LogStandbySnapshot();
+		elog(LOG, "snapshot taken by checkpoint %X/%X",
+			 (uint32) (lsn >> 32), (uint32) lsn);
+	}
 
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 3a791eb..7637a1d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -331,7 +331,9 @@ BackgroundWriterMain(void)
 				GetLastCheckpointRecPtr() < current_progress_lsn &&
 				last_progress_lsn < current_progress_lsn)
 			{
-				(void) LogStandbySnapshot();
+				XLogRecPtr lsn = LogStandbySnapshot();
+				elog(LOG, "snapshot taken by bgwriter %X/%X",
+					 (uint32) (lsn >> 32), (uint32) lsn);
 				last_snapshot_ts = now;
 				last_progress_lsn = current_progress_lsn;
 			}

hs-checkpoints-v12.patch (text/x-diff; charset=US-ASCII)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1b9a97..289d240 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,30 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup. This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -882,6 +901,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. As of
+ * now, this controls if the progress LSN positions are updated.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -894,7 +916,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -993,6 +1017,25 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the progress LSN positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is just more
+	 * simple to do that here where WAL record data and type is at hand.
+	 * The progress is set at the start position of the record tracked that
+	 * is being added, making easier checkpoint progress tracking as the
+	 * control file already saves the start LSN position of the last
+	 * checkpoint run. If an exclusive lock is taken for WAL insertion,
+	 * there is actually no need to update all the progression fields, so
+	 * just do it on the first one.
+	 */
+	if ((flags & XLOG_NO_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -4716,6 +4759,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7992,6 +8036,55 @@ GetFlushRecPtr(void)
 }
 
 /*
+ * GetProgressRecPtr -- Returns the newest WAL activity position, aimed
+ * at the last significant WAL activity, or in other words any activity
+ * not referring to standby logging as of now. Finding the last activity
+ * position is done by scanning each WAL insertion lock by taking directly
+ * the light-weight lock associated to it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
  * Get the time of the last xlog segment switch
  */
 pg_time_t
@@ -8251,7 +8344,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8332,34 +8425,30 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..23f1e67 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	status_flags = 0;
 	begininsert_called = false;
 }
 
@@ -387,10 +387,10 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
  * Should this record include the replication origin if one is set up?
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+	status_flags = flags;
 }
 
 /*
@@ -450,7 +450,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, status_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +701,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((status_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 1002034..3a791eb 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,23 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * only log if enough time has passed, that some WAL activity
+			 * has happened since last checkpoint, and that some xlog record
+			 * has been inserted.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 547f1a8..9774155 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -963,7 +963,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -987,6 +988,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_NO_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1034,6 +1037,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_NO_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..dbd4cff 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the origin */
+#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 9 years ago

In reply to: Michael Paquier (#1)

3 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hi,

I apologize in advance if the comments in this message repeat
one of the ideas discarded in the past thread. I might not have
grasped the discussion completely X(

The attached patches are rebased onto master, plus an additional
one mentioned below.

At Wed, 18 May 2016 17:57:49 -0400, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com>

Michael> A couple of months back is has been reported to pgsql-bugs that WAL
Michael> segments were always switched with a low value of archive_timeout even
Michael> if a system is completely idle:
Michael> /messages/by-id/20151016203031.3019.72930@wrigleys.postgresql.org
Michael> In short, a closer look at the problem has showed up that the logic in
Michael> charge of checking if a checkpoint should be skipped or not is
Michael> currently broken, because it completely ignores standby snapshots in
Michael> its calculation of the WAL activity. So a checkpoint always occurs
Michael> after checkpoint_timeout on an idle system since hot_standby has been
Michael> introduced as wal_level. This did not get better from 9.4, since
Michael> standby snapshots are logged every 15s by the background writer
Michael> process. In 9.6, since wal_level = 'archive' and 'hot_standby'
Michael> actually has the same meaning, the skip logic that worked with
Michael> wal_level = 'archive' does not do its job anymore.
Michael>
Michael> One solution that has been discussed is to track the progress of WAL
Michael> activity when doing record insertion by being able to mark some
Michael> records as not updating the progress of WAL. Standby snapshot records
Michael> enter in this category, making the checkpoint skip logic more robust.
Michael> Attached is a patch implementing a solution for it, by adding in

If I understand the old thread correctly, this still doesn't
solve the main issue, excessive checkpoints and segment
switching. The reason is the interaction between XLOG_SWITCH and
checkpoints, as mentioned there. I think we may treat XLOG_SWITCH
as NO_PROGRESS, since we can recover to the latest state without
it. If that is right, the second patch attached (v12-2) inserts
XLOG_SWITCH as NO_PROGRESS and skips segment switching when no
progress took place for the round.
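
As an illustration of the idea only (this is not the v12-2 patch), the following toy model shows how marking some records as NO_PROGRESS lets a periodic segment switch be skipped on an idle system. All names with a Toy/toy_ prefix are hypothetical stand-ins, and LSNs are plain integers; 0 plays the role of InvalidXLogRecPtr, so the model assumes no record starts at position 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;		/* hypothetical stand-in for XLogRecPtr */

#define TOY_NO_PROGRESS 0x02	/* same bit value as XLOG_NO_PROGRESS in the patch */

typedef struct ToyWal
{
	ToyLsn		insert_pos;		/* where the next record will start */
	ToyLsn		progress_at;	/* start of the last record counted as progress */
	ToyLsn		switch_progress;	/* progress_at seen at the last switch */
} ToyWal;

/* Insert a record of 'len' bytes and return its start position. */
static ToyLsn
toy_insert(ToyWal *wal, uint32_t len, uint8_t flags)
{
	ToyLsn		start = wal->insert_pos;

	assert(len > 0);
	wal->insert_pos += len;

	/* Records flagged NO_PROGRESS advance the WAL but not the progress LSN. */
	if ((flags & TOY_NO_PROGRESS) == 0)
		wal->progress_at = start;

	return start;
}

/*
 * Decide whether a timed segment switch is worth doing: only when some
 * record counted as progress since the previous switch.
 */
static bool
toy_switch_needed(ToyWal *wal)
{
	if (wal->progress_at == wal->switch_progress)
		return false;			/* idle since the last switch: skip it */

	wal->switch_progress = wal->progress_at;
	return true;
}
```

In this model a standby snapshot inserted with TOY_NO_PROGRESS between two calls to toy_switch_needed() leaves the second call returning false, which is the behavior wanted on an idle system.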

Michael> WALInsertLock a new field that gets updated for each record to track
Michael> the LSN progress. This allows to reliably skip the generation of
Michael> standby snapshots in the bgwriter or checkpoints on an idle system.

WALInsertLock doesn't seem to me to be a good place for
progressAt, even considering the discussion about adding a few
instructions (containing a branch) in the
hot section. BackgroundWriterMain blocks all running
XLogInsertRecord calls every 200 ms, not every 15 or 30 seconds (only for
replica, though). If this is correct, Amit's suggestion below
has more significance, and I rather agree with it. XLogCtl
is a more suitable place for progressAt in that case.

/messages/by-id/CAA4eK1LB9HDq+F8Lw8bGRQx6Sw42XaikX1UQ2DZk+AuEGbfjWA@mail.gmail.com
Amit> Taking and releasing locks again and again (which is done in patch)
Amit> matters much more than adding few instructions under it and I think
Amit> if you would have written the code in-a-way as in patch in some of
Amit> the critical path, it would have been regressed badly, but because
Amit> checkpoint doesn't happen often, reproducing regression is difficult.
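
To make the alternative concrete, here is a minimal sketch of keeping one progress value in a shared control structure instead of one per insertion lock. It is an assumption-laden illustration, not what either patch does: it uses C11 atomics rather than PostgreSQL's own spinlock or atomics primitives, and ToyXLogCtl and the toy_ functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToyLsn;		/* hypothetical stand-in for XLogRecPtr */

typedef struct ToyXLogCtl
{
	_Atomic ToyLsn progress_at;	/* single shared progress position */
} ToyXLogCtl;

/*
 * Called by an inserter after it has reserved its record. The value only
 * moves forward, so concurrent inserters finishing out of order cannot
 * push it backwards.
 */
static void
toy_advance_progress(ToyXLogCtl *ctl, ToyLsn start_pos)
{
	ToyLsn		cur;

	assert(ctl != NULL);
	cur = atomic_load(&ctl->progress_at);

	while (cur < start_pos)
	{
		/* On failure 'cur' is refreshed with the current value; retry. */
		if (atomic_compare_exchange_weak(&ctl->progress_at, &cur, start_pos))
			break;
	}
}

/* Reader side: one atomic load, no insertion lock is touched. */
static ToyLsn
toy_get_progress(ToyXLogCtl *ctl)
{
	return atomic_load(&ctl->progress_at);
}
```

The trade-off is that every progress-updating insertion now touches one shared cache line, whereas the reader no longer has to acquire each insertion lock in turn. Which side matters more is exactly the question the benchmarks discussed in this thread would have to answer.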

/messages/by-id/CAB7nPqSO6HVJ0T6LUT84PCy+_ihitdt64Ze2D+SJrHZy72Y0wg@mail.gmail.com

Amit> Also, I think it is worth to once take the performance data for
Amit> write tests (using pgbench 30 minute run or some other way) with
Amit> minimum checkpoint_timeout (i.e 30s) to see if the additional locking
Amit> has any impact on performance. I think taking locks at intervals
Amit> of 15s or 30s should not matter much, but it is better to be safe.

Michael> I don't think performance will be impacted, but there are no reasons
Michael> to not do any measurements either. I'll try to get some graphs
Michael> tomorrow with runs on my laptop, mainly looking for any effects of
Michael> this patch on TPS when checkpoints show up.

I don't think the impact is measurable on a laptop, where
at most 2 to 4 cores are available.
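
The exchange above assumes the lock-taking lookup runs only every 15 or 30 seconds, while the bgwriter loop wakes up every 200 ms. One way to get back to the assumed frequency, sketched here with hypothetical toy_ names, is to test the cheap timeout condition first and call the expensive progress lookup only once the interval has elapsed. The sketch leaves out the comparison against the last checkpoint position that the patch also performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToyLsn;			/* hypothetical stand-in for XLogRecPtr */
typedef int64_t ToyTimestampMs;		/* milliseconds, stand-in for TimestampTz */

#define TOY_SNAPSHOT_INTERVAL_MS 15000	/* same value as LOG_SNAPSHOT_INTERVAL_MS */

static ToyLsn toy_current_progress = 0;	/* what the lookup would return */
static int	toy_progress_lookups = 0;	/* counts the expensive calls */

/* Stand-in for GetProgressRecPtr(), which takes every insertion lock. */
static ToyLsn
toy_get_progress(void)
{
	toy_progress_lookups++;
	return toy_current_progress;
}

typedef struct ToyBgWriter
{
	ToyTimestampMs last_snapshot_ts;
	ToyLsn		last_progress_lsn;
} ToyBgWriter;

/* Returns true when a standby snapshot should be logged at time 'now'. */
static bool
toy_maybe_log_snapshot(ToyBgWriter *bgw, ToyTimestampMs now)
{
	ToyLsn		progress;

	assert(bgw != NULL);

	/* Cheap check first: most 200 ms wakeups stop here. */
	if (now < bgw->last_snapshot_ts + TOY_SNAPSHOT_INTERVAL_MS)
		return false;

	/* Only now pay for the lookup. */
	progress = toy_get_progress();
	if (bgw->last_progress_lsn >= progress)
		return false;

	bgw->last_snapshot_ts = now;
	bgw->last_progress_lsn = progress;
	return true;
}
```

Note that on a fully idle system this sketch still performs the lookup on every wakeup after the interval has passed, because last_snapshot_ts is only advanced when a snapshot is logged.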

Michael> Per discussion with Andres at PGcon, we decided that this is an
Michael> optimization, only for 9.7~ because this has been broken for a long
Michael> time. I have also changed XLogIncludeOrigin() to use a more generic
Michael> routine to set of status flags for a record being inserted:
Michael> XLogSetFlags(). This routine can use two flags:
Michael> - INCLUDE_ORIGIN to decide if the origin should be logged or not
Michael> - NO_PROGRESS to decide at insertion if a record should update the LSN
Michael> progress or not.
Michael> Andres mentioned me that we'd want to have something similar to
Michael> XLogIncludeOrigin, but while hacking I noticed that grouping both
Michael> things under the same umbrella made more sense.

This looks reasonable.
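
For readers following along, a small stand-alone model of the flag handling may help. It mirrors the two bit values and the assignment semantics of XLogSetFlags() as they appear in the attached patch, but the toy_ names are hypothetical and nothing here touches real WAL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INCLUDE_ORIGIN	0x01	/* same bit as XLOG_INCLUDE_ORIGIN */
#define TOY_NO_PROGRESS		0x02	/* same bit as XLOG_NO_PROGRESS */

/* Status flags of the in-progress insertion, as in xloginsert.c. */
static uint8_t toy_status_flags = 0;

/* Like XLogSetFlags() in the patch: assigns, does not OR into the old value. */
static void
toy_set_flags(uint8_t flags)
{
	toy_status_flags = flags;
}

/* Like XLogResetInsertion(): flags do not leak into the next record. */
static void
toy_reset_insertion(void)
{
	toy_status_flags = 0;
}

static bool
toy_logs_origin(void)
{
	return (toy_status_flags & TOY_INCLUDE_ORIGIN) != 0;
}

static bool
toy_updates_progress(void)
{
	return (toy_status_flags & TOY_NO_PROGRESS) == 0;
}
```

Because the setter assigns rather than accumulates, a caller that wants both behaviors has to pass TOY_INCLUDE_ORIGIN | TOY_NO_PROGRESS in a single call. Two separate calls would leave only the second flag set.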

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-hs-checkpoints-v12-1.patch (text/x-patch; charset=us-ascii)

From 686e4981c0d7ab3dd9e919f8b203aeb475f89a3b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Sep 2016 10:19:55 +0900
Subject: [PATCH 1/3] hs-checkpoints-v12-1

Rebased version of v11-1
---
 src/backend/access/heap/heapam.c          |  10 +--
 src/backend/access/transam/xact.c         |   2 +-
 src/backend/access/transam/xlog.c         | 127 +++++++++++++++++++++++++-----
 src/backend/access/transam/xloginsert.c   |  15 ++--
 src/backend/postmaster/bgwriter.c         |  22 +++---
 src/backend/postmaster/checkpointer.c     |   1 +
 src/backend/replication/logical/message.c |   2 +-
 src/backend/storage/ipc/standby.c         |   6 +-
 src/include/access/xlog.h                 |  12 ++-
 src/include/access/xloginsert.h           |   2 +-
 10 files changed, 154 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1b9a97..289d240 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,30 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup. This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -882,6 +901,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. As of
+ * now, this controls if the progress LSN positions are updated.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -894,7 +916,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -993,6 +1017,25 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the progress LSN positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is just more
+	 * simple to do that here where WAL record data and type is at hand.
+	 * The progress is set at the start position of the record tracked that
+	 * is being added, making easier checkpoint progress tracking as the
+	 * control file already saves the start LSN position of the last
+	 * checkpoint run. If an exclusive lock is taken for WAL insertion,
+	 * there is actually no need to update all the progression fields, so
+	 * just do it on the first one.
+	 */
+	if ((flags & XLOG_NO_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -4716,6 +4759,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7992,6 +8036,55 @@ GetFlushRecPtr(void)
 }
 
 /*
+ * GetProgressRecPtr -- Returns the newest WAL activity position, aimed
+ * at the last significant WAL activity, or in other words any activity
+ * not referring to standby logging as of now. Finding the last activity
+ * position is done by scanning each WAL insertion lock by taking directly
+ * the light-weight lock associated to it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
  * Get the time of the last xlog segment switch
  */
 pg_time_t
@@ -8251,7 +8344,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8332,34 +8425,30 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..23f1e67 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	status_flags = 0;
 	begininsert_called = false;
 }
 
@@ -387,10 +387,10 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
  * Should this record include the replication origin if one is set up?
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+	status_flags = flags;
 }
 
 /*
@@ -450,7 +450,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, status_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +701,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((status_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 1002034..3a791eb 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,23 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * only log if enough time has passed, that some WAL activity
+			 * has happened since last checkpoint, and that some xlog record
+			 * has been inserted.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index d702a48..a729a3d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -603,6 +603,7 @@ CheckArchiveTimeout(void)
 		XLogRecPtr	switchpoint;
 
 		/* OK, it's time to switch */
+		elog(LOG, "Request XLog Switch");
 		switchpoint = RequestXLogSwitch();
 
 		/*
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 547f1a8..9774155 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -963,7 +963,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -987,6 +988,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_NO_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1034,6 +1037,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_NO_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..dbd4cff 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the origin */
+#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);
-- 
2.9.2

0002-hs-checkpoints-v12-2.patch (text/x-patch; charset=us-ascii)

From 676ab7c15ccb99e4e18cb1aceef60795223ab569 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Sep 2016 16:21:54 +0900
Subject: [PATCH 2/3] hs-checkpoints-v12-2

Make XLOG_SWITCH NO_PROGRESS and manage log switching LSN to avoid
excessive log switching and checkpoints.
---
 src/backend/access/transam/xlog.c     |  2 ++
 src/backend/postmaster/checkpointer.c | 40 +++++++++++++++++++++++++----------
 2 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 289d240..a582759 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9191,6 +9191,8 @@ RequestXLogSwitch(void)
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	XLogSetFlags(XLOG_NO_PROGRESS);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a729a3d..4b7ff4b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -600,20 +600,38 @@ CheckArchiveTimeout(void)
 	/* Now we can do the real check */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		elog(LOG, "Request XLog Switch");
-		switchpoint = RequestXLogSwitch();
+		static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
 
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * switch segment only when any substantial progress have made from
+		 * the last segment switching by timeout. Segment switching by other
+		 * reasons will cause last_xlog_switch_lsn stay behind but it doesn't
+		 * matter since it just causes possible one excessive segment switch.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetProgressRecPtr() > last_xlog_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			/* OK, it's time to switch */
+			elog(LOG, "Request XLog Switch");
+			switchpoint = RequestXLogSwitch();
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+
+			/*
+			 * This switchpoint is not the LSN for the next XLOG record but
+			 * just after this log switch record. But either will do for
+			 * comparing with GetProgressRecPtr().
+			 */
+			last_xlog_switch_lsn = switchpoint;
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
-- 
2.9.2

0003-hs-checkpoints-v12-3.patch (text/x-patch; charset=us-ascii)

From 6cf5801882e1206c1da11b4627175f64b5a4ca97 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Sep 2016 10:20:12 +0900
Subject: [PATCH 3/3] hs-checkpoints-v12-3

Rebased version of v11-2. Several debugging logs.
---
 src/backend/access/transam/xlog.c | 10 +++++++++-
 src/backend/postmaster/bgwriter.c |  4 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a582759..3795037 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8448,8 +8448,12 @@ CreateCheckPoint(int flags)
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
+		elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X",
+			 (uint32) (progress_lsn >> 32), (uint32) progress_lsn,
+			 (uint32) (ControlFile->checkPoint >> 32), (uint32) ControlFile->checkPoint);
 		if (progress_lsn == ControlFile->checkPoint)
 		{
+			elog(LOG, "Checkpoint is skipped");
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -8616,7 +8620,11 @@ CreateCheckPoint(int flags)
 	 * recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		LogStandbySnapshot();
+	{
+		XLogRecPtr lsn = LogStandbySnapshot();
+		elog(LOG, "snapshot taken by checkpoint %X/%X",
+			 (uint32) (lsn >> 32), (uint32) lsn);
+	}
 
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 3a791eb..7637a1d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -331,7 +331,9 @@ BackgroundWriterMain(void)
 				GetLastCheckpointRecPtr() < current_progress_lsn &&
 				last_progress_lsn < current_progress_lsn)
 			{
-				(void) LogStandbySnapshot();
+				XLogRecPtr lsn = LogStandbySnapshot();
+				elog(LOG, "snapshot taken by bgwriter %X/%X",
+					 (uint32) (lsn >> 32), (uint32) lsn);
 				last_snapshot_ts = now;
 				last_progress_lsn = current_progress_lsn;
 			}
-- 
2.9.2

David Steele

david@pgmasters.net

over 9 years ago

In reply to: Kyotaro HORIGUCHI (#3)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 9/27/16 6:16 AM, Kyotaro HORIGUCHI wrote:

I apologize in advance that the comments in this message might
one of the ideas discarded in the past thread.. I might not grasp
the discussion completely X(

The attached patches are rebased to the master and additional one
mentioned below.

I tried the attached patch set and noticed an interesting behavior.
With archive_timeout=5 whenever I made a change I would get a WAL
segment within a few seconds as expected then another one would follow a
few minutes later.

Database init:
16M Sep 27 20:05 000000010000000000000001
16M Sep 27 20:09 000000010000000000000002

Create test table:
16M Sep 27 20:13 000000010000000000000003
16M Sep 27 20:15 000000010000000000000004

Insert row into test table:
16M Sep 27 20:46 000000010000000000000005
16M Sep 27 20:49 000000010000000000000006

The cluster was completely idle with no sessions connected in between
those three commands. Is it possible this is caused by:

+		 * switch segment only when any substantial progress have made from
+		 * the last segment switching by timeout. Segment switching by other
+		 * reasons will cause last_xlog_switch_lsn stay behind but it doesn't
+		 * matter since it just causes possible one excessive segment switch.
  		 */

I would like to give Michael a chance to respond to the updated patches
before delving deeper.

--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Kyotaro HORIGUCHI (#3)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Tue, Sep 27, 2016 at 7:16 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I apologize in advance that the comments in this message might
one of the ideas discarded in the past thread.. I might not grasp
the discussion completely X(

No problem.

At Wed, 18 May 2016 17:57:49 -0400, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com>

A couple of months back is has been reported to pgsql-bugs that WAL
segments were always switched with a low value of archive_timeout even
if a system is completely idle:
/messages/by-id/20151016203031.3019.72930@wrigleys.postgresql.org
In short, a closer look at the problem has showed up that the logic in
charge of checking if a checkpoint should be skipped or not is
currently broken, because it completely ignores standby snapshots in
its calculation of the WAL activity. So a checkpoint always occurs
after checkpoint_timeout on an idle system since hot_standby has been
introduced as wal_level. This did not get better from 9.4, since
standby snapshots are logged every 15s by the background writer
process. In 9.6, since wal_level = 'archive' and 'hot_standby'
actually has the same meaning, the skip logic that worked with
wal_level = 'archive' does not do its job anymore.

One solution that has been discussed is to track the progress of WAL
activity when doing record insertion by being able to mark some
records as not updating the progress of WAL. Standby snapshot records
enter in this category, making the checkpoint skip logic more robust.
Attached is a patch implementing a solution for it, by adding in

If I understand the old thread correctly, this still doesn't
solve the main issue, excessive checkpoint and segment
switching. The reason is the interaction between XLOG_SWITCH and
checkpoint as mentioned there. I think we may treat XLOG_SWITCH
as NO_PROGRESS, since we can recover to the lastest state without
it. If it is not wrong, the second patch attached (v12-2) inserts
XLOG_SWITCH as NO_PROGRESS and skips segment switching when no
progress took place for the round.

Possibly. That's a second problem I did not want to tackle now. I was
going to study that more precisely after this set of patches gets
done. There is already enough complication in them, and they solve a
large portion of the problem.

WALInsertLock a new field that gets updated for each record to track
the LSN progress. This allows to reliably skip the generation of
standby snapshots in the bgwriter or checkpoints on an idle system.

WALInsertLock doesn't seem to me to be a good place for
progressAt even considering the discussion about adding few
instructions (containing a branch) in the
hot-section. BackgroundWriterMain blocks all running
XLogInsertRecord every 200 ms, not 15 or 30 seconds (only for
replica, though). If this is correct, the Amit's suggestion below
will have more significance, and I rather agree with it. XLogCtl
is more suitable place for progressAt for the case.

Based on my past look at the problem and memories, having a variable
in WALInsertLock allows us to not have to touch the hottest spinlock
code path in WAL insertion and PG: ReserveXLogInsertLocation(). I'd
rather still avoid that.
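
As a rough illustration of that layout: each inserter writes only the slot tied to the insertion lock it already holds, and the reader scans all slots and keeps the maximum. The sketch below is single-threaded and leaves locking out entirely; in the patch the read happens under each slot's LWLock. ProgressSlot, NUM_SLOTS and the progress_* functions are hypothetical names for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's LSN type */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define NUM_SLOTS 8				/* plays the role of NUM_XLOGINSERT_LOCKS */

/* One slot per insertion lock; locking is omitted in this sketch. */
typedef struct
{
	XLogRecPtr	progressAt;		/* start LSN of the last significant record */
} ProgressSlot;

static ProgressSlot slots[NUM_SLOTS];

/* Reset every slot, as shared-memory initialization would. */
void
progress_reset(void)
{
	for (int i = 0; i < NUM_SLOTS; i++)
		slots[i].progressAt = InvalidXLogRecPtr;
}

/* Writer side: an inserter touches only its own slot, no shared counter. */
void
progress_record(int slot, XLogRecPtr start_pos)
{
	assert(slot >= 0 && slot < NUM_SLOTS);
	slots[slot].progressAt = start_pos;
}

/* Reader side: scan all slots and return the newest position seen. */
XLogRecPtr
progress_get(void)
{
	XLogRecPtr	res = InvalidXLogRecPtr;

	for (int i = 0; i < NUM_SLOTS; i++)
	{
		if (res < slots[i].progressAt)
			res = slots[i].progressAt;
	}
	return res;
}
```

The cost moves to the reader, which has to visit every slot, while the insert path gains only a store into memory its lock already covers.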

Also, I think it is worth to once take the performance data for
write tests (using pgbench 30 minute run or some other way) with
minimum checkpoint_timeout (i.e 30s) to see if the additional locking
has any impact on performance. I think taking locks at intervals
of 15s or 30s should not matter much, but it is better to be safe.

I don't think performance will be impacted, but there are no reasons
to not do any measurements either. I'll try to get some graphs
tomorrow with runs on my laptop, mainly looking for any effects of
this patch on TPS when checkpoints show up.

I don't think the impact is measurable on a laptop, where 2 to 4
cores available at most.

Yeah, I couldn't either. Still, I would like to keep the hot spinlock
section as small as possible if we can.

Per discussion with Andres at PGcon, we decided that this is an
optimization, only for 9.7~ because this has been broken for a long
time. I have also changed XLogIncludeOrigin() to use a more generic
routine to set of status flags for a record being inserted:
XLogSetFlags(). This routine can use two flags:
- INCLUDE_ORIGIN to decide if the origin should be logged or not
- NO_PROGRESS to decide at insertion if a record should update the LSN
progress or not.
Andres mentioned me that we'd want to have something similar to
XLogIncludeOrigin, but while hacking I noticed that grouping both
things under the same umbrella made more sense.

This looks reasonable.

Thanks. That would be fine as a first, independent patch, but that
would mean that XLogSetFlags has only one value, which is a bit
pointless, so I grouped them. And this makes the existing interface
cleaner as well for replication origins.
--
Michael

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: David Steele (#4)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Wed, Sep 28, 2016 at 6:12 AM, David Steele <david@pgmasters.net> wrote:

I tried the attached patch set and noticed an interesting behavior. With
archive_timeout=5 whenever I made a change I would get a WAL segment within
a few seconds as expected then another one would follow a few minutes later.

That's intentional. We may be able to mark XLOG_SWITCH records as not
updating the progress LSN, but I wanted to tackle that as a separate
patch once we got the basics done correctly, which is still what I
think this patch is doing. I should have been more precise upthread:
this patch makes the handling of checkpoint skip logic correct for
only standby snapshots, not segment switches, and puts the infra to
handle other things.
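
A toy model of the flag handling may make the distinction concrete. The two flag values are copied from the xlog.h hunk of the patch; WalModel and model_insert() are hypothetical and only mimic how the progress LSN moves, not how WAL is actually written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's LSN type */

/* Flag bits as defined by the patch in xlog.h. */
#define XLOG_INCLUDE_ORIGIN	0x01	/* include the origin */
#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */

/* Hypothetical model of the insert position and the progress LSN. */
typedef struct
{
	XLogRecPtr	insert_pos;		/* where the next record will start */
	XLogRecPtr	progress_lsn;	/* start of the last record that counted */
} WalModel;

/*
 * "Insert" a record of len bytes.  The progress LSN moves to the start of
 * the record unless XLOG_NO_PROGRESS is set.  Returns the end position.
 */
XLogRecPtr
model_insert(WalModel *wal, uint32_t len, uint8_t flags)
{
	XLogRecPtr	start;

	assert(wal != NULL);
	start = wal->insert_pos;
	wal->insert_pos += len;
	if ((flags & XLOG_NO_PROGRESS) == 0)
		wal->progress_lsn = start;
	return wal->insert_pos;
}
```

With the first patch alone, a standby snapshot goes through with XLOG_NO_PROGRESS and leaves progress_lsn untouched, while a segment switch record is inserted without the flag and counts as activity.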
--
Michael

David Steele

david@pgmasters.net

over 9 years ago

In reply to: Michael Paquier (#6)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 9/28/16 3:35 AM, Michael Paquier wrote:

On Wed, Sep 28, 2016 at 6:12 AM, David Steele <david@pgmasters.net> wrote:

I tried the attached patch set and noticed an interesting behavior. With
archive_timeout=5 whenever I made a change I would get a WAL segment within
a few seconds as expected then another one would follow a few minutes later.

That's intentional. We may be able to mark XLOG_SWITCH records as not
updating the progress LSN, but I wanted to tackle that as a separate
patch once we got the basics done correctly, which is still what I
think this patch is doing. I should have been more precise upthread:
this patch makes the handling of checkpoint skip logic correct for
only standby snapshots, not segment switches, and puts the infra to
handle other things.

OK, I've done functional testing and this patch seems to work as
specified (including the caveat noted above). Some comments:

* [PATCH 1/3] hs-checkpoints-v12-1

+++ b/src/backend/access/transam/xlog.c
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.

How about, "Taking a lock is also necessary..."

+ LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);

That's a lot of exclusive locks and that would seem to have performance
implications. It seems to me this is going to be a hard one to
benchmark because the regression (if any) would only be seen under heavy
load on a very large system.

In general I agree with the other comments that this could end up being
a problem. On the other hand, since the additional locks are only taken
at checkpoint or archive_timeout it may not be that big a deal.

+++ b/src/backend/access/transam/xloginsert.c
 * Should this record include the replication origin if one is set up?

Outdated comment from XLogIncludeOrigin().

* [PATCH 2/3] hs-checkpoints-v12-2

+++ b/src/backend/postmaster/checkpointer.c
+			/* OK, it's time to switch */
+			elog(LOG, "Request XLog Switch");

LOG level seems a bit much here, perhaps DEBUG1?

* [PATCH 3/3] hs-checkpoints-v12-3

+		 * switch segment only when any substantial progress have made from
+		 * reasons will cause last_xlog_switch_lsn stay behind but it doesn't

How about, "Switch segment only when substantial progress has been made
after the last segment was switched by a timeout. Segment switching for
other reasons..."

+++ b/src/backend/access/transam/xlog.c
+		elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X",
+			elog(LOG, "Checkpoint is skipped");
+		elog(LOG, "snapshot taken by checkpoint %X/%X",

Same for the above, seems like it would just be noise for most users.

+++ b/src/backend/postmaster/bgwriter.c
+				elog(LOG, "snapshot taken by bgwriter %X/%X",

Ditto.

I don't see any unintended consequences in this patch but it doesn't
mean there aren't any. I'm definitely concerned by the exclusive locks
but it may turn out they do not actually represent a bottleneck.

This does seem like the kind of patch that should get committed very
early in the release cycle to allow maximum time for regression testing.

--
-David
david@pgmasters.net

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: David Steele (#7)

2 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net> wrote:

OK, I've done functional testing and this patch seems to work as
specified (including the caveat noted above). Some comments:

Thanks!

* [PATCH 1/3] hs-checkpoints-v12-1
+++ b/src/backend/access/transam/xlog.c
+        * Taking a lock is as well necessary to prevent potential torn reads
+        * on some platforms.
How about, "Taking a lock is also necessary..."

+ LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);

That's a lot of exclusive locks and that would seem to have performance
implications. It seems to me this is going to be a hard one to
benchmark because the regression (if any) would only be seen under heavy
load on a very large system.

In general I agree with the other comments that this could end up being
a problem. On the other hand, since the additional locks are only taken
at checkpoint or archive_timeout it may not be that big a deal.

Yes, I did some tests a couple of months back on my laptop, which has 4
cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
contention and running a bunch of INSERTs using 4 clients on 4
different relations, I could not catch a difference. Autovacuum was
disabled to eliminate any noise. I tried checkpoint_timeout at 30s to
see its effects, as well as larger values to see the impact of the
standby snapshot taken by the bgwriter. Other thoughts are welcome.

+++ b/src/backend/access/transam/xloginsert.c
 * Should this record include the replication origin if one is set up?

Outdated comment from XLogIncludeOrigin().

Fixed. I also added some comments on top of XLogSetFlags to mention
which flags can be used. I didn't think that it was necessary to add
an assertion here. Also, I noticed that the comment on top of
XLogInsertRecord mentioned those flags but was incorrect.

* [PATCH 2/3] hs-checkpoints-v12-2
+++ b/src/backend/postmaster/checkpointer.c
+                       /* OK, it's time to switch */
+                       elog(LOG, "Request XLog Switch");
LOG level seems a bit much here, perhaps DEBUG1?

That's from Horiguchi-san's patch, and looking at it those would
definitely be better as DEBUG1. For now, to keep things simple, I
think that we had better set this patch aside. I was planning to come
back to this thing anyway once we are done with the first problem.

* [PATCH 3/3] hs-checkpoints-v12-3
+                * switch segment only when any substantial progress have made from
+                * reasons will cause last_xlog_switch_lsn stay behind but it doesn't
How about, "Switch segment only when substantial progress has been made
after the last segment was switched by a timeout. Segment switching for
other reasons..."
+++ b/src/backend/access/transam/xlog.c
+               elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X",
+                       elog(LOG, "Checkpoint is skipped");
+               elog(LOG, "snapshot taken by checkpoint %X/%X",
Same for the above, seems like it would just be noise for most users.
+++ b/src/backend/postmaster/bgwriter.c
+                               elog(LOG, "snapshot taken by bgwriter %X/%X",
Ditto.

The original patch was developed to ease debugging, and I chose LOG to
not be polluted with a bunch of DEBUG1 entries :)

Now we can do something, as follows:
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8450,6 +8450,8 @@ CreateCheckPoint(int flags)
    {
        if (progress_lsn == ControlFile->checkPoint)
        {
+           if (log_checkpoints)
+               ereport(LOG, (errmsg("checkpoint skipped")));
            WALInsertLockRelease();
            LWLockRelease(CheckpointLock);
            END_CRIT_SECTION();
Letting users know that the checkpoint has been skipped sounds like a
good idea. Perhaps that's better if squashed with the first patch.
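
The skip decision itself reduces to a small predicate. The sketch below assumes, as the patch comments state, that the progress LSN is the start position of the last significant record and that the control file stores the start LSN of the last checkpoint record, so the two are equal when nothing has happened since. The SKETCH_* flag values and checkpoint_can_be_skipped() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's LSN type */

/* Placeholder request flags; names and values are assumptions of this sketch. */
#define SKETCH_CHECKPOINT_IS_SHUTDOWN		0x0001
#define SKETCH_CHECKPOINT_END_OF_RECOVERY	0x0002
#define SKETCH_CHECKPOINT_FORCE				0x0008

/*
 * Return true when a checkpoint request can be skipped: it is neither a
 * shutdown, end-of-recovery nor forced checkpoint, and the last record that
 * counted as progress is the previous checkpoint record itself.
 */
bool
checkpoint_can_be_skipped(int flags, XLogRecPtr progress_lsn,
						  XLogRecPtr last_ckpt_lsn)
{
	const int	must_run = SKETCH_CHECKPOINT_IS_SHUTDOWN |
		SKETCH_CHECKPOINT_END_OF_RECOVERY |
		SKETCH_CHECKPOINT_FORCE;

	assert(flags >= 0);
	if ((flags & must_run) != 0)
		return false;
	return progress_lsn == last_ckpt_lsn;
}
```

A caller that also honors log_checkpoints would emit the "checkpoint skipped" message in the branch where this returns true.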

I don't see any unintended consequences in this patch but it doesn't
mean there aren't any. I'm definitely concerned by the exclusive locks
but it may turn out they do not actually represent a bottleneck.

It's hard to see a difference. Perhaps I didn't try hard enough.

For now, attached are two patches that could just be squashed into one.
--
Michael

Attachments:

hs-checkpoints-v13-2.patch (text/x-diff; charset=US-ASCII)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f95fdb8..e87caa6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8450,6 +8450,8 @@ CreateCheckPoint(int flags)
 	{
 		if (progress_lsn == ControlFile->checkPoint)
 		{
+			if (log_checkpoints)
+				ereport(LOG, (errmsg("checkpoint skipped")));
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();

hs-checkpoints-v13-1.patch (text/x-diff; charset=US-ASCII)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1b9a97..f95fdb8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,30 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup. This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -882,6 +901,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. See
+ * XLogSetFlags() for more details.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -894,7 +916,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -993,6 +1017,25 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the progress LSN positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is just more
+	 * simple to do that here where WAL record data and type is at hand.
+	 * The progress is set at the start position of the record tracked that
+	 * is being added, making easier checkpoint progress tracking as the
+	 * control file already saves the start LSN position of the last
+	 * checkpoint run. If an exclusive lock is taken for WAL insertion,
+	 * there is actually no need to update all the progression fields, so
+	 * just do it on the first one.
+	 */
+	if ((flags & XLOG_NO_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -4716,6 +4759,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7992,6 +8036,55 @@ GetFlushRecPtr(void)
 }
 
 /*
+ * GetProgressRecPtr -- Returns the newest WAL activity position, aimed
+ * at the last significant WAL activity, or in other words any activity
+ * not referring to standby logging as of now. Finding the last activity
+ * position is done by scanning each WAL insertion lock by taking directly
+ * the light-weight lock associated to it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is also necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
  * Get the time of the last xlog segment switch
  */
 pg_time_t
@@ -8251,7 +8344,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8332,34 +8425,30 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..5b0590c 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	status_flags = 0;
 	begininsert_called = false;
 }
 
@@ -384,13 +384,20 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 }
 
 /*
- * Should this record include the replication origin if one is set up?
+ * Set insert status flags for the upcoming WAL record.
+ *
+ * The flags that can be used here are:
+ * - XLOG_INCLUDE_ORIGIN, to determine if the replication origin should be
+ *   included in the record.
+ * - XLOG_NO_PROGRESS, to not update the WAL progress trackers when inserting
+ *   the record.
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+
+	status_flags = flags;
 }
 
 /*
@@ -450,7 +457,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, status_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +708,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((status_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 1002034..c790ac8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,23 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if enough time has passed, that some WAL activity
+			 * has happened since last checkpoint, and that some new WAL
+			 * records have been inserted since the last time we came here.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index d702a48..a729a3d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -603,6 +603,7 @@ CheckArchiveTimeout(void)
 		XLogRecPtr	switchpoint;
 
 		/* OK, it's time to switch */
+		elog(LOG, "Request XLog Switch");
 		switchpoint = RequestXLogSwitch();
 
 		/*
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 547f1a8..9774155 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -963,7 +963,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -987,6 +988,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_NO_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1034,6 +1037,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_NO_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..338c796 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the replication origin */
+#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);
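
For readers following the patch above, the progress tracking it adds can be sketched outside PostgreSQL roughly as follows. This is only an illustration under stated assumptions: every Demo*/demo_* name is hypothetical, the real code takes each WAL insertion lock around the read, and locking is left out here entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names exist in PostgreSQL. */
typedef uint64_t DemoRecPtr;
#define DEMO_INVALID_PTR   ((DemoRecPtr) 0)
#define DEMO_NUM_SLOTS     8     /* plays the role of NUM_XLOGINSERT_LOCKS */
#define DEMO_NO_PROGRESS   0x02  /* mirrors XLOG_NO_PROGRESS in the patch */

/* One progress position per insertion slot, like progressAt per lock. */
static DemoRecPtr demo_progress_at[DEMO_NUM_SLOTS];

/*
 * Note the insertion of a record starting at start_pos.  When all slots
 * are held at once, only slot 0 is updated, as the patch does.
 * No locking is done here; this is a single-threaded sketch.
 */
static void
demo_note_insert(int slot, bool holding_all, DemoRecPtr start_pos,
                 uint8_t flags)
{
    assert(slot >= 0 && slot < DEMO_NUM_SLOTS);

    if ((flags & DEMO_NO_PROGRESS) != 0)
        return;                 /* e.g. a standby snapshot record */

    demo_progress_at[holding_all ? 0 : slot] = start_pos;
}

/* Newest significant position: the maximum over all slots. */
static DemoRecPtr
demo_get_progress(void)
{
    DemoRecPtr  res = DEMO_INVALID_PTR;

    for (int i = 0; i < DEMO_NUM_SLOTS; i++)
    {
        if (res < demo_progress_at[i])
            res = demo_progress_at[i];
    }
    return res;
}
```

A record flagged with DEMO_NO_PROGRESS leaves every slot untouched, so a later lookup still returns the position of the last record that counted as activity.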

#9

David Steele

david@pgmasters.net

over 9 years ago

In reply to: Michael Paquier (#8)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 9/28/16 10:32 PM, Michael Paquier wrote:

On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net> wrote:

In general I agree with the other comments that this could end up being
a problem. On the other hand, since the additional locks are only taken
at checkpoint or archive_timeout it may not be that big a deal.

Yes, I did some tests on my laptop a couple of months back, that has 4
cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
contention and performing a bunch of INSERT using 4 clients on 4
different relations I could not catch a difference.. Autovacuum was
disabled to eliminate any noise. I tried checkpoint_segments at 30s to
see its effects, as well as larger values to see the impact with the
standby snapshot taken by the bgwriter. Other thoughts are welcome.

I don't have any better ideas than that.

+++ b/src/backend/postmaster/checkpointer.c
+                       /* OK, it's time to switch */
+                       elog(LOG, "Request XLog Switch");

LOG level seems a bit much here, perhaps DEBUG1?

That's from Horiguchi-san's patch, and those would definitely be
better as DEBUG1 by the look of it. For now, and in order to keep
things simple, I think that we had better discard this patch. I was
planning to come back to this thing anyway once we are done with the
first problem.

I still see this:

+++ b/src/backend/postmaster/checkpointer.c
  		/* OK, it's time to switch */
+		elog(LOG, "Request XLog Switch");

Well for now attached are two patches, that could just be squashed into one.

Yes, I think that makes sense.

More importantly, there is a regression. With your new patch the xlogs
are switching on archive_timeout again even with no changes. The v12
worked fine.

The differences are all in 0002-hs-checkpoints-v12-2.patch and as far as
I can see the patch does not work correctly without these changes. Am I
missing something?

--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 9 years ago

In reply to: David Steele (#9)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Sorry, I might have broken this thread somehow..

At Thu, 29 Sep 2016 11:26:29 -0400, David Steele <david@pgmasters.net> wrote in <30095aea-3910-dbb7-1790-a579fb93fa5e@pgmasters.net>

On 9/28/16 10:32 PM, Michael Paquier wrote:

On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net>
wrote:

In general I agree with the other comments that this could end up
being
a problem. On the other hand, since the additional locks are only
taken
at checkpoint or archive_timeout it may not be that big a deal.

Yes, I did some tests on my laptop a couple of months back, that has 4
cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
contention and performing a bunch of INSERT using 4 clients on 4
different relations I could not catch a difference.. Autovacuum was
disabled to eliminate any noise. I tried checkpoint_segments at 30s to
see its effects, as well as larger values to see the impact with the
standby snapshot taken by the bgwriter. Other thoughts are welcome.

I don't have any better ideas than that.

I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially when
NUM_XLOGINSERT_LOCKS is *large*. So reducing the number seems
to alleviate the impact rather than worsen it. But it actually doesn't
seem so harmful up to 8. (Even though I don't like the locking in
GetProgressRecPtr..)

The call to GetProgressRecPtr that is currently possibly harmful is the
one in BackgroundWriterMain. It should be called at intervals of
BgWriterDelay, and any time the bgwriter receives SIGUSR1. But I don't
see what sends SIGUSR1 to the bgwriter..

+++ b/src/backend/postmaster/checkpointer.c
+                       /* OK, it's time to switch */
+                       elog(LOG, "Request XLog Switch");

LOG level seems a bit much here, perhaps DEBUG1?

That's from Horiguchi-san's patch, and those would be definitely
better as DEBUG1 by looking at it. Now and in order to keep things
simple I think that we had better discard this patch for now. I was
planning to come back to this thing anyway once we are done with the
first problem.

I still see this:

+++ b/src/backend/postmaster/checkpointer.c
/* OK, it's time to switch */
+		elog(LOG, "Request XLog Switch");

Well for now attached are two patches, that could just be squashed
into one.

Mmmm. Sorry, this was for my own quick private debugging and leaked
into the patch.. But I don't mind leaving it at DEBUG2 if it seems
useful.

Yes, I think that makes sense.

More importantly, there is a regression. With your new patch the
xlogs are switching on archive_timeout again even with no changes.
The v12 worked fine.

As Michael mentioned in this or another thread, it is another
issue that he wants to solve separately. I personally doubt that
this patch (v11 and v13) can be evaluated alone without it, but
we can review this with the excessive switching problem, perhaps?

The differences are all in 0002-hs-checkpoints-v12-2.patch and as far
as I can see the patch does not work correctly without these changes.
Am I missing something?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#11

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 9 years ago

In reply to: Kyotaro HORIGUCHI (#10)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Sorry, I wrote the wrong thing.

At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>

Sorry, I might have broken this thread somehow..

At Thu, 29 Sep 2016 11:26:29 -0400, David Steele <david@pgmasters.net> wrote in <30095aea-3910-dbb7-1790-a579fb93fa5e@pgmasters.net>

On 9/28/16 10:32 PM, Michael Paquier wrote:

On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net>
wrote:

In general I agree with the other comments that this could end up
being
a problem. On the other hand, since the additional locks are only
taken
at checkpoint or archive_timeout it may not be that big a deal.

Yes, I did some tests on my laptop a couple of months back, that has 4
cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
contention and performing a bunch of INSERT using 4 clients on 4
different relations I could not catch a difference.. Autovacuum was
disabled to eliminate any noise. I tried checkpoint_segments at 30s to
see its effects, as well as larger values to see the impact with the
standby snapshot taken by the bgwriter. Other thoughts are welcome.

I don't have any better ideas than that.

I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially when

But I suspect that GetProgressRecPtr could be harmful.

NUM_XLOGINSERT_LOCKS is *large*. So reducing the number seems
to alleviate the impact rather than worsen it. But it actually doesn't
seem so harmful up to 8. (Even though I don't like the locking in
GetProgressRecPtr..)

The call to GetProgressRecPtr that is currently possibly harmful is the
one in BackgroundWriterMain. It should be called at intervals of
BgWriterDelay, and any time the bgwriter receives SIGUSR1. But I don't
see what sends SIGUSR1 to the bgwriter..

+++ b/src/backend/postmaster/checkpointer.c
+                       /* OK, it's time to switch */
+                       elog(LOG, "Request XLog Switch");

LOG level seems a bit much here, perhaps DEBUG1?

That's from Horiguchi-san's patch, and those would be definitely
better as DEBUG1 by looking at it. Now and in order to keep things
simple I think that we had better discard this patch for now. I was
planning to come back to this thing anyway once we are done with the
first problem.

I still see this:

+++ b/src/backend/postmaster/checkpointer.c
/* OK, it's time to switch */
+		elog(LOG, "Request XLog Switch");

Well for now attached are two patches, that could just be squashed
into one.

Mmmm. Sorry, this was for my own quick private debugging and leaked
into the patch.. But I don't mind leaving it at DEBUG2 if it seems
useful.

Yes, I think that makes sense.

More importantly, there is a regression. With your new patch the
xlogs are switching on archive_timeout again even with no changes.
The v12 worked fine.

As Michael mentioned in this or another thread, it is another
issue that he wants to solve separately. I personally doubt that
this patch (v11 and v13) can be evaluated alone without it, but
we can review this with the excessive switching problem, perhaps?

The differences are all in 0002-hs-checkpoints-v12-2.patch and as far
as I can see the patch does not work correctly without these changes.
Am I missing something?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#12

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Kyotaro HORIGUCHI (#11)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>

I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially when

But I suspect that GetProgressRecPtr could be harmful.

Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
nproc and reducing checkpoint_timeout. That's what I did but..
--
Michael


#13

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Michael Paquier (#12)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>

I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially when

But I suspect that GetProgressRecPtr could be harmful.

Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
nproc and reducing checkpoint_timeout. That's what I did but..

Note: I am moving this patch to next CF.
--
Michael


#14

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Michael Paquier (#13)

1 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

(Squashing replies)

On Fri, Sep 30, 2016 at 6:13 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>

I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially when

But I suspect that GetProgressRecPtr could be harmful.

Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
nproc and reducing checkpoint_timeout. That's what I did but..

Note: I am moving this patch to next CF.

And I am back on it more seriously... And I am taking back what I said upthread.

I looked at the v12 that Horiguchi-san has written, and that seems
correct to me. So I have squashed everything into a single patch,
including the log entry that gets logged with log_checkpoints. Testing
with archive_timeout set to 10s and checkpoint_timeout set to 30s,
sometimes triggering manual activity with CREATE TABLE/whatever and a
manual pg_switch_xlog(), I am able to see that checkpoints are
correctly skipped or generated.

There was also a compilation error with ereport(). Not sure how I
missed that... Likely too many patches handled these days.

I have also updated the description of archive_timeout to drop the note
that increasing checkpoint_timeout would reduce unnecessary checkpoints
on an idle system. With this patch, on an idle system checkpoints are
just skipped as they should be.

How does that look?
--
Michael

Attachments:

hs-checkpoints-v14.patch (application/x-download)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e826c19..1e35ede 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2826,12 +2826,9 @@ include_dir 'conf.d'
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
+        including a single checkpoint.  Note that archived files that are
+        closed early due to a forced switch are still the same length as
+        completely full files.  Therefore, it is unwise to use a very short
         <varname>archive_timeout</> &mdash; it will bloat your archive
         storage.  <varname>archive_timeout</> settings of a minute or so are
         usually reasonable.  You should consider using streaming replication,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 08c87f9..49daaa1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,31 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup. This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
+ *
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -882,6 +902,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. See
+ * XLogSetFlags() for more details.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -894,7 +917,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -993,6 +1018,25 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the progress LSN positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is just more
+	 * simple to do that here where WAL record data and type is at hand.
+	 * The progress is set at the start position of the record tracked that
+	 * is being added, making easier checkpoint progress tracking as the
+	 * control file already saves the start LSN position of the last
+	 * checkpoint run. If an exclusive lock is taken for WAL insertion,
+	 * there is actually no need to update all the progression fields, so
+	 * just do it on the first one.
+	 */
+	if ((flags & XLOG_NO_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -4716,6 +4760,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7993,6 +8038,55 @@ GetFlushRecPtr(void)
 }
 
 /*
+ * GetProgressRecPtr -- Returns the newest WAL activity position, or in
+ * other words any activity not referring to standby logging or segment
+ * switches. Finding the last activity position is done by scanning each
+ * WAL insertion lock by taking directly the light-weight lock associated
+ * to it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
  * Get the time of the last xlog segment switch
  */
 pg_time_t
@@ -8252,7 +8346,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8333,35 +8427,33 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
+			if (log_checkpoints)
+				ereport(LOG, (errmsg("checkpoint skipped")));
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -9103,6 +9195,8 @@ RequestXLogSwitch(void)
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	XLogSetFlags(XLOG_NO_PROGRESS);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..5b0590c 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	status_flags = 0;
 	begininsert_called = false;
 }
 
@@ -384,13 +384,20 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 }
 
 /*
- * Should this record include the replication origin if one is set up?
+ * Set insert status flags for the upcoming WAL record.
+ *
+ * The flags that can be used here are:
+ * - XLOG_INCLUDE_ORIGIN, to determine if the replication origin should be
+ *   included in the record.
+ * - XLOG_NO_PROGRESS, to not update the WAL progress trackers when inserting
+ *   the record.
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+
+	status_flags = flags;
 }
 
 /*
@@ -450,7 +457,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, status_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +708,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((status_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index c3f3356..bb7740e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,23 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if enough time has passed, that some WAL activity
+			 * has happened since last checkpoint, and that some new WAL
+			 * records have been inserted since the last time we came here.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 397267c..e1c470c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
 
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
+static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
 
 /* Prototypes for private functions */
 
@@ -601,19 +602,36 @@ CheckArchiveTimeout(void)
 	/* Now we can do the real check */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * Switch segment only when WAL has done some progress since the
+		 * last time a segment has switched because of a timeout. Segment
+		 * switching because of other reasons, like manual trigerring of
+		 * pg_switch_xlog() as well as this automatic switch, will not
+		 * cause any progress in WAL.  Note that RequestXLogSwitch() may
+		 * return the beginning of a segment, which is fine to prevent
+		 * any unnecessary switches to happen.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetProgressRecPtr() > last_xlog_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			switchpoint = RequestXLogSwitch();
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+
+			/*
+			 * Save the segment switch LSN. This may refer to the beginning of
+			 * the next new segment in case of consecutive switches.
+			 */
+			last_xlog_switch_lsn = switchpoint;
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index fb887b3..7c1d9a5 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -964,7 +964,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -988,6 +989,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_NO_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_NO_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..338c796 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the replication origin */
+#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);

#15

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Michael Paquier (#14)

2 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Thanks for merging. It still applies on the current master with
some offsets.

At Wed, 5 Oct 2016 15:18:53 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqT4U=OSOLXuFuxMonmfdQFmd5F_0DmKoddvjG-HHWQaBA@mail.gmail.com>

(Squashing replies)

On Fri, Sep 30, 2016 at 6:13 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>

I don't see any problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially when

But I suspect that GetProgressRecPtr could be harmful.

Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
nproc and reducing checkpoint_timeout. That's what I did but..

Note: I am moving this patch to next CF.

And I am back on it more seriously... And I am taking back what I said upthread.

I looked at the v12 that Horiguchi-san has written, and that seems
correct to me. So I have squashed everything into a single patch,

Could you let me struggle a bit more to avoid LWLocks in
GetProgressRecPtr?

I considered two alternatives for the update logic of progressAt
more seriously. One is, as Amit suggested, updating progressAt
within the SpinLock section in
ReserveXLogInsertLocation. The other is using pg_atomic_uint64 for
progressAt. The two attached patches roughly implement these
alternatives, respectively. (I have not tested them, though. The
patches are only meant to show the two alternatives concretely.)

I found that the former requires taking insertpos_lck on reads as
well. I have to admit that this is too bad. (Even so, I saw no
degradation with pgbench in my poor environment. It reaches only
118 tr/s with 10 threads, and that does not seem to put any stress
on the xlog logic...)

The latter is free from locks and does not reduce the degree of
parallelism, but I'm not sure it is proper to use it there, and I'm
not sure about the actual overhead. In the worst case, it adds another
SpinLock section for every call to the pg_atomic_* functions.
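
As a rough illustration of the second alternative, here is a minimal standalone sketch written with C11 <stdatomic.h> instead of PostgreSQL's pg_atomic_* API. The names lsn_t, progress_at, advance_progress and get_progress are hypothetical and exist only for this example. It shows only the lock-free "advance to the maximum" idea, not the attached patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr; not the PostgreSQL type. */
typedef uint64_t lsn_t;

/*
 * Shared progress position. Static storage is zero-initialized, and 0
 * plays the role of "no progress recorded yet" in this sketch.
 */
static _Atomic lsn_t progress_at;

/*
 * Advance progress_at to start_pos unless it is already at or past it.
 * When the compare-exchange fails, it reloads `seen` with the value
 * currently stored, so the loop re-checks against what a concurrent
 * writer published and stops as soon as that value is large enough.
 */
static void
advance_progress(lsn_t start_pos)
{
	lsn_t		seen;

	assert(start_pos != 0);		/* 0 is reserved as "invalid" here */

	seen = atomic_load(&progress_at);
	while (seen < start_pos &&
		   !atomic_compare_exchange_weak(&progress_at, &seen, start_pos))
		;						/* retry with the refreshed `seen` */
}

/* Readers need no lock: a single atomic load cannot be torn. */
static lsn_t
get_progress(void)
{
	return atomic_load(&progress_at);
}
```

Calling advance_progress() with an older position leaves the stored value untouched, which is what makes the field safe to update from several inserters without ordering them.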

including the log entry that gets logged with log_checkpoints. Testing
with archive_timeout to 10s, checkpoint_timeout to 30s, sometimes
triggering manual activity with CREATE TABLE/whatever and manual
pg_switch_xlog(), I am able to see that checkpoints can be correctly
skipped or generated.

There was as well a compilation error with ereport(). Not sure how I
missed that... Likely too many patches handled these days.

I have also updated the description of archive_timeout, which said
that increasing checkpoint_timeout would reduce unnecessary checkpoints
on an idle system. With this patch, on an idle system checkpoints are
just skipped as they should be.

How does that look?

Everything except the above point looks good to me. Maybe it would be
better to put the XLOG_INCLUDE_ORIGIN stuff in a separate patch.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Alternative-Type-1.patch (text/x-patch; charset=us-ascii)

From e09147a91d20e3c86da26e311518f92aea193bc1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 1 Nov 2016 18:28:58 +0900
Subject: [PATCH] Alternative Type 1.

Maintain progressAt in the SpinLock(Insert->insertpos_lck) section in
ReserveXLogInsertLocation. Even if the additional code in that section
does no harm, GetProgressRecPtr also needs to take the same lock.
---
 src/backend/access/transam/xlog.c | 56 ++++++++++-----------------------------
 1 file changed, 14 insertions(+), 42 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1eff059..0703e5b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -466,7 +466,6 @@ typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
-	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -672,9 +671,13 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	uint64	progressAt;			/* This is not locked by info_lck but it is
+								 * here for just convenience. */
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
+
 static XLogCtlData *XLogCtl = NULL;
 
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
@@ -1021,25 +1024,6 @@ XLogInsertRecord(XLogRecData *rdata,
 		inserted = true;
 	}
 
-	/*
-	 * Update the progress LSN positions. At least one WAL insertion lock
-	 * is already taken appropriately before doing that, and it is just more
-	 * simple to do that here where WAL record data and type is at hand.
-	 * The progress is set at the start position of the record tracked that
-	 * is being added, making easier checkpoint progress tracking as the
-	 * control file already saves the start LSN position of the last
-	 * checkpoint run. If an exclusive lock is taken for WAL insertion,
-	 * there is actually no need to update all the progression fields, so
-	 * just do it on the first one.
-	 */
-	if ((flags & XLOG_NO_PROGRESS) == 0)
-	{
-		if (holdingAllLocks)
-			WALInsertLocks[0].l.progressAt = StartPos;
-		else
-			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
-	}
-
 	if (inserted)
 	{
 		/*
@@ -1195,6 +1179,7 @@ static void
 ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 						  XLogRecPtr *PrevPtr)
 {
+	volatile XLogCtlData *xlogctl = XLogCtl;
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint64		startbytepos;
 	uint64		endbytepos;
@@ -1222,6 +1207,8 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	prevbytepos = Insert->PrevBytePos;
 	Insert->CurrBytePos = endbytepos;
 	Insert->PrevBytePos = startbytepos;
+	if (xlogctl->progressAt < startbytepos)
+		xlogctl->progressAt = startbytepos;
 
 	SpinLockRelease(&Insert->insertpos_lck);
 
@@ -4763,7 +4750,6 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
-		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8050,29 +8036,15 @@ GetFlushRecPtr(void)
 XLogRecPtr
 GetProgressRecPtr(void)
 {
-	XLogRecPtr	res = InvalidXLogRecPtr;
-	int			i;
+	volatile XLogCtlData *xlogctl = XLogCtl;
+	uint64 bret;
 
-	/*
-	 * Look at the latest LSN position referring to the activity done by
-	 * WAL insertion. An exclusive lock is taken because currently the
-	 * locking logic for WAL insertion only expects such a level of locking.
-	 * Taking a lock is as well necessary to prevent potential torn reads
-	 * on some platforms.
-	 */
-	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
-	{
-		XLogRecPtr	progress_lsn;
-
-		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
-		progress_lsn = WALInsertLocks[i].l.progressAt;
-		LWLockRelease(&WALInsertLocks[i].l.lock);
-
-		if (res < progress_lsn)
-			res = progress_lsn;
-	}
+	/* XXXX: Taking insertpos_lck is too bad but it's necessary :( */
+	SpinLockAcquire(&xlogctl->Insert.insertpos_lck);
+	bret = xlogctl->progressAt;
+	SpinLockRelease(&xlogctl->Insert.insertpos_lck);
 
-	return res;
+	return XLogBytePosToRecPtr(bret);
 }
 
 /*
-- 
2.9.2

0001-Alternative-Type-2-atomic.patch (text/x-patch; charset=us-ascii)

From cc0167a03c35a5630983b5bed725913383b47ed4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 1 Nov 2016 18:28:58 +0900
Subject: [PATCH] Alternative Type 2 atomic.

Use atomic_u64 for holding progressAt.
---
 src/backend/access/transam/xlog.c | 42 +++++++++++----------------------------
 1 file changed, 12 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1eff059..bc78e9e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -466,7 +466,6 @@ typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
-	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -672,9 +671,13 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	pg_atomic_uint64	progressAt; /* This is not locked by info_lck but this
+									 * is here for just convenient now */
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
+
 static XLogCtlData *XLogCtl = NULL;
 
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
@@ -1021,23 +1024,13 @@ XLogInsertRecord(XLogRecData *rdata,
 		inserted = true;
 	}
 
-	/*
-	 * Update the progress LSN positions. At least one WAL insertion lock
-	 * is already taken appropriately before doing that, and it is just more
-	 * simple to do that here where WAL record data and type is at hand.
-	 * The progress is set at the start position of the record tracked that
-	 * is being added, making easier checkpoint progress tracking as the
-	 * control file already saves the start LSN position of the last
-	 * checkpoint run. If an exclusive lock is taken for WAL insertion,
-	 * there is actually no need to update all the progression fields, so
-	 * just do it on the first one.
-	 */
-	if ((flags & XLOG_NO_PROGRESS) == 0)
 	{
-		if (holdingAllLocks)
-			WALInsertLocks[0].l.progressAt = StartPos;
-		else
-			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+		volatile XLogCtlData *xlogctl = XLogCtl;
+		XLogRecPtr tmpstartpos = pg_atomic_read_u64(&xlogctl->progressAt);
+
+		while (tmpstartpos < StartPos)
+			pg_atomic_compare_exchange_u64(&xlogctl->progressAt,
+										   &tmpstartpos, StartPos);
 	}
 
 	if (inserted)
@@ -4763,7 +4756,6 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
-		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -8051,7 +8043,7 @@ XLogRecPtr
 GetProgressRecPtr(void)
 {
 	XLogRecPtr	res = InvalidXLogRecPtr;
-	int			i;
+	volatile XLogCtlData *xlogctl = XLogCtl;
 
 	/*
 	 * Look at the latest LSN position referring to the activity done by
@@ -8060,17 +8052,7 @@ GetProgressRecPtr(void)
 	 * Taking a lock is as well necessary to prevent potential torn reads
 	 * on some platforms.
 	 */
-	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
-	{
-		XLogRecPtr	progress_lsn;
-
-		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
-		progress_lsn = WALInsertLocks[i].l.progressAt;
-		LWLockRelease(&WALInsertLocks[i].l.lock);
-
-		if (res < progress_lsn)
-			res = progress_lsn;
-	}
+	res = pg_atomic_read_u64(&xlogctl->progressAt);
 
 	return res;
 }
-- 
2.9.2

#16

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#15)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Tue, Nov 1, 2016 at 8:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Could you let me struggle a bit more to avoid LWLocks in
GetProgressRecPtr?

Be my guest :)

I considered two alternatives for the update logic of progressAt
more seriously. One is, as Amit suggested, updating progressAt
within the SpinLock section in
ReserveXLogInsertLocation. The other is using pg_atomic_uint64 for
progressAt. The two attached patches roughly implement these
alternatives, respectively. (I have not tested them, though. The
patches are only meant to show the two alternatives concretely.)

Okay.

I found that the former requires taking insertpos_lck on reads as
well. I have to admit that this is too bad. (Even so, I saw no
degradation with pgbench in my poor environment. It reaches only
118 tr/s with 10 threads, and that does not seem to put any stress
on the xlog logic...)

Interesting...

The latter is free from locks and does not reduce the degree of
parallelism, but I'm not sure it is proper to use it there, and I'm
not sure about the actual overhead. In the worst case, it adds another
SpinLock section for every call to the pg_atomic_* functions.

The WAL insert slots were introduced in 9.4, and the PG atomics
in 9.5, so perhaps the first implementation of the WAL insert slots
might have used them had they been available. Still, that looks quite
promising. At the same time we may be able to do something for
insertingAt to make the locks held more consistent, and just remove
WALInsertLocks, even if this makes me wonder about torn reads and
about how we may break things if we rely
on something other than LW_EXCLUSIVE, unlike now. To keep things
simpler I would still favor using WALInsertLocks for this patch:
that looks more consistent, and there is no noticeable
difference.

Everything except the above point looks good to me. Maybe it would be
better to put the XLOG_INCLUDE_ORIGIN stuff in a separate patch.

I have kept that grouped, as XLOG_INCLUDE_ORIGIN is the only
XLogInsert flag present on HEAD. Could the patch be marked as "ready
for committer" then?
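
For readers following the flag discussion, here is a small sketch of how the two insert flags share one uint8 bitmask. The two macro values are copied from the xlog.h hunk of the patch; the helper functions wants_origin and updates_progress are hypothetical and only mirror the bit tests visible in XLogRecordAssemble and XLogInsertRecord.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the patch's xlog.h hunk. */
#define XLOG_INCLUDE_ORIGIN	0x01	/* include the replication origin */
#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */

/* The two flags must occupy distinct bits to be combinable. */
static_assert((XLOG_INCLUDE_ORIGIN & XLOG_NO_PROGRESS) == 0,
			  "flag bits must not overlap");

/* Hypothetical helper: same test as the one guarding the origin data. */
static bool
wants_origin(uint8_t flags)
{
	return (flags & XLOG_INCLUDE_ORIGIN) != 0;
}

/* Hypothetical helper: same test as the one guarding progressAt. */
static bool
updates_progress(uint8_t flags)
{
	return (flags & XLOG_NO_PROGRESS) == 0;
}
```

Note that XLogSetFlags() in the patch assigns its argument to status_flags rather than OR-ing it in, so a caller that wanted both behaviors would have to pass XLOG_INCLUDE_ORIGIN | XLOG_NO_PROGRESS in a single call.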
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Michael Paquier (#16)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hello,

At Tue, 8 Nov 2016 14:45:59 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqT-VW5gRbUJwQusmgiu2MKpZSCV-XdrHv84w8FZa286KQ@mail.gmail.com>

On Tue, Nov 1, 2016 at 8:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Could you let me struggle a bit more to avoid LWLocks in
GetProgressRecPtr?

Be my guest :)

I considered two alternatives for the update logic of progressAt
more seriously. One is, as Amit suggested, updating progressAt
within the SpinLock section in
ReserveXLogInsertLocation. The other is using pg_atomic_uint64 for
progressAt. The two attached patches roughly implement these
alternatives, respectively. (I have not tested them, though. The
patches are only meant to show the two alternatives concretely.)

Okay.

I found that the former requires taking insertpos_lck on reads as
well. I have to admit that this is too bad. (Even so, I saw no
degradation with pgbench in my poor environment. It reaches only
118 tr/s with 10 threads, and that does not seem to put any stress
on the xlog logic...)

Interesting...

The latter is free from locks and does not reduce the degree of
parallelism, but I'm not sure it is proper to use it there, and I'm
not sure about the actual overhead. In the worst case, it adds another
SpinLock section for every call to the pg_atomic_* functions.

The WAL insert slots were introduced in 9.4, and the PG atomics
in 9.5, so perhaps the first implementation of the WAL insert slots
might have used them had they been available. Still, that looks quite
promising. At the same time we may be able to do something for
insertingAt to make the locks held more consistent, and just remove
WALInsertLocks, even if this makes me wonder about torn reads and
about how we may break things if we rely

If I understand you correctly: atomics prevent torn reads by
definition, at the cache management and bus arbitration level. They
should be faster than LWLocks but, as I said in the previous mail,
on some platforms, if any, they fall back to individual
spinlocks (see atomics.c).

on something other than LW_EXCLUSIVE, unlike now. To keep things
simpler I would still favor using WALInsertLocks for this patch:
that looks more consistent, and there is no noticeable
difference.

OK, the patch looks fine. So there is nothing for me to do but accept
the current shape, since the discussion about performance cannot be
settled without performance measurements on machines with many
cores.

Everything except the above point looks good to me. Maybe it would be
better to put the XLOG_INCLUDE_ORIGIN stuff in a separate patch.

I have kept that grouped, as XLOG_INCLUDE_ORIGIN is the only
XLogInsert flag present on HEAD. Could the patch be marked as "ready
for committer" then?

OK, that is not such a significant point. Well, I'd like to wait a
couple of days in case anyone wants to comment, then mark this
'ready for committer'.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#18

David Steele

david@pgmasters.net

about 9 years ago

In reply to: Michael Paquier (#14)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 10/5/16 7:18 AM, Michael Paquier wrote:

Note: I am moving this patch to next CF.

And I am back on it more seriously... And I am taking back what I said upthread.

I looked at the v12 that Horiguchi-san has written, and that seems
correct to me. So I have squashed everything into a single patch,
including the log entry that gets logged with log_checkpoints. Testing
with archive_timeout to 10s, checkpoint_timeout to 30s, sometimes
triggering manual activity with CREATE TABLE/whatever and manual
pg_switch_xlog(), I am able to see that checkpoints can be correctly
skipped or generated.

There was as well a compilation error with ereport(). Not sure how I
missed that... Likely too many patches handled these days.

I have also updated the description of archive_timeout, which said
that increasing checkpoint_timeout would reduce unnecessary checkpoints
on an idle system. With this patch, on an idle system checkpoints are
just skipped as they should be.

How does that look?

This looks much better now and exhibits exactly the behavior that I
expect.
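
For reference, the skip decision being reviewed can be condensed into a few lines. This is only a sketch of the condition in the CreateCheckPoint hunk: the SKETCH_* flag values and the function checkpoint_can_be_skipped are hypothetical placeholders, not PostgreSQL definitions, and lsn_t stands in for XLogRecPtr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;			/* hypothetical stand-in for XLogRecPtr */

/* Placeholder flag bits for this sketch only. */
#define SKETCH_IS_SHUTDOWN		0x01
#define SKETCH_END_OF_RECOVERY	0x02
#define SKETCH_FORCE			0x04

static_assert((SKETCH_IS_SHUTDOWN & SKETCH_END_OF_RECOVERY) == 0 &&
			  (SKETCH_IS_SHUTDOWN & SKETCH_FORCE) == 0 &&
			  (SKETCH_END_OF_RECOVERY & SKETCH_FORCE) == 0,
			  "sketch flag bits must not overlap");

/*
 * A checkpoint may be skipped only if it is not a shutdown,
 * end-of-recovery or forced checkpoint, and the newest "real" WAL
 * activity is still the previous checkpoint record itself.
 */
static bool
checkpoint_can_be_skipped(int flags, lsn_t progress_lsn,
						  lsn_t last_checkpoint_lsn)
{
	if (flags & (SKETCH_IS_SHUTDOWN | SKETCH_END_OF_RECOVERY | SKETCH_FORCE))
		return false;

	return progress_lsn == last_checkpoint_lsn;
}
```

The equality test works because the patch records progress at the start position of each tracked record, and the control file likewise stores the start LSN of the last checkpoint record. Records flagged XLOG_NO_PROGRESS, such as standby snapshots and segment switches, leave the two values equal.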

In theory it would be nice if the checkpoint records did not cause
rotation, but this can be mitigated in the way you have described and
perhaps for safety's sake it's best.
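
The rotation behavior mentioned here follows from the gate added to CheckArchiveTimeout(). A minimal sketch of that gate, assuming hypothetical names (switch_state, maybe_switch) and with the switch position passed in rather than obtained from RequestXLogSwitch():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;			/* hypothetical stand-in for XLogRecPtr */

typedef struct
{
	lsn_t		last_switch_lsn;	/* LSN returned by the last timed switch */
} switch_state;

/*
 * Called once archive_timeout has elapsed. A switch happens only when
 * tracked WAL activity has moved past the previous switch position.
 * Returns true if a switch was performed, and remembers switchpoint
 * (the value the real code would get back from the switch request).
 */
static bool
maybe_switch(switch_state *st, lsn_t progress_lsn, lsn_t switchpoint)
{
	assert(st != NULL);

	if (progress_lsn > st->last_switch_lsn)
	{
		st->last_switch_lsn = switchpoint;
		return true;
	}
	return false;
}
```

Since a checkpoint record is not flagged XLOG_NO_PROGRESS in the patch, a checkpoint written after the last switch moves the progress LSN forward and therefore lets the next timed switch go ahead, which is the rotation described above.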

I had a bit of trouble parsing this paragraph:

+	/*
+	 * Update the progress LSN positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is just more
+	 * simple to do that here where WAL record data and type is at hand.
+	 * The progress is set at the start position of the record tracked that
+	 * is being added, making easier checkpoint progress tracking as the
+	 * control file already saves the start LSN position of the last
+	 * checkpoint run. If an exclusive lock is taken for WAL insertion,
+	 * there is actually no need to update all the progression fields, so

So I did a little reworking:

Update the LSN progress positions. At least one WAL insertion lock is
already taken appropriately before doing that, and it is simpler to do
that here when the WAL record data and type are at hand. Progress is set
at the start position of the tracked record that is being added, making
checkpoint progress tracking easier as the control file already saves
the start LSN position of the last checkpoint. If an exclusive lock is
taken for WAL insertion there is no need to update all the progress
fields, only the first one.

If that still says what you think it should, then I believe it is
clearer. Also:

+		 * last time a segment has switched because of a timeout. Segment
+		 * switching because of other reasons, like manual trigerring of

typo, should be "triggering".

I don't see any further issues with this patch unless there are
performance concerns about the locks taken in GetProgressRecPtr(). The
locks seem reasonable to me but I'd like to see this committed so
there's plenty of time to detect any regression before 10.0.
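
For reference, the lookup being discussed can be modeled as below. This is a single-threaded sketch under stated assumptions, not the patch itself: progress_at[], note_insert(), get_progress() and NUM_SLOTS are hypothetical stand-ins, and all locking is omitted. In the patch each slot's progressAt is read while holding that slot's lock in exclusive mode, one slot at a time, which is where any performance concern would come from.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the PostgreSQL type */
#define InvalidXLogRecPtr	((XLogRecPtr) 0)

#define NUM_SLOTS			8		/* hypothetical; models NUM_XLOGINSERT_LOCKS */
#define XLOG_INCLUDE_ORIGIN	0x01	/* flag values as defined in the patch */
#define XLOG_NO_PROGRESS	0x02

/* One progress position per insertion slot, initially invalid. */
static XLogRecPtr progress_at[NUM_SLOTS];

/* Note a record starting at start_pos, inserted under the given slot. */
static void
note_insert(int slot, XLogRecPtr start_pos, uint8_t flags)
{
	assert(slot >= 0 && slot < NUM_SLOTS);
	if ((flags & XLOG_NO_PROGRESS) == 0)
		progress_at[slot] = start_pos;
}

/* Return the newest tracked position across all slots. */
static XLogRecPtr
get_progress(void)
{
	XLogRecPtr	res = InvalidXLogRecPtr;

	for (int i = 0; i < NUM_SLOTS; i++)
	{
		if (res < progress_at[i])
			res = progress_at[i];
	}
	return res;
}
```

A record flagged XLOG_NO_PROGRESS, such as a standby snapshot, leaves every slot untouched, so get_progress() keeps returning the position of the last tracked record.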

As such, my vote is to mark this "Ready for Committer." I'm fine with
waiting a few days as Kyotaro suggested, or we can consider my review
"additional comments" and do it now.

--
-David
david@pgmasters.net


#19

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: David Steele (#18)

1 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Tue, Nov 8, 2016 at 9:32 PM, David Steele <david@pgmasters.net> wrote:

I had a bit of trouble parsing this paragraph:

[...]

So I did a little reworking:

[...]

If that still says what you think it should, then I believe it is clearer.

Thanks! I have included your suggestion.

Also:

+                * last time a segment has switched because of a timeout. Segment
+                * switching because of other reasons, like manual trigerring of

typo, should be "triggering".

Right.

I don't see any further issues with this patch unless there are performance
concerns about the locks taken in GetProgressRecPtr(). The locks seem
reasonable to me but I'd like to see this committed so there's plenty of
time to detect any regression before 10.0.

As such, my vote is to mark this "Ready for Committer." I'm fine with
waiting a few days as Kyotaro suggested, or we can consider my review
"additional comments" and do it now.

Thanks for the review! Waiting for a couple of days more is fine for
me. This won't change much. Attached is v15 with the fixes you
mentioned.
--
Michael

Attachments:

hs-checkpoints-v15.patch (text/plain; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..38c2385 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2826,12 +2826,9 @@ include_dir 'conf.d'
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
+        including a single checkpoint.  Note that archived files that are
+        closed early due to a forced switch are still the same length as
+        completely full files.  Therefore, it is unwise to use a very short
         <varname>archive_timeout</> &mdash; it will bloat your archive
         storage.  <varname>archive_timeout</> settings of a minute or so are
         usually reasonable.  You should consider using streaming replication,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..37ecf9c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,31 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup. This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
+ *
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -885,6 +905,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. See
+ * XLogSetFlags() for more details.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -897,7 +920,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the LSN progress positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is simpler
+	 * to do that here when the WAL record data and type are at hand.
+	 * Progress is set at the start position of the tracked record that is
+	 * being added, making checkpoint progress tracking easier as the control
+	 * file already saves the start LSN position of the last checkpoint. If
+	 * an exclusive lock is taken for WAL insertion there is no need to
+	 * update all the progress fields, only the first one.
+	 */
+	if ((flags & XLOG_NO_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -4720,6 +4763,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7999,6 +8043,55 @@ GetFlushRecPtr(void)
 }
 
 /*
+ * GetProgressRecPtr -- Returns the newest WAL activity position, or in
+ * other words any activity not referring to standby logging or segment
+ * switches. Finding the last activity position is done by scanning each
+ * WAL insertion lock by taking directly the light-weight lock associated
+ * to it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
  * Get the time of the last xlog segment switch
  */
 pg_time_t
@@ -8258,7 +8351,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8339,35 +8432,33 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
+			if (log_checkpoints)
+				ereport(LOG, (errmsg("checkpoint skipped")));
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -9133,6 +9224,8 @@ RequestXLogSwitch(void)
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	XLogSetFlags(XLOG_NO_PROGRESS);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..5b0590c 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	status_flags = 0;
 	begininsert_called = false;
 }
 
@@ -384,13 +384,20 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 }
 
 /*
- * Should this record include the replication origin if one is set up?
+ * Set insert status flags for the upcoming WAL record.
+ *
+ * The flags that can be used here are:
+ * - XLOG_INCLUDE_ORIGIN, to determine if the replication origin should be
+ *   included in the record.
+ * - XLOG_NO_PROGRESS, to not update the WAL progress trackers when inserting
+ *   the record.
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+
+	status_flags = flags;
 }
 
 /*
@@ -450,7 +457,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, status_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +708,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((status_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index c3f3356..bb7740e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,23 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if enough time has passed, that some WAL activity
+			 * has happened since last checkpoint, and that some new WAL
+			 * records have been inserted since the last time we came here.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 397267c..7ecc00e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
 
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
+static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
 
 /* Prototypes for private functions */
 
@@ -601,19 +602,36 @@ CheckArchiveTimeout(void)
 	/* Now we can do the real check */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * Switch segment only when WAL has done some progress since the
+		 * last time a segment has switched because of a timeout. Segment
+		 * switching because of other reasons, like manual triggering of
+		 * pg_switch_xlog() as well as this automatic switch, will not
+		 * cause any progress in WAL.  Note that RequestXLogSwitch() may
+		 * return the beginning of a segment, which is fine to prevent
+		 * any unnecessary switches to happen.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetProgressRecPtr() > last_xlog_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			switchpoint = RequestXLogSwitch();
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+
+			/*
+			 * Save the segment switch LSN. This may refer to the beginning of
+			 * the next new segment in case of consecutive switches.
+			 */
+			last_xlog_switch_lsn = switchpoint;
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 875dcec..adc4e1d 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -964,7 +964,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -988,6 +989,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_NO_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_NO_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..338c796 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the replication origin */
+#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);

#20

Stephen Frost

sfrost@snowman.net

about 9 years ago

In reply to: Michael Paquier (#19)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Michael,

* Michael Paquier (michael.paquier@gmail.com) wrote:

Thanks for the review! Waiting for a couple of days more is fine for
me. This won't change much. Attached is v15 with the fixes you
mentioned.

I figured I'd go ahead and start looking into this (and it's pretty easy
for me to discuss it with David, given he works in the same office ;).

A couple initial comments:

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..38c2385 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2826,12 +2826,9 @@ include_dir 'conf.d'
parameter is greater than zero, the server will switch to a new
segment file whenever this many seconds have elapsed since the last
segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
+        including a single checkpoint.  Note that archived files that are
+        closed early due to a forced switch are still the same length as
+        completely full files.  Therefore, it is unwise to use a very short
<varname>archive_timeout</> &mdash; it will bloat your archive
storage.  <varname>archive_timeout</> settings of a minute or so are
usually reasonable.  You should consider using streaming replication,

We should probably include in here that we may skip a checkpoint if no
activity has happened, meaning that this is a safe setting to set for
environments which are idle for long periods (I'm thinking embedded
systems here).
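
The skip condition that such a documentation note would describe can be sketched as follows. This is a simplified model of the test in the v15 patch, not the patch code: checkpoint_can_be_skipped() is a hypothetical helper, the CHECKPOINT_* values other than CHECKPOINT_CAUSE_TIME are given for illustration only, and the assertion reflects the assumption that a running cluster always has a valid last checkpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the PostgreSQL type */

/* Illustrative flag values; only CAUSE_TIME appears in the quoted diff. */
#define CHECKPOINT_IS_SHUTDOWN		0x0001
#define CHECKPOINT_END_OF_RECOVERY	0x0002
#define CHECKPOINT_FORCE			0x0008
#define CHECKPOINT_CAUSE_TIME		0x0080

/*
 * A timed checkpoint is skipped when the newest progress-tracking record
 * is the previous checkpoint record itself, i.e. nothing relevant has been
 * inserted since. Shutdown, end-of-recovery and forced checkpoints always run.
 */
static bool
checkpoint_can_be_skipped(int flags,
						  XLogRecPtr progress_lsn,
						  XLogRecPtr last_checkpoint_lsn)
{
	assert(last_checkpoint_lsn != 0);

	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
				  CHECKPOINT_FORCE)) != 0)
		return false;

	return progress_lsn == last_checkpoint_lsn;
}
```

Standby snapshots and segment switches do not move the progress LSN in the patch, so on an idle system the two positions stay equal and every timed checkpoint takes the skip path.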

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c

[...]

+			if (log_checkpoints)
+				ereport(LOG, (errmsg("checkpoint skipped")));

Do we really need to log that we're skipping a checkpoint..? As the
point of this is to avoid write activity on a system which is idle, it
doesn't make sense to me to add a new cause for writes to happen when
we're idle.

Thanks!

Stephen

#21

David Steele

david@pgmasters.net

about 9 years ago

In reply to: Stephen Frost (#20)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 11/10/16 10:28 AM, Stephen Frost wrote:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
[...]
+			if (log_checkpoints)
+				ereport(LOG, (errmsg("checkpoint skipped")));
Do we really need to log that we're skipping a checkpoint..? As the
point of this is to avoid write activity on a system which is idle, it
doesn't make sense to me to add a new cause for writes to happen when
we're idle.

log_checkpoints is not enabled by default, though, so if the user does
enable it don't you think they would want to know when checkpoints
*don't* happen?

Or are you thinking the main use of this logging is to determine when
checkpoints are too close together and so skipped checkpoints aren't
very important?

Thanks,
--
-David
david@pgmasters.net

#22

Joshua D. Drake

jd@commandprompt.com

about 9 years ago

In reply to: David Steele (#21)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 11/10/2016 09:33 AM, David Steele wrote:

On 11/10/16 10:28 AM, Stephen Frost wrote:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
[...]
+			if (log_checkpoints)
+				ereport(LOG, (errmsg("checkpoint skipped")));
Do we really need to log that we're skipping a checkpoint..? As the
point of this is to avoid write activity on a system which is idle, it
doesn't make sense to me to add a new cause for writes to happen when
we're idle.
log_checkpoints is not enabled by default, though, so if the user does
enable it don't you think they would want to know when checkpoints
*don't* happen?

Yes but I don't know that it needs to be anywhere below DEBUG2 (vs
log_checkpoints).

Sincerely,

--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.


#23

Stephen Frost

sfrost@snowman.net

about 9 years ago

In reply to: Joshua D. Drake (#22)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Thursday, November 10, 2016, Joshua D. Drake <jd@commandprompt.com>
wrote:

On 11/10/2016 09:33 AM, David Steele wrote:
On 11/10/16 10:28 AM, Stephen Frost wrote:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c

[...]
+                       if (log_checkpoints)
+                               ereport(LOG, (errmsg("checkpoint skipped")));
Do we really need to log that we're skipping a checkpoint..? As the
point of this is to avoid write activity on a system which is idle, it
doesn't make sense to me to add a new cause for writes to happen when
we're idle.
log_checkpoints is not enabled by default, though, so if the user does
enable it don't you think they would want to know when checkpoints
*don't* happen?
Yes but I don't know that it needs to be anywhere below DEBUG2 (vs
log_checkpoints).

Agreed. You certainly may wish to log checkpoints, even on an embedded or
low I/o system, but logging that nothing is happening doesn't seem useful
except perhaps for debugging.

Thanks!

Stephen

#24

David Steele

david@pgmasters.net

about 9 years ago

In reply to: Stephen Frost (#23)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 11/10/16 1:03 PM, Stephen Frost wrote:

On Thursday, November 10, 2016, Joshua D. Drake <jd@commandprompt.com> wrote:

On 11/10/2016 09:33 AM, David Steele wrote:

On 11/10/16 10:28 AM, Stephen Frost wrote:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
[...]
+                       if (log_checkpoints)
+                               ereport(LOG, (errmsg("checkpoint skipped")));
Do we really need to log that we're skipping a checkpoint..? As the
point of this is to avoid write activity on a system which is idle, it
doesn't make sense to me to add a new cause for writes to happen when
we're idle.

log_checkpoints is not enabled by default, though, so if the user does
enable it don't you think they would want to know when checkpoints
*don't* happen?

Yes but I don't know that it needs to be anywhere below DEBUG2 (vs
log_checkpoints).

Agreed. You certainly may wish to log checkpoints, even on an embedded
or low I/o system, but logging that nothing is happening doesn't seem
useful except perhaps for debugging.

Sure, DEBUG1 or DEBUG2 makes sense.

--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: David Steele (#24)

1 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Fri, Nov 11, 2016 at 12:28 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Michael Paquier (michael.paquier@gmail.com) wrote:

Thanks for the review! Waiting for a couple of days more is fine for
me. This won't change much. Attached is v15 with the fixes you
mentioned.

I figured I'd go ahead and start looking into this (and it's pretty easy
for me to discuss it with David, given he works in the same office ;).

Thanks!

A couple initial comments:

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..38c2385 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2826,12 +2826,9 @@ include_dir 'conf.d'
parameter is greater than zero, the server will switch to a new
segment file whenever this many seconds have elapsed since the last
segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
+        including a single checkpoint.  Note that archived files that are
+        closed early due to a forced switch are still the same length as
+        completely full files.  Therefore, it is unwise to use a very short
<varname>archive_timeout</> &mdash; it will bloat your archive
storage.  <varname>archive_timeout</> settings of a minute or so are
usually reasonable.  You should consider using streaming replication,

We should probably include in here that we may skip a checkpoint if no
activity has happened, meaning that this is a safe setting to set for
environments which are idle for long periods.

OK, here is the interesting bit I just updated (I cut the diff a bit
as the rest is just reformatting):
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
[...]
+        including a single checkpoint.  Checkpoints can however be skipped
+        if there is no database activity, making this parameter a safe
+        setting for environments which are idle for a long period of time.

(I'm thinking embedded systems here).

(Those are most of my users :{) ).

On Fri, Nov 11, 2016 at 3:23 AM, David Steele <david@pgmasters.net> wrote:

On 11/10/16 1:03 PM, Stephen Frost wrote:

Agreed. You certainly may wish to log checkpoints, even on an embedded
or low I/O system, but logging that nothing is happening doesn't seem
useful except perhaps for debugging.

Sure, DEBUG1 or DEBUG2 makes sense.

OK. LOG was useful to avoid noise when debugging the thing, but DEBUG1
is fine for me as well in the final version.
--
Michael

Attachments:

hs-checkpoints-v16.patch (text/x-patch; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..d2a8ec2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2826,17 +2826,16 @@ include_dir 'conf.d'
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
-        <varname>archive_timeout</> &mdash; it will bloat your archive
-        storage.  <varname>archive_timeout</> settings of a minute or so are
-        usually reasonable.  You should consider using streaming replication,
-        instead of archiving, if you want data to be copied off the master
-        server more quickly than that.
+        including a single checkpoint.  Checkpoints can however be skipped
+        if there is no database activity, making this parameter a safe
+        setting for environments which are idle for a long period of time.
+        Note that archived files that are closed early due to a forced switch
+        are still the same length as completely full files.  Therefore, it is
+        unwise to use a very short <varname>archive_timeout</> &mdash; it will
+        bloat your archive storage.  <varname>archive_timeout</> settings of
+        a minute or so are usually reasonable.  You should consider using
+        streaming replication, instead of archiving, if you want data to
+        be copied off the master server more quickly than that.
         This parameter can only be set in the
         <filename>postgresql.conf</> file or on the server command line.
        </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..a676307 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,31 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup. This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
+ *
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -885,6 +905,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. See
+ * XLogSetFlags() for more details.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -897,7 +920,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the LSN progress positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is simpler
+	 * to do that here when the WAL record data and type are at hand.
+	 * Progress is set at the start position of the tracked record that is
+	 * being added, making checkpoint progress tracking easier as the control
+	 * file already saves the start LSN position of the last checkpoint. If
+	 * an exclusive lock is taken for WAL insertion there is no need to
+	 * update all the progress fields, only the first one.
+	 */
+	if ((flags & XLOG_NO_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -4720,6 +4763,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7999,6 +8043,55 @@ GetFlushRecPtr(void)
 }
 
 /*
+ * GetProgressRecPtr -- Returns the newest WAL activity position, or in
+ * other words any activity not referring to standby logging or segment
+ * switches. Finding the last activity position is done by scanning each
+ * WAL insertion lock by taking directly the light-weight lock associated
+ * to it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
  * Get the time of the last xlog segment switch
  */
 pg_time_t
@@ -8258,7 +8351,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8339,35 +8432,32 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
+			ereport(DEBUG1, (errmsg("checkpoint skipped")));
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -9133,6 +9223,8 @@ RequestXLogSwitch(void)
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	XLogSetFlags(XLOG_NO_PROGRESS);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..5b0590c 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	status_flags = 0;
 	begininsert_called = false;
 }
 
@@ -384,13 +384,20 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 }
 
 /*
- * Should this record include the replication origin if one is set up?
+ * Set insert status flags for the upcoming WAL record.
+ *
+ * The flags that can be used here are:
+ * - XLOG_INCLUDE_ORIGIN, to determine if the replication origin should be
+ *   included in the record.
+ * - XLOG_NO_PROGRESS, to not update the WAL progress trackers when inserting
+ *   the record.
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+
+	status_flags = flags;
 }
 
 /*
@@ -450,7 +457,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, status_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +708,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((status_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index c3f3356..bb7740e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,23 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if enough time has passed, that some WAL activity
+			 * has happened since last checkpoint, and that some new WAL
+			 * records have been inserted since the last time we came here.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 397267c..7ecc00e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
 
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
+static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
 
 /* Prototypes for private functions */
 
@@ -601,19 +602,36 @@ CheckArchiveTimeout(void)
 	/* Now we can do the real check */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * Switch segment only when WAL has done some progress since the
+		 * last time a segment has switched because of a timeout. Segment
+		 * switching because of other reasons, like manual triggering of
+		 * pg_switch_xlog() as well as this automatic switch, will not
+		 * cause any progress in WAL.  Note that RequestXLogSwitch() may
+		 * return the beginning of a segment, which is fine to prevent
+		 * any unnecessary switches to happen.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetProgressRecPtr() > last_xlog_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			switchpoint = RequestXLogSwitch();
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+
+			/*
+			 * Save the segment switch LSN. This may refer to the beginning of
+			 * the next new segment in case of consecutive switches.
+			 */
+			last_xlog_switch_lsn = switchpoint;
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 875dcec..adc4e1d 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -964,7 +964,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -988,6 +989,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_NO_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_NO_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..338c796 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the replication origin */
+#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);
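
The skip logic in the patch above can be pictured with a small single-process model. This is only a sketch under stated assumptions: every name prefixed with `Toy`/`toy_` is hypothetical, LSNs are plain 64-bit integers, there is no locking, and record lengths are arbitrary. It mirrors three ideas from the patch: each insertion slot remembers the start LSN of its last tracked record, the progress position is the maximum over all slots, and a checkpoint is skipped when that maximum still equals the start LSN of the last checkpoint record.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the patch's types and constants (not PostgreSQL code). */
typedef uint64_t ToyLsn;

#define TOY_INVALID_LSN  ((ToyLsn) 0)
#define TOY_NO_PROGRESS  0x02   /* same bit value as XLOG_NO_PROGRESS in the patch */
#define TOY_NUM_LOCKS    8      /* arbitrary slot count for the model */

typedef struct
{
    ToyLsn progress_at[TOY_NUM_LOCKS]; /* one progress slot per insertion lock */
    ToyLsn insert_pos;                 /* next free WAL position */
    ToyLsn last_checkpoint;            /* start LSN of the last checkpoint record */
} ToyWal;

static void
toy_init(ToyWal *wal, ToyLsn start)
{
    for (int i = 0; i < TOY_NUM_LOCKS; i++)
        wal->progress_at[i] = TOY_INVALID_LSN;
    wal->insert_pos = start;
    wal->last_checkpoint = TOY_INVALID_LSN;
}

/* Insert a record of 'len' bytes through slot 'lockno'; returns its start LSN. */
static ToyLsn
toy_insert(ToyWal *wal, int lockno, uint32_t len, uint8_t flags)
{
    ToyLsn start = wal->insert_pos;

    wal->insert_pos += len;
    /* Records flagged NO_PROGRESS consume WAL space but leave progress alone. */
    if ((flags & TOY_NO_PROGRESS) == 0)
        wal->progress_at[lockno] = start;
    return start;
}

/* Newest tracked position: the maximum over all slots. */
static ToyLsn
toy_get_progress(const ToyWal *wal)
{
    ToyLsn res = TOY_INVALID_LSN;

    for (int i = 0; i < TOY_NUM_LOCKS; i++)
        if (res < wal->progress_at[i])
            res = wal->progress_at[i];
    return res;
}

/* Returns true if a checkpoint record was written, false if it was skipped. */
static bool
toy_checkpoint(ToyWal *wal)
{
    if (toy_get_progress(wal) == wal->last_checkpoint)
        return false;           /* nothing tracked since the last checkpoint */

    /* The checkpoint record itself is tracked, as in the patch. */
    wal->last_checkpoint = toy_insert(wal, 0, 64, 0);
    return true;
}
```

In this model, a record inserted with `TOY_NO_PROGRESS` (standing in for a standby snapshot or a segment switch) between two checkpoints does not prevent the second one from being skipped, while any other record does.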

#26

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Michael Paquier (#25)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Thank you for the new patch.

At Fri, 11 Nov 2016 16:42:43 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqR73vusv5kQgZzket5mLZLeEcgNF-3hKh7061QtcZiuVw@mail.gmail.com>

On Fri, Nov 11, 2016 at 12:28 AM, Stephen Frost <sfrost@snowman.net> wrote:

We should probably include in here that we may skip a checkpoint if no
activity has happened, meaning that this is a safe setting to set for
environments which are idle for long periods.
OK, here is the interesting bit I just updated (I cut the diff a bit
as the rest is just reformatting):
parameter is greater than zero, the server will switch to a new
segment file whenever this many seconds have elapsed since the last
segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
[...]
+        including a single checkpoint.  Checkpoints can however be skipped
+        if there is no database activity, making this parameter a safe
+        setting for environments which are idle for a long period of time.
(I'm thinking embedded systems here).

(Those are most of my users :{) ).

OK (FWIW), it seems fine to me.

On Fri, Nov 11, 2016 at 3:23 AM, David Steele <david@pgmasters.net> wrote:

On 11/10/16 1:03 PM, Stephen Frost wrote:

Agreed. You certainly may wish to log checkpoints, even on an embedded
or low I/O system, but logging that nothing is happening doesn't seem
useful except perhaps for debugging.

Sure, DEBUG1 or DEBUG2 makes sense.

OK. LOG was useful to avoid noise when debugging the thing, but DEBUG1
is fine for me as well in the final version.

Agreed. DEBUG2 seems too deep for it.

Well, I think we have had the final comment and it has been addressed,
so I will mark this as ready for committer soon.

Thank you all.

--
Kyotaro Horiguchi
NTT Open Source Software Center


#27

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#17)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Tue, Nov 8, 2016 at 5:18 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello,

on something else than LW_EXCLUSIVE compared to now. To keep things
more simple I would still favor using WALInsertLocks for this patch,
that looks more consistent, and also because there is no noticeable
difference.

Ok, the patch looks fine. So there's nothing for me but to accept
the current shape since the discussion about performance seems
not to be settled without performance measurements on machines
with many cores.

I think it would be good to check the performance impact of this patch on
a many-core machine. Could you check with Alexander Korotkov to see
whether he can give you access to his powerful machine, which
has 70 cores (if I remember correctly)?

@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks,
xl_standby_lock *locks)
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+ XLogSetFlags(XLOG_NO_PROGRESS);

Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
This function is called not only in LogStandbySnapshot(), but during
DDL operations as well where lockmode >= AccessExclusiveLock.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#28

Andres Freund

andres@anarazel.de

about 9 years ago

In reply to: Michael Paquier (#25)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hi,

On 2016-11-11 16:42:43 +0900, Michael Paquier wrote:

+ * This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.

Something residing on the same cache line doesn't provide that guarantee
on all platforms.
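
For illustration only: the patch sidesteps torn reads by taking each insertion lock inside GetProgressRecPtr(). The sketch here shows the general alternative of making the 64-bit value itself atomic, using standard C11 `<stdatomic.h>`. This is not what the patch does, and PostgreSQL has its own atomics abstraction rather than using `<stdatomic.h>` directly; the struct and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical progress slot whose 64-bit value is read and written atomically. */
typedef struct
{
    _Atomic uint64_t progress_at;
} ToyProgressSlot;

/* Writer side: publish a new progress position in one indivisible store. */
static void
toy_progress_store(ToyProgressSlot *slot, uint64_t lsn)
{
    atomic_store_explicit(&slot->progress_at, lsn, memory_order_release);
}

/* Reader side: the load never observes half of an old and half of a new value. */
static uint64_t
toy_progress_load(ToyProgressSlot *slot)
{
    return atomic_load_explicit(&slot->progress_at, memory_order_acquire);
}
```

On platforms without native 64-bit atomic loads and stores, a C11 implementation may fall back to an internal lock, so the cost is not necessarily lower than taking the insertion lock explicitly.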

+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.

Commit records still have to be written, everything else doesn't write
WAL. So I'm doubtful this matters much?

@@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
inserted = true;
}
+	/*
+	 * Update the LSN progress positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is simpler
+	 * to do that here when the WAL record data and type are at hand.

But we don't use the "WAL record data and type"?

+ * GetProgressRecPtr -- Returns the newest WAL activity position, or in
+ * other words any activity not referring to standby logging or segment
+ * switches. Finding the last activity position is done by scanning each
+ * WAL insertion lock by taking directly the light-weight lock associated
+ * to it.
+ */

I'd prefer not to list the specific records here - that's just
guaranteed to get out of date. Why not say something like "any activity not
requiring a checkpoint to be triggered" or such?

+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.

Shouldn't this mention archiving, and that we also ignore some forms
of WAL activity?

-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;

that seems a bit too generic a name. 'curinsert_flags'?
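
To illustrate the idea under review: a single flags byte replaces the old boolean, is set once per record, and is cleared when the insertion is reset. This is only a minimal standalone sketch, not the patch's code. The `sketch_` names and the bit values are assumptions made for illustration, since the quoted hunks do not show the real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the quoted hunks do not show the real ones. */
#define SKETCH_INCLUDE_ORIGIN 0x01  /* log the replication origin */
#define SKETCH_SKIP_PROGRESS  0x02  /* do not advance the progress LSN */

/* Flags of the insertion currently being assembled. */
static uint8_t curinsert_flags = 0;
static bool    begininsert_called = false;

/* Start assembling a record. */
void sketch_begin_insert(void)
{
    begininsert_called = true;
}

/* Set the flags for the record being assembled (overwrites earlier flags). */
void sketch_set_flags(uint8_t flags)
{
    assert(begininsert_called);
    curinsert_flags = flags;
}

/* True if the record should carry the replication origin. */
bool sketch_wants_origin(void)
{
    return (curinsert_flags & SKETCH_INCLUDE_ORIGIN) != 0;
}

/* True if inserting the record should advance the progress LSN. */
bool sketch_updates_progress(void)
{
    return (curinsert_flags & SKETCH_SKIP_PROGRESS) == 0;
}

/* Clear per-record state once the record has been inserted. */
void sketch_reset_insertion(void)
{
    curinsert_flags = 0;
    begininsert_called = false;
}
```

Because the setter assigns rather than ORs, a caller that wants both behaviours has to pass both bits in one call.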

/*

@@ -317,19 +317,23 @@ BackgroundWriterMain(void)
{
TimestampTz timeout = 0;
TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();

timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
LOG_SNAPSHOT_INTERVAL_MS);

/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if enough time has passed, that some WAL activity
+			 * has happened since last checkpoint, and that some new WAL
+			 * records have been inserted since the last time we came here.

I think that sentence needs some polish.

*/
if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
{

Hm. I don't think it's correct to use GetLastCheckpointRecPtr() here?
Don't we need to do the comparisons here (and when doing the checkpoint
itself) with the REDO pointer of the last checkpoint?

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 397267c..7ecc00e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
static pg_time_t last_checkpoint_time;
static pg_time_t last_xlog_switch_time;
+static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;

Hm. Is it a good idea to use a static for this? Did you consider
checkpointer restarts?

Greetings,

Andres Freund


#29

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Amit Kapila (#27)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hello, thank you for the comment.

At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1K0gGQTBxCyKqi6QnqOWGzEoVVPHCgPJ_RkOBoLPejCTA@mail.gmail.com>

On Tue, Nov 8, 2016 at 5:18 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello,

on something else than LW_EXCLUSIVE compared to now. To keep things
simpler, I would still favor using WALInsertLocks for this patch;
that looks more consistent, and there is no noticeable
difference.

Ok, the patch looks fine. So there's nothing for me to do but accept
the current shape, since the discussion about performance does
not seem likely to be settled without performance measurements on
machines with many cores.

I think it is good to check the performance impact of this patch on
a many-core machine. Is it possible for you to check with Alexander
Korotkov to see if he can provide you access to his powerful machine,
which has 70 cores (if I remember correctly)?

@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks,
xl_standby_lock *locks)
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+ XLogSetFlags(XLOG_NO_PROGRESS);

Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
This function is called not only in LogStandbySnapshot(), but during
DDL operations as well where lockmode >= AccessExclusiveLock.

This does not remove any record from WAL. So theoretically any
kind of record can be marked NO_PROGRESS, and in practice that is
fine as long as checkpoints are not unreasonably suppressed. Any
explicit database operation must be accompanied by at least a commit
record, which does trigger a checkpoint. Marking the record
NO_PROGRESS there doesn't seem to me to harm database durability
for this reason.

The objective of this patch is to skip WAL on a completely idle
system, and marking these records NO_PROGRESS is necessary for that
to work. Of course we could distinguish exclusive locks with
PROGRESS from those without, but that is unnecessary complexity.

But rethinking the above, the naming of "XLOG_NO_PROGRESS"
might be inappropriate. "XLOG_NO_CKPT_TRIGGER" or some saner name
would be needed.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#30

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#29)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Mon, Nov 14, 2016 at 12:49 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1K0gGQTBxCyKqi6QnqOWGzEoVVPHCgPJ_RkOBoLPejCTA@mail.gmail.com>

I think it is good to check the performance impact of this patch on
a many-core machine. Is it possible for you to check with Alexander
Korotkov to see if he can provide you access to his powerful machine,
which has 70 cores (if I remember correctly)?

I heard about a number like that, and there is no reason not to run
tests to be sure. With that many cores we are more likely to see
the limit on the number of XLOG insert slots show up as a
bottleneck, but that's just an assumption without any numbers.
Alexander (added in CC), would it be possible to get access to this
machine?

@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks,
xl_standby_lock *locks)
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+ XLogSetFlags(XLOG_NO_PROGRESS);

Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
This function is called not only in LogStandbySnapshot(), but during
DDL operations as well where lockmode >= AccessExclusiveLock.

This does not remove any record from WAL. So theoretically any
kind of record can be marked NO_PROGRESS, and in practice that is
fine as long as checkpoints are not unreasonably suppressed. Any
explicit database operation must be accompanied by at least a commit
record, which does trigger a checkpoint. Marking the record
NO_PROGRESS there doesn't seem to me to harm database durability
for this reason.

The objective of this patch is to skip WAL on a completely idle
system, and marking these records NO_PROGRESS is necessary for that
to work. Of course we could distinguish exclusive locks with
PROGRESS from those without, but that is unnecessary complexity.

The point that applies here is that logging the exclusive lock
information is necessary for recovery conflicts on the *standby*, not
on the primary, which is why it should not influence the checkpoint
activity happening on the primary. So marking this record with
NO_PROGRESS is actually fine to me.
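
To make the argument concrete, here is a minimal standalone sketch of the idea, not the patch's code. Every record advances the insert position, but only records that are not flagged advance the progress LSN. The type and function names and the bit value are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */

#define SKETCH_SKIP_PROGRESS 0x02   /* hypothetical bit value */

typedef struct
{
    Lsn insert_pos;   /* where the next record will start */
    Lsn progress_lsn; /* start of the last record that counted as activity */
} WalSketch;

/*
 * Append a record of 'len' bytes. Every record advances insert_pos, but
 * only records without SKETCH_SKIP_PROGRESS advance progress_lsn.
 * Returns the start position of the record.
 */
Lsn wal_sketch_insert(WalSketch *wal, uint32_t len, uint8_t flags)
{
    Lsn start = wal->insert_pos;

    assert(len > 0);
    wal->insert_pos += len;
    if ((flags & SKETCH_SKIP_PROGRESS) == 0)
        wal->progress_lsn = start;
    return start;
}

/* Idle means: nothing that counts as activity since the last checkpoint. */
bool wal_sketch_is_idle(const WalSketch *wal, Lsn last_checkpoint_lsn)
{
    return wal->progress_lsn == last_checkpoint_lsn;
}
```

In this model a flagged lock record written on behalf of standbys still lands in WAL, yet the primary keeps looking idle until an unflagged record, such as a commit, arrives.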

But rethinking the above, the naming of "XLOG_NO_PROGRESS"
might be inappropriate. "XLOG_NO_CKPT_TRIGGER" or some saner name
would be needed.

To be honest, I have grown fond of NO_PROGRESS over time, even if I
don't much like its negative meaning... Perhaps something like
XLOG_SKIP_PROGRESS would hold more meaning.
--
Michael


#31

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Andres Freund (#28)

1 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Sat, Nov 12, 2016 at 9:01 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-11-11 16:42:43 +0900, Michael Paquier wrote:
+ * This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
Something residing on the same cache line doesn't provide that guarantee
on all platforms.

OK. Let's remove it then.

+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.

Commit records still have to be written, everything else doesn't write
WAL. So I'm doubtful this matters much?

Hm, okay. In most cases this may not matter... Let's rip that out.

@@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
inserted = true;
}
+     /*
+      * Update the LSN progress positions. At least one WAL insertion lock
+      * is already taken appropriately before doing that, and it is simpler
+      * to do that here when the WAL record data and type are at hand.
But we don't use the "WAL record data and type"?

Yes, at some point this patch did so...

+ * GetProgressRecPtr -- Returns the newest WAL activity position, or in
+ * other words any activity not referring to standby logging or segment
+ * switches. Finding the last activity position is done by scanning each
+ * WAL insertion lock by taking directly the light-weight lock associated
+ * to it.
+ */
I'd prefer not to list the specific records here - that's just
guaranteed to get out of date. Why not say something like "any activity
not requiring a checkpoint to be triggered" or such?

OK. Makes sense to minimize maintenance.
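
For readers following along, the lookup described in that comment can be sketched as below: visit each slot, take its lock, read the value, and keep the maximum. Reading under the slot's own lock is also what avoids torn 8-byte reads without relying on cache-line placement. This is a standalone sketch under stated assumptions, not the patch's code. The spinlock built on C11 `atomic_flag` merely stands in for the LWLock, and `NUM_SLOTS`, `ProgressSlot` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_SLOTS 8            /* hypothetical; stands in for NUM_XLOGINSERT_LOCKS */

typedef uint64_t Lsn;          /* stand-in for XLogRecPtr */

typedef struct
{
    atomic_flag lock;          /* toy spinlock standing in for the LWLock */
    Lsn         progress_at;   /* last progress-worthy insert position */
} ProgressSlot;

static ProgressSlot slots[NUM_SLOTS];

static void slot_lock(ProgressSlot *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;                      /* spin until the holder releases */
}

static void slot_unlock(ProgressSlot *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Put every slot into a known state: unlocked, no progress recorded. */
void progress_init(void)
{
    for (int i = 0; i < NUM_SLOTS; i++)
    {
        atomic_flag_clear(&slots[i].lock);
        slots[i].progress_at = 0;
    }
}

/* Record progress in one slot, as an inserter holding that slot would. */
void progress_record(int slot, Lsn start_pos)
{
    assert(slot >= 0 && slot < NUM_SLOTS);
    slot_lock(&slots[slot]);
    slots[slot].progress_at = start_pos;
    slot_unlock(&slots[slot]);
}

/* Scan all slots, one lock at a time, and return the newest position. */
Lsn progress_get(void)
{
    Lsn res = 0;

    for (int i = 0; i < NUM_SLOTS; i++)
    {
        Lsn lsn;

        slot_lock(&slots[i]);
        lsn = slots[i].progress_at;
        slot_unlock(&slots[i]);

        if (res < lsn)
            res = lsn;
    }
    return res;
}
```

Only one slot is locked at any moment, so the scan never blocks all inserters at once. The price is that the result is a snapshot that may already be stale when it is returned.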

+      * If this isn't a shutdown or forced checkpoint, and if there has been no
+      * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+      * duplicate checkpoints when the system is idle. That wastes log space,
+      * and more importantly it exposes us to possible loss of both current and
+      * previous checkpoint records if the machine crashes just as we're writing
+      * the update.

Shouldn't this mention archiving, and that we also ignore some forms
of WAL activity?

I have reworded that as:
"If this isn't a shutdown or forced checkpoint, and if there has been
no WAL activity requiring a checkpoint, skip it."

-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;

that seems a bit too generic a name. 'curinsert_flags'?

OK.

/*
-                      * only log if enough time has passed and some xlog record has
-                      * been inserted.
+                      * Only log if enough time has passed, that some WAL activity
+                      * has happened since last checkpoint, and that some new WAL
+                      * records have been inserted since the last time we came here.

I think that sentence needs some polish.

Let's do this better:
            /*
-            * only log if enough time has passed and some xlog record has
-            * been inserted.
+            * Only log if all of the following conditions are satisfied since
+            * the last time we came here:
+            * - timeout has been reached.
+            * - WAL activity has happened since last checkpoint.
+            * - New WAL records have been inserted.
             */

*/
if (now >= timeout &&
-                             last_snapshot_lsn != GetXLogInsertRecPtr())
+                             GetLastCheckpointRecPtr() < current_progress_lsn &&
+                             last_progress_lsn < current_progress_lsn)
{
Hm. I don't think it's correct to use GetLastCheckpointRecPtr() here?
Don't we need to do the comparisons here (and when doing the checkpoint
itself) with the REDO pointer of the last checkpoint?

Hm? The progress pointer points to the last inserted LSN, which
is not the position of the REDO pointer, but that of the checkpoint
record. Comparing against the REDO pointer would be moot,
because as the checkpoint completes, the progress LSN will
be updated as well. Or do you mean that the progress LSN should *not*
be updated for a checkpoint record? It seems to me that it should
but...

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 397267c..7ecc00e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
static pg_time_t last_checkpoint_time;
static pg_time_t last_xlog_switch_time;
+static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
Hm. Is it a good idea to use a static for this? Did you consider
checkpointer restarts?

Indeed, I forgot about that and the current approach is not robust. The
best way to do things then is to track the LSN position of the last
switched segment in XLogCtl.
--
Michael

Attachments:

hs-checkpoints-v17.patch (text/x-diff; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..d2a8ec2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2826,17 +2826,16 @@ include_dir 'conf.d'
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
-        <varname>archive_timeout</> &mdash; it will bloat your archive
-        storage.  <varname>archive_timeout</> settings of a minute or so are
-        usually reasonable.  You should consider using streaming replication,
-        instead of archiving, if you want data to be copied off the master
-        server more quickly than that.
+        including a single checkpoint.  Checkpoints can however be skipped
+        if there is no database activity, making this parameter a safe
+        setting for environments which are idle for a long period of time.
+        Note that archived files that are closed early due to a forced switch
+        are still the same length as completely full files.  Therefore, it is
+        unwise to use a very short <varname>archive_timeout</> &mdash; it will
+        bloat your archive storage.  <varname>archive_timeout</> settings of
+        a minute or so are usually reasonable.  You should consider using
+        streaming replication, instead of archiving, if you want data to
+        be copied off the master server more quickly than that.
         This parameter can only be set in the
         <filename>postgresql.conf</> file or on the server command line.
        </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..894596b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,22 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -542,8 +553,9 @@ typedef struct XLogCtlData
 	XLogRecPtr	unloggedLSN;
 	slock_t		ulsn_lck;
 
-	/* Time of last xlog segment switch. Protected by WALWriteLock. */
+	/* Time and LSN of last xlog segment switch. Protected by WALWriteLock. */
 	pg_time_t	lastSegSwitchTime;
+	XLogRecPtr	lastSegSwitchLSN;
 
 	/*
 	 * Protected by info_lck and WALWriteLock (you must hold either lock to
@@ -885,6 +897,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. See
+ * XLogSetFlags() for more details.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -897,7 +912,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -997,6 +1014,23 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the LSN progress positions as at least one WAL insertion lock
+	 * is already taken appropriately before doing that. Progress is set at
+	 * the start position of the tracked record that is being added, making
+	 * checkpoint progress tracking easier as the control file already saves
+	 * the start LSN position of the last checkpoint. If an exclusive lock
+	 * is taken for WAL insertion there is no need to update all the progress
+	 * fields, only the first one.
+	 */
+	if ((flags & XLOG_SKIP_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -2333,6 +2367,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 					XLogArchiveNotifySeg(openLogSegNo);
 
 				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
 
 				/*
 				 * Request a checkpoint if we've consumed too much xlog since
@@ -4720,6 +4755,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7436,8 +7472,9 @@ StartupXLOG(void)
 	 */
 	InRecovery = false;
 
-	/* start the archive_timeout timer running */
+	/* start the archive_timeout timer and LSN running */
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
 	/* also initialize latestCompletedXid, to nextXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -7999,16 +8036,66 @@ GetFlushRecPtr(void)
 }
 
 /*
- * Get the time of the last xlog segment switch
+ * GetProgressRecPtr -- Returns the newest WAL activity position, or in
+ * other words any activity not requiring a checkpoint to be triggered.
+ * Finding the last activity position is done by scanning each WAL
+ * insertion lock by taking directly the light-weight lock associated
+ * to it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
+ * Get the time and LSN of the last xlog segment switch
  */
 pg_time_t
-GetLastSegSwitchTime(void)
+GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 {
 	pg_time_t	result;
 
 	/* Need WALWriteLock, but shared lock is sufficient */
 	LWLockAcquire(WALWriteLock, LW_SHARED);
 	result = XLogCtl->lastSegSwitchTime;
+	*lastSwitchLSN = XLogCtl->lastSegSwitchLSN;
 	LWLockRelease(WALWriteLock);
 
 	return result;
@@ -8258,7 +8345,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8339,35 +8426,32 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been
+	 * no WAL activity requiring a checkpoint, skip it.  The idea here is to
+	 * avoid inserting duplicate checkpoints when the system is idle. That
+	 * wastes log space, and more importantly it exposes us to possible loss
+	 * of both current and previous checkpoint records if the machine crashes
+	 * just as we're writing the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
+			ereport(DEBUG1, (errmsg("checkpoint skipped")));
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -9133,6 +9217,8 @@ RequestXLogSwitch(void)
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..720c754 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 curinsert_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	curinsert_flags = 0;
 	begininsert_called = false;
 }
 
@@ -384,13 +384,20 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 }
 
 /*
- * Should this record include the replication origin if one is set up?
+ * Set insert status flags for the upcoming WAL record.
+ *
+ * The flags that can be used here are:
+ * - XLOG_INCLUDE_ORIGIN, to determine if the replication origin should be
+ *   included in the record.
+ * - XLOG_SKIP_PROGRESS, to not update the WAL progress trackers when
+ *   inserting the record.
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+
+	curinsert_flags = flags;
 }
 
 /*
@@ -450,7 +457,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +708,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((curinsert_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index c3f3356..172129f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,25 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if all of the following conditions are satisfied since
+			 * the last time we came here:
+			 * - timeout has been reached.
+			 * - WAL activity has happened since last checkpoint.
+			 * - New WAL records have been inserted.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 397267c..e3feb17 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -580,6 +580,7 @@ CheckArchiveTimeout(void)
 {
 	pg_time_t	now;
 	pg_time_t	last_time;
+	XLogRecPtr	last_switch_lsn;
 
 	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
 		return;
@@ -594,26 +595,37 @@ CheckArchiveTimeout(void)
 	 * Update local state ... note that last_xlog_switch_time is the last time
 	 * a switch was performed *or requested*.
 	 */
-	last_time = GetLastSegSwitchTime();
+	last_time = GetLastSegSwitchData(&last_switch_lsn);
 
 	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
 
 	/* Now we can do the real check */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * Switch segment only when WAL has done some progress since the
+		 * last time a segment has switched because of a timeout. Segment
+		 * switching because of other reasons, like manual triggering of
+		 * pg_switch_xlog() as well as this automatic switch, will not
+		 * cause any progress in WAL.  Note that RequestXLogSwitch() may
+		 * return the beginning of a segment, which is fine to prevent
+		 * any unnecessary switches to happen.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetProgressRecPtr() > last_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			switchpoint = RequestXLogSwitch();
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 875dcec..04ef7dd 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -964,7 +964,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -988,6 +989,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..aba00e2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the replication origin */
+#define XLOG_SKIP_PROGRESS	0x02	/* skip update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..b2a8d03 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -283,7 +283,7 @@ extern const RmgrData RmgrTable[];
 /*
  * Exported to support xlog switching from checkpointer
  */
-extern pg_time_t GetLastSegSwitchTime(void);
+extern pg_time_t GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN);
 extern XLogRecPtr RequestXLogSwitch(void);
 
 extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);
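
As a reading aid for the flag handling in the patch above, here is a minimal standalone sketch of how a per-record flag could keep a record from advancing the progress LSN. It is not the patch's code: the `sketch_` functions and the two static LSN variables are hypothetical stand-ins, the flag values are copied from the xlog.h hunk, and the real insertion path goes through XLogInsertRecord() under the WAL insertion locks.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* simplified stand-in for the real type */

#define XLOG_INCLUDE_ORIGIN 0x01        /* include the replication origin */
#define XLOG_SKIP_PROGRESS  0x02        /* skip update of the progress LSN */

/* Hypothetical state, for illustration only. */
static uint8_t    curinsert_flags = 0;  /* flags of the in-progress insertion */
static XLogRecPtr insert_lsn = 0;       /* end of all inserted WAL */
static XLogRecPtr progress_lsn = 0;     /* end of the last record that counts as activity */

/* Stand-in for XLogSetFlags(): remember the flags for the next insertion. */
static void
sketch_set_flags(uint8_t flags)
{
    curinsert_flags = flags;
}

/* Stand-in for an insertion: always advances WAL, conditionally advances progress. */
static XLogRecPtr
sketch_insert(uint32_t record_len)
{
    insert_lsn += record_len;

    /* Records flagged XLOG_SKIP_PROGRESS leave the progress LSN untouched. */
    if ((curinsert_flags & XLOG_SKIP_PROGRESS) == 0)
        progress_lsn = insert_lsn;

    curinsert_flags = 0;                /* flags apply to a single record */
    return insert_lsn;
}
```

In this model, a standby snapshot would call `sketch_set_flags(XLOG_SKIP_PROGRESS)` before inserting, so an otherwise idle system keeps the same progress LSN even though WAL keeps growing.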

#32

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Michael Paquier (#31)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hello,

It applies to master, compiles cleanly, and passes the regression
tests without error. (I didn't confirm that the problem is still
fixed, but it seems to be fine.)

At Mon, 14 Nov 2016 15:09:09 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRhzS0fNHNAAtRCE+CqdOKKW+KyrAzy5O_R-7zqucGevA@mail.gmail.com>

On Sat, Nov 12, 2016 at 9:01 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-11-11 16:42:43 +0900, Michael Paquier wrote:
+ * This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
Something residing on the same cache line doesn't provide that guarantee
on all platforms.
OK. Let's remove it then.
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
Commit records still have to be written, everything else doesn't write
WAL. So I'm doubtful this matters much?
Hm, okay. In most cases this may not matter... Let's rip that off.
@@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
inserted = true;
}
+     /*
+      * Update the LSN progress positions. At least one WAL insertion lock
+      * is already taken appropriately before doing that, and it is simpler
+      * to do that here when the WAL record data and type are at hand.
But we don't use the "WAL record data and type"?
Yes, at some point this patch did so...
+ * GetProgressRecPtr -- Returns the newest WAL activity position, or in
+ * other words any activity not referring to standby logging or segment
+ * switches. Finding the last activity position is done by scanning each
+ * WAL insertion lock by taking directly the light-weight lock associated
+ * to it.
+ */
I'd prefer not to list the specific records here - that's just
guaranteed to get out of date. Why not say something "any activity not
requiring a checkpoint to be triggered" or such?
OK. Makes sense to minimize maintenance.
+      * If this isn't a shutdown or forced checkpoint, and if there has been no
+      * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+      * duplicate checkpoints when the system is idle. That wastes log space,
+      * and more importantly it exposes us to possible loss of both current and
+      * previous checkpoint records if the machine crashes just as we're writing
+      * the update.
Shouldn't this mention archiving and also that we also ignore some forms
of WAL activity?
I have reworded that as:
"If this isn't a shutdown or forced checkpoint, and if there has been
no WAL activity requiring a checkpoint, skip it."
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;
that seems a bit too generic a name. 'curinsert_flags'?
OK.
/*
-                      * only log if enough time has passed and some xlog record has
-                      * been inserted.
+                      * Only log if enough time has passed, that some WAL activity
+                      * has happened since last checkpoint, and that some new WAL
+                      * records have been inserted since the last time we came here.
I think that sentence needs some polish.
Let's do this better:
/*
-            * only log if enough time has passed and some xlog record has
-            * been inserted.
+            * Only log if all of the following conditions are satisfied since
+            * the last time we came here:
+            * - timeout has been reached.
+            * - WAL activity has happened since last checkpoint.
+            * - New WAL records have been inserted.
*/
*/
if (now >= timeout &&
-                             last_snapshot_lsn != GetXLogInsertRecPtr())
+                             GetLastCheckpointRecPtr() < current_progress_lsn &&
+                             last_progress_lsn < current_progress_lsn)
{
Hm. I don't think it's correct to use GetLastCheckpointRecPtr() here?
Don't we need to do the comparisons here (and when doing the checkpoint
itself) with the REDO pointer of the last checkpoint?
Hm? The progress pointer is pointing to the lastly inserted LSN, which
is not the position of the REDO pointer, but the one of the checkpoint
record. Doing a comparison of the REDO pointer would be a moot
operation, because as the checkpoint completes, the progress LSN will
be updated as well. Or do you mean that the progress LSN should *not*
be updated for a checkpoint record? It seems to me that it should
but...
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 397267c..7ecc00e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
static pg_time_t last_checkpoint_time;
static pg_time_t last_xlog_switch_time;
+static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
Hm. Is it a good idea to use a static for this? Did you consider
checkpointer restarts?
Indeed, I forgot about that and the current approach is not solid. The
best way to do things then is to track the LSN position of the last
switched segment in XLogCtl.

If I'm not missing something, at worst we get a checkpoint
after a checkpointer restart that should have been suppressed. Is
handling that case worth the added complexity?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#33

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#32)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Mon, Nov 14, 2016 at 6:10 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

It applies to master, compiles cleanly, and passes the regression
tests without error. (I didn't confirm that the problem is still
fixed, but it seems to be fine.)

Thanks for double-checking.

If I'm not missing something, at worst we get a checkpoint
after a checkpointer restart that should have been suppressed. Is
handling that case worth the added complexity?

I think so. That's not much code if you think about it, as there
is already a routine to get the timestamp of the last switched
segment that gets used by checkpointer.c.
--
Michael

#34

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Michael Paquier (#30)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Nov 14, 2016 at 12:49 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1K0gGQTBxCyKqi6QnqOWGzEoVVPHCgPJ_RkOBoLPejCTA@mail.gmail.com>

I think it would be good to check the performance impact of this patch
on a many-core machine. Could you check with Alexander Korotkov whether
he can give you access to his powerful machine, which has 70 cores
(if I remember correctly)?

I heard about a number like that, and there is no reason to not do
tests to be sure. With that many cores we are more likely going to see
the limitation of the number of XLOG insert slots popping up as a
bottleneck, but that's just an assumption without any numbers.
Alexander (added in CC), could it be possible to get an access to this
machine?

@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks,
xl_standby_lock *locks)
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+ XLogSetFlags(XLOG_NO_PROGRESS);

Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
This function is called not only in LogStandbySnapshot(), but during
DDL operations as well where lockmode >= AccessExclusiveLock.

This does not remove any record from WAL. So theoretically any
kind of record can be marked NO_PROGRESS, as long as checkpoints
are not unreasonably suppressed in practice. Any explicit
database operation must be accompanied by at least a commit
record, which triggers a checkpoint. Marking the record
NO_PROGRESS there doesn't seem to me to harm database durability
for this reason.

By this theory, you can even mark the insert record as no progress
which is not good.

The objective of this patch is skipping WALs on completely-idle
state and the NO_PROGRESSing is necessary to do its work. Of
course we can distinguish exclusive lock with PROGRESS and
without PROGRESS but it is unnecessary complexity.

The point that applies here is that logging the exclusive lock
information is necessary for the *standby* recovery conflicts, not the
primary which is why it should not influence the checkpoint activity
that is happening on the primary. So marking this record with
NO_PROGRESS is actually fine to me.

The progress parameter is used not only for checkpoint activity but by
bgwriter as well for logging standby snapshot. If you want to keep
this record under no_progress category (which I don't endorse), then
it might be better to add a comment, so that it will be easier for the
readers of this code to understand the reason.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#35

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Amit Kapila (#34)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hello,

At Mon, 14 Nov 2016 16:53:35 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1KJAXA3PdxH4T1QJKBNOvyUK8UKm_GCvTuT+FC5jpjmjg@mail.gmail.com>

On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier

Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
This function is called not only in LogStandbySnapshot(), but during
DDL operations as well where lockmode >= AccessExclusiveLock.

This does not remove any record from WAL. So theoretically any
kind of record can be marked NO_PROGRESS, as long as checkpoints
are not unreasonably suppressed in practice. Any explicit
database operation must be accompanied by at least a commit
record, which triggers a checkpoint. Marking the record
NO_PROGRESS there doesn't seem to me to harm database durability
for this reason.

By this theory, you can even mark the insert record as no progress
which is not good.

Of course, so we choose carefully which kinds of records are
marked this way. If we marked all xlog records as SKIP_PROGRESS,
archive_timeout would become useless, and as a result vacuum might
leave a certain number of records unremoved for a problematically
long time.

The objective of this patch is skipping WALs on completely-idle
state and the NO_PROGRESSing is necessary to do its work. Of
course we can distinguish exclusive lock with PROGRESS and
without PROGRESS but it is unnecessary complexity.

The point that applies here is that logging the exclusive lock
information is necessary for the *standby* recovery conflicts, not the
primary which is why it should not influence the checkpoint activity
that is happening on the primary. So marking this record with
NO_PROGRESS is actually fine to me.

The progress parameter is used not only for checkpoint activity but by
bgwriter as well for logging standby snapshot. If you want to keep
this record under no_progress category (which I don't endorse), then
it might be better to add a comment, so that it will be easier for the
readers of this code to understand the reason.

I rather agree with that. But how detailed should it be?

LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
{
...
XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
/* Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot */
XLogSetFlags(XLOG_SKIP_PROGRESS);

or

/*
* Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot.
* See the comment for LogCurrentRunningXact for the detail.
*/

or more detailed?

The term "WAL activity" is used in the comment for
GetProgressRecPtr. Its meaning is unclear and not well
defined. It might need a more detailed explanation, either of that
term or of "WAL activity tracking". What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#36

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#35)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Tue, Nov 15, 2016 at 9:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello,

At Mon, 14 Nov 2016 16:53:35 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

The progress parameter is used not only for checkpoint activity but by
bgwriter as well for logging standby snapshot. If you want to keep
this record under no_progress category (which I don't endorse), then
it might be better to add a comment, so that it will be easier for the
readers of this code to understand the reason.

I rather agree with that. But how detailed should it be?

LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
{
...
XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
/* Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot */
XLogSetFlags(XLOG_SKIP_PROGRESS);

or

/*
* Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot.
* See the comment for LogCurrentRunningXact for the detail.
*/

or more detailed?

I think referring to a place where we have explained why skipping XLOG
progress is okay for this or related WAL records (like comments for
struct WALInsertLock) will be more suitable. Also, maybe it is worth
mentioning that this code will skip updating XLOG progress even when
we want to log AccessExclusiveLocks for operations other than a
snapshot.

The term "WAL activity" is used in the comment for
GetProgressRecPtr. Its meaning is unclear and not well
defined. It might need a more detailed explanation, either of that
term or of "WAL activity tracking". What do you think about this?

I would have written it as below:

GetProgressRecPtr -- Returns the WAL progress. WAL progress is
determined by scanning each WAL insertion lock, directly taking the
light-weight lock associated with it.
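
The scan described in that comment can be sketched roughly as follows. This is a simplified standalone model, not the patch itself: `SketchInsertLock` and `sketch_get_progress` are hypothetical names, the lock acquisition is reduced to a comment, and only the `progressAt` field name is borrowed from the patch's description of WALInsertLock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the real type */

/* Hypothetical, stripped-down model of one WAL insertion lock. */
typedef struct SketchInsertLock
{
    XLogRecPtr progressAt;      /* last progress-relevant LSN inserted under this lock */
} SketchInsertLock;

/* Return the newest progress position recorded by any insertion lock. */
static XLogRecPtr
sketch_get_progress(const SketchInsertLock *locks, size_t nlocks)
{
    XLogRecPtr res = 0;         /* 0 plays the role of "no progress recorded" */

    for (size_t i = 0; i < nlocks; i++)
    {
        /*
         * The real routine would take the light-weight lock for slot i
         * here and release it after reading progressAt.
         */
        if (locks[i].progressAt > res)
            res = locks[i].progressAt;
    }
    return res;
}
```

Because each insertion only updates the slot whose lock it already holds, the cost of finding the overall maximum is moved from the insert path to the rare reader.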

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#37

David Steele

david@pgmasters.net

about 9 years ago

In reply to: Michael Paquier (#33)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 11/14/16 4:29 AM, Michael Paquier wrote:

On Mon, Nov 14, 2016 at 6:10 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

It applies to master, compiles cleanly, and passes the regression
tests without error. (I didn't confirm that the problem is still
fixed, but it seems to be fine.)

Thanks for double-checking.

Also looks good to me. I like curinsert_flags and XLOG_SKIP_PROGRESS
better than the old names.

If I'm not missing something, at worst we get a checkpoint
after a checkpointer restart that should have been suppressed. Is
handling that case worth the added complexity?

That's the way I read it as well. It's not clear to me how the
checkpointer would get restarted under normal circumstances.

I did a kill on the checkpointer and it was ignored. After a kill -9
the checkpointer process came back but also switched the xlog. Is this
the expected behavior?

--
-David
david@pgmasters.net

#38

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Michael Paquier (#30)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Mon, Nov 14, 2016 at 12:49 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1K0gGQTBxCyKqi6QnqOWGzEoVVPHCgPJ_RkOBoLPejCTA@mail.gmail.com>

I think it would be good to check the performance impact of this patch
on a many-core machine. Could you check with Alexander Korotkov whether
he can give you access to his powerful machine, which has 70 cores
(if I remember correctly)?

I heard about a number like that, and there is no reason to not do
tests to be sure.

Okay, I have done some performance tests with this patch and found that it
doesn't have any noticeable impact, which is good. Details of the performance
tests are below:
Machine configuration:
2 sockets, 28 cores (56 including Hyper-Threading)
RAM = 64GB
Data directory is configured on the magnetic disk and WAL on SSD.

Non-default postgresql.conf parameters
shared_buffers=8GB
max_connections=200
bgwriter_delay=10ms
checkpoint_completion_target=0

Keeping the above parameters fixed, I have varied checkpoint_timeout
across tests. Each of the results below is the median of three 15-minute
pgbench TPC-B runs. All the tests are performed at 64 and/or 128 client
count (client count = number of concurrent sessions and threads, e.g.
-c 8 -j 8). All the tests use pgbench scale factor 300, which means the
data fits in shared buffers.

checkpoint_timeout=30s
client_count/patch_ver    64      128
HEAD                      5176    6853
Patch                     4963    6556

checkpoint_timeout=60s
client_count/patch_ver    64      128
HEAD                      4962    6894
Patch                     5228    6814

checkpoint_timeout=120s
client_count/patch_ver    64      128
HEAD                      5443    7308
Patch                     5453    6937

checkpoint_timeout=150s
client_count/patch_ver    128
HEAD                      7316
Patch                     7188

In the above results, you can see that in some cases (for example,
checkpoint_timeout=30s at 128 clients) TPS with the patch is slightly
lower (1~5%), but I attribute that to run-to-run variation, because on
repeating the tests I could not reproduce the regression. The reason for
keeping low values for checkpoint_timeout and bgwriter_delay is to test
whether there is any impact from the new locking introduced in the
checkpointer and bgwriter. The conclusion from my tests is that this patch
is okay as far as performance is concerned.
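
To put the differences above in relative terms, a small helper can compute the percentage change of the patched TPS against HEAD. This is only an illustrative sketch; `relative_change_pct` is a hypothetical name, and the sample inputs in the note below are taken from the results above.

```c
#include <assert.h>

/*
 * Percentage change of the patched throughput relative to HEAD.
 * Negative values mean the patched build was slower in that run.
 */
static double
relative_change_pct(double head_tps, double patch_tps)
{
    return (patch_tps - head_tps) / head_tps * 100.0;
}
```

For instance, `relative_change_pct(6853, 6556)` is about -4.3 for checkpoint_timeout=30s at 128 clients, while `relative_change_pct(4962, 5228)` is about +5.4 for checkpoint_timeout=60s at 64 clients, so the deviations go in both directions.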

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#39

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Amit Kapila (#38)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Fri, Nov 18, 2016 at 7:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I have done some performance tests with this patch and found that it doesn't have any noticeable impact, which is good. Details of the performance tests are below:
Machine configuration:
2 sockets, 28 cores (56 including Hyper-Threading)
RAM = 64GB
Data directory is configured on the magnetic disk and WAL on SSD.

Nice spec!

The conclusion from my tests is that this patch is okay as far as performance is concerned.

Thank you a lot for doing those additional tests!
--
Michael

#40

David Steele

david@pgmasters.net

about 9 years ago

In reply to: David Steele (#37)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 11/18/16 12:38 PM, David Steele wrote:

On 11/14/16 4:29 AM, Michael Paquier wrote:

On Mon, Nov 14, 2016 at 6:10 PM, Kyotaro HORIGUCHI

If I'm not missing something, at worst we get a checkpoint
after a checkpointer restart that should have been suppressed. Is
handling that case worth the added complexity?

That's the way I read it as well. It's not clear to me how the
checkpointer would get restarted under normal circumstances.

I did a kill on the checkpointer and it was ignored. After a kill -9
the checkpointer process came back but also switched the xlog. Is this
the expected behavior?

Ah, never mind. I can see this caused a restart and recovery so the
archive timeout was reset and a switch occurred after timeout.

--
-David
david@pgmasters.net

#41

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Michael Paquier (#39)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Thank you very much for the testing on the nice machine.

At Fri, 18 Nov 2016 20:35:43 -0800, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRa=igQMCx+FxbfwJ0TzhLU2tE+YOng7qAvZ+1NPm-FOw@mail.gmail.com>

On Fri, Nov 18, 2016 at 7:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I have done some performance tests with this patch and found that it doesn't have any noticeable impact, which is good. Details of the performance tests are below:
Machine configuration:
2 sockets, 28 cores (56 including Hyper-Threading)
RAM = 64GB
Data directory is configured on the magnetic disk and WAL on SSD.

Nice spec!

This spec seems enough to see the performance of this patch.

The conclusion from my tests is that this patch is okay as far as performance is concerned.

Thank you a lot for doing those additional tests!

So all my original concerns are cleared. The last one is the
reset caused by a checkpointer restart. I'd like to remove that if
Andres agrees.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#42

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Amit Kapila (#36)

1 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Tue, Nov 15, 2016 at 9:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 15, 2016 at 9:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The term "WAL activity" is used in the comment for
GetProgressRecPtr. Its meaning is unclear and not well
defined. It might need a more detailed explanation, either of that
term or of "WAL activity tracking". What do you think about this?

I would have written it as below:

GetProgressRecPtr -- Returns the WAL progress. WAL progress is
determined by scanning each WAL insertion lock, directly taking the
light-weight lock associated with it.

Not sure if that's better. What about something like this?
 /*
- * Get the time of the last xlog segment switch
+ * GetProgressRecPtr -- Returns the newest WAL progress position.  WAL
+ * progress is determined by scanning each WALinsertion lock by taking
+ * directly the light-weight lock associated to it.  The result of this
+ * routine can be compared with the last checkpoint LSN to check if
+ * a checkpoint can be skipped or not.
+ *
It may be worth mentioning that the result of this routine is
basically used for checkpoint skip logic.
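
The comparison with the last checkpoint LSN could look roughly like the sketch below. It is a simplified model under stated assumptions, not the code in the attached patch: `sketch_can_skip_checkpoint` and its parameters are hypothetical, and it assumes that a progress position no newer than the last checkpoint record means nothing checkpoint-worthy has happened since.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the real type */

/*
 * Decide whether a timed checkpoint can be skipped.
 *
 * forced_or_shutdown: shutdown and forced checkpoints are never skipped.
 * progress_lsn:       what GetProgressRecPtr() would return.
 * last_checkpoint_lsn: position of the last checkpoint record.
 */
static bool
sketch_can_skip_checkpoint(bool forced_or_shutdown,
                           XLogRecPtr progress_lsn,
                           XLogRecPtr last_checkpoint_lsn)
{
    if (forced_or_shutdown)
        return false;

    /* No progress beyond the last checkpoint record: the system was idle. */
    return progress_lsn <= last_checkpoint_lsn;
}
```

Standby snapshots inserted with XLOG_SKIP_PROGRESS do not move `progress_lsn`, which is what lets an idle system reach the skip branch.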
-- 
Michael

Attachments:

hs-checkpoints-v18.patch (text/plain; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..d2a8ec2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2826,17 +2826,16 @@ include_dir 'conf.d'
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
-        <varname>archive_timeout</> &mdash; it will bloat your archive
-        storage.  <varname>archive_timeout</> settings of a minute or so are
-        usually reasonable.  You should consider using streaming replication,
-        instead of archiving, if you want data to be copied off the master
-        server more quickly than that.
+        including a single checkpoint.  Checkpoints can however be skipped
+        if there is no database activity, making this parameter a safe
+        setting for environments which are idle for a long period of time.
+        Note that archived files that are closed early due to a forced switch
+        are still the same length as completely full files.  Therefore, it is
+        unwise to use a very short <varname>archive_timeout</> &mdash; it will
+        bloat your archive storage.  <varname>archive_timeout</> settings of
+        a minute or so are usually reasonable.  You should consider using
+        streaming replication, instead of archiving, if you want data to
+        be copied off the master server more quickly than that.
         This parameter can only be set in the
         <filename>postgresql.conf</> file or on the server command line.
        </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..cfbf584 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,22 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -542,8 +553,9 @@ typedef struct XLogCtlData
 	XLogRecPtr	unloggedLSN;
 	slock_t		ulsn_lck;
 
-	/* Time of last xlog segment switch. Protected by WALWriteLock. */
+	/* Time and LSN of last xlog segment switch. Protected by WALWriteLock. */
 	pg_time_t	lastSegSwitchTime;
+	XLogRecPtr	lastSegSwitchLSN;
 
 	/*
 	 * Protected by info_lck and WALWriteLock (you must hold either lock to
@@ -885,6 +897,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. See
+ * XLogSetFlags() for more details.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -897,7 +912,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -997,6 +1014,23 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the LSN progress positions as at least one WAL insertion lock
+	 * is already taken appropriately before doing that. Progress is set at
+	 * the start position of the tracked record that is being added, making
+	 * checkpoint progress tracking easier as the control file already saves
+	 * the start LSN position of the last checkpoint. If an exclusive lock
+	 * is taken for WAL insertion there is no need to update all the progress
+	 * fields, only the first one.
+	 */
+	if ((flags & XLOG_SKIP_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -2333,6 +2367,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 					XLogArchiveNotifySeg(openLogSegNo);
 
 				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
 
 				/*
 				 * Request a checkpoint if we've consumed too much xlog since
@@ -4720,6 +4755,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7436,8 +7472,9 @@ StartupXLOG(void)
 	 */
 	InRecovery = false;
 
-	/* start the archive_timeout timer running */
+	/* start the archive_timeout timer and LSN running */
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
 	/* also initialize latestCompletedXid, to nextXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -7999,16 +8036,66 @@ GetFlushRecPtr(void)
 }
 
 /*
- * Get the time of the last xlog segment switch
+ * GetProgressRecPtr -- Returns the newest WAL progress position.  WAL
+ * progress is determined by scanning each WALinsertion lock by taking
+ * directly the light-weight lock associated to it.  The result of this
+ * routine can be compared with the last checkpoint LSN to check if
+ * a checkpoint can be skipped or not.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
+ * Get the time and LSN of the last xlog segment switch
  */
 pg_time_t
-GetLastSegSwitchTime(void)
+GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 {
 	pg_time_t	result;
 
 	/* Need WALWriteLock, but shared lock is sufficient */
 	LWLockAcquire(WALWriteLock, LW_SHARED);
 	result = XLogCtl->lastSegSwitchTime;
+	*lastSwitchLSN = XLogCtl->lastSegSwitchLSN;
 	LWLockRelease(WALWriteLock);
 
 	return result;
@@ -8258,7 +8345,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8339,35 +8426,32 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been
+	 * no WAL activity requiring a checkpoint, skip it.  The idea here is to
+	 * avoid inserting duplicate checkpoints when the system is idle. That
+	 * wastes log space, and more importantly it exposes us to possible loss
+	 * of both current and previous checkpoint records if the machine crashes
+	 * just as we're writing the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
+			ereport(DEBUG1, (errmsg("checkpoint skipped")));
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -9133,6 +9217,8 @@ RequestXLogSwitch(void)
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..720c754 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 curinsert_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	curinsert_flags = 0;
 	begininsert_called = false;
 }
 
@@ -384,13 +384,20 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 }
 
 /*
- * Should this record include the replication origin if one is set up?
+ * Set insert status flags for the upcoming WAL record.
+ *
+ * The flags that can be used here are:
+ * - XLOG_INCLUDE_ORIGIN, to determine if the replication origin should be
+ *   included in the record.
+ * - XLOG_SKIP_PROGRESS, to not update the WAL progress trackers when
+ *   inserting the record.
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+
+	curinsert_flags = flags;
 }
 
 /*
@@ -450,7 +457,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +708,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((curinsert_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index c3f3356..172129f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,25 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if all of the following conditions are satisfied
+			 * since the last time we came here:
+			 * - timeout has been reached.
+			 * - WAL activity has happened since last checkpoint.
+			 * - New WAL records have been inserted.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 397267c..e3feb17 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -580,6 +580,7 @@ CheckArchiveTimeout(void)
 {
 	pg_time_t	now;
 	pg_time_t	last_time;
+	XLogRecPtr	last_switch_lsn;
 
 	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
 		return;
@@ -594,26 +595,37 @@ CheckArchiveTimeout(void)
 	 * Update local state ... note that last_xlog_switch_time is the last time
 	 * a switch was performed *or requested*.
 	 */
-	last_time = GetLastSegSwitchTime();
+	last_time = GetLastSegSwitchData(&last_switch_lsn);
 
 	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
 
 	/* Now we can do the real check */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * Switch segment only when WAL has done some progress since the
+		 * last time a segment has switched because of a timeout. Segment
+		 * switching because of other reasons, like manual triggering of
+		 * pg_switch_xlog() as well as this automatic switch, will not
+		 * cause any progress in WAL.  Note that RequestXLogSwitch() may
+		 * return the beginning of a segment, which is fine to prevent
+		 * any unnecessary switches to happen.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetProgressRecPtr() > last_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			switchpoint = RequestXLogSwitch();
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 875dcec..04ef7dd 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -964,7 +964,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -988,6 +989,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..aba00e2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the replication origin */
+#define XLOG_SKIP_PROGRESS	0x02	/* skip update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..b2a8d03 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -283,7 +283,7 @@ extern const RmgrData RmgrTable[];
 /*
  * Exported to support xlog switching from checkpointer
  */
-extern pg_time_t GetLastSegSwitchTime(void);
+extern pg_time_t GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN);
 extern XLogRecPtr RequestXLogSwitch(void);
 
 extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);

#43

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#41)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Mon, Nov 21, 2016 at 1:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

So, all my original concerns were cleared.

Cool. Perhaps this could be marked as ready for committer then?

The last one is
resetting by a checkpointer restart.. I'd like to remove that if
Andres agrees.

Could you clarify this point? v18 makes sure that the last segment
switch stays in shared memory so that we can still correctly skip the
archive_timeout activity.
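
As a rough illustration of that point, here is a minimal sketch and not code from the patch. All names in it (ModelSharedWal, ModelCheckpointer, model_check_archive_timeout) are hypothetical stand-ins. It assumes the last switch time and LSN live in shared state while the checkpointer keeps only a local timestamp, so a restarted checkpointer can still tell that nothing has happened since the last switch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; none of these names exist in PostgreSQL. */
typedef uint64_t Lsn;

/* Stands in for the fields kept in shared memory. */
typedef struct
{
    int64_t last_seg_switch_time;
    Lsn     last_seg_switch_lsn;
} ModelSharedWal;

/* Stands in for checkpointer-local state, which is lost on restart. */
typedef struct
{
    int64_t last_xlog_switch_time;
} ModelCheckpointer;

/* A (re)started checkpointer only knows the current time. */
static void
model_checkpointer_start(ModelCheckpointer *cp, int64_t now)
{
    cp->last_xlog_switch_time = now;
}

/* Returns true when a timeout-driven segment switch would be requested. */
static bool
model_check_archive_timeout(ModelCheckpointer *cp,
                            const ModelSharedWal *shared,
                            Lsn progress_lsn,
                            int64_t now,
                            int64_t archive_timeout)
{
    bool do_switch;

    assert(archive_timeout > 0);

    /* Pick up a more recent switch recorded in shared state, if any. */
    if (shared->last_seg_switch_time > cp->last_xlog_switch_time)
        cp->last_xlog_switch_time = shared->last_seg_switch_time;

    if (now - cp->last_xlog_switch_time < archive_timeout)
        return false;

    /* The LSN comparison relies only on shared state. */
    do_switch = progress_lsn > shared->last_seg_switch_lsn;

    /* Update local state in any case, so the check is not retried constantly. */
    cp->last_xlog_switch_time = now;
    return do_switch;
}
```

In this model the decision to switch depends only on the progress LSN and the shared switch LSN, so resetting the local timestamp changes when the check fires but not its outcome.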
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#44

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Michael Paquier (#43)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hello,

At Mon, 21 Nov 2016 14:41:27 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSetnFjhGAB+tE2M68Vc_3BwbsEPe+dCMB8xnH0UYw3aA@mail.gmail.com>

On Mon, Nov 21, 2016 at 1:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

So, all my original concerns were cleared.

Cool. Perhaps this could be marked as ready for committer then?

^^;

The last one is
resetting by a checkpointer restart.. I'd like to remove that if
Andres agrees.

Could you clarify this point? v18 makes sure that the last segment
switch stays in shared memory so that we can still correctly skip the
archive_timeout activity.

I don't doubt that it works. (I don't comment on the comment :) My
concern is complexity. I don't think it is worth guarding against an
almost harmless behavior caused by something that rarely happens. But
if you and others on this thread don't mind the complexity, it's not
worth asserting myself further.

So, after a day waiting, I'll mark this as ready for committer
again.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#45

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#44)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

I almost forgot this.

At Mon, 21 Nov 2016 15:44:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161121.154408.47398334.horiguchi.kyotaro@lab.ntt.co.jp>

Hello,

At Mon, 21 Nov 2016 14:41:27 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSetnFjhGAB+tE2M68Vc_3BwbsEPe+dCMB8xnH0UYw3aA@mail.gmail.com>

On Mon, Nov 21, 2016 at 1:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

So, all my original concerns were cleared.

Cool. Perhaps this could be marked as ready for committer then?

^^;

The last one is
resetting by a checkpointer restart.. I'd like to remove that if
Andres agrees.

Could you clarify this point? v18 makes sure that the last segment
switch stays in shared memory so that we can still correctly skip the
archive_timeout activity.

I don't doubt that it works. (I don't comment on the comment :) My
concern is complexity. I don't think it is worth guarding against an
almost harmless behavior caused by something that rarely happens. But
if you and others on this thread don't mind the complexity, it's not
worth asserting myself further.

So, after a day waiting, I'll mark this as ready for committer
again.

I have marked this as ready for committer again.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#46

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#45)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Tue, Nov 22, 2016 at 6:27 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I have marked this as ready for committer again.

And moved to next CF for now.
--
Michael

#47

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Michael Paquier (#42)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Mon, Nov 21, 2016 at 11:08 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Nov 15, 2016 at 9:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 15, 2016 at 9:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The term "WAL activity" is used in the comment for
GetProgressRecPtr. Its meaning is not clear and not well
defined. It might need a more detailed explanation of that term or of
"WAL activity tracking". What do you think about this?

I would have written it as below:

GetProgressRecPtr -- Returns the WAL progress. WAL progress is
determined by scanning each WALinsertion lock by taking directly the
light-weight lock associated to it.
Not sure if that's better.. What about something as fancy as that?
/*
- * Get the time of the last xlog segment switch
+ * GetProgressRecPtr -- Returns the newest WAL progress position.  WAL
+ * progress is determined by scanning each WALinsertion lock by taking
+ * directly the light-weight lock associated to it.  The result of this
+ * routine can be compared with the last checkpoint LSN to check if
+ * a checkpoint can be skipped or not.
+ *
It may be worth mentioning that the result of this routine is
basically used for checkpoint skip logic.
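
For readers following the comment wording above, a minimal sketch of the idea it describes may help. This is an illustration under assumptions, not the patch's code: ModelInsertLock, model_get_progress and model_checkpoint_can_be_skipped are hypothetical names, the lock count is arbitrary, and the per-lock locking done by the real routine is omitted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; none of these names exist in PostgreSQL. */
typedef uint64_t Lsn;

#define MODEL_INVALID_LSN       ((Lsn) 0)
#define MODEL_NUM_INSERT_LOCKS  8   /* arbitrary value for the sketch */

/* One progress slot per insertion lock. */
typedef struct
{
    Lsn progress_at;    /* start LSN of the last progress-tracked record */
} ModelInsertLock;

/*
 * Newest progress position: the maximum over all slots.  The patch
 * takes each lock while reading its slot; that is left out here.
 */
static Lsn
model_get_progress(const ModelInsertLock *locks, size_t nlocks)
{
    Lsn     res = MODEL_INVALID_LSN;
    size_t  i;

    assert(locks != NULL);

    for (i = 0; i < nlocks; i++)
    {
        if (res < locks[i].progress_at)
            res = locks[i].progress_at;
    }
    return res;
}

/*
 * A checkpoint is a candidate for skipping when the newest tracked
 * record is the previous checkpoint record itself.
 */
static bool
model_checkpoint_can_be_skipped(const ModelInsertLock *locks, size_t nlocks,
                                Lsn last_checkpoint_lsn)
{
    return model_get_progress(locks, nlocks) == last_checkpoint_lsn;
}
```

Records inserted with the skip-progress flag would simply never write to a slot in this model, which is why they cannot defeat the comparison.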

That's okay, but I think you are using it to skip the segment switch
as well. Today, going through the patch again, I noticed a small anomaly:

+ * Switch segment only when WAL has done some progress since the
+ * last time a segment has switched because of a timeout.

+ if (GetProgressRecPtr() > last_switch_lsn)

Either the above comment is wrong or the code after it has a problem.
last_switch_lsn, aka XLogCtl->lastSegSwitchLSN, is updated not only on
a timeout but also when heavy WAL activity makes a WAL write cross a
segment boundary.
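
To make the mismatch concrete, here is a small model, offered as a sketch only. The names (ModelWal, model_segment_switched, model_timeout_switch_needed) are hypothetical and the LSN values are arbitrary. It assumes that any segment switch, whatever its cause, records the switch position, and that the timeout check compares the progress position against it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; none of these names exist in PostgreSQL. */
typedef uint64_t Lsn;

typedef struct
{
    Lsn progress_lsn;           /* start of the last progress-tracked record */
    Lsn last_seg_switch_lsn;    /* position recorded at the last segment switch */
} ModelWal;

/* A record that counts as activity is inserted at 'start'. */
static void
model_insert_tracked(ModelWal *wal, Lsn start)
{
    assert(start >= wal->progress_lsn);     /* LSNs only move forward */
    wal->progress_lsn = start;
}

/*
 * Any segment switch records its position, whether it was forced by
 * the timeout or happened because WAL filled the segment.
 */
static void
model_segment_switched(ModelWal *wal, Lsn flushed_upto)
{
    wal->last_seg_switch_lsn = flushed_upto;
}

/* Same shape as the check: GetProgressRecPtr() > last_switch_lsn */
static bool
model_timeout_switch_needed(const ModelWal *wal)
{
    return wal->progress_lsn > wal->last_seg_switch_lsn;
}
```

With this model, a switch caused by a full segment resets the comparison just like a timeout switch does, so the check really means "progress since the last switch of any kind".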

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#48

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Amit Kapila (#47)

1 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Wed, Nov 30, 2016 at 7:53 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 21, 2016 at 11:08 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Tue, Nov 15, 2016 at 9:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 15, 2016 at 9:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The term "WAL activity" is used in the comment for
GetProgressRecPtr. Its meaning is not clear and not well
defined. It might need a more detailed explanation of that term or of
"WAL activity tracking". What do you think about this?

I would have written it as below:

GetProgressRecPtr -- Returns the WAL progress. WAL progress is
determined by scanning each WALinsertion lock by taking directly the
light-weight lock associated to it.
Not sure if that's better.. What about something as fancy as that?
/*
- * Get the time of the last xlog segment switch
+ * GetProgressRecPtr -- Returns the newest WAL progress position.  WAL
+ * progress is determined by scanning each WALinsertion lock by taking
+ * directly the light-weight lock associated to it.  The result of this
+ * routine can be compared with the last checkpoint LSN to check if
+ * a checkpoint can be skipped or not.
+ *
It may be worth mentioning that the result of this routine is
basically used for checkpoint skip logic.
That's okay, but I think you are using it to skip the segment switch
as well. Today, going through the patch again, I noticed a small anomaly:

+ * Switch segment only when WAL has done some progress since the
+ * last time a segment has switched because of a timeout.

+ if (GetProgressRecPtr() > last_switch_lsn)

Either the above comment is wrong or the code after it has a problem.
last_switch_lsn, aka XLogCtl->lastSegSwitchLSN, is updated not only on
a timeout but also when heavy WAL activity makes a WAL write cross a
segment boundary.

Right, this should be reworded a bit better to mention both. Done as attached.
--
Michael

Attachments:

hs-checkpoints-v19.patch (invalid/octet-stream)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b1c5289..7b56cd1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2825,17 +2825,16 @@ include_dir 'conf.d'
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
-        <varname>archive_timeout</> &mdash; it will bloat your archive
-        storage.  <varname>archive_timeout</> settings of a minute or so are
-        usually reasonable.  You should consider using streaming replication,
-        instead of archiving, if you want data to be copied off the master
-        server more quickly than that.
+        including a single checkpoint.  Checkpoints can however be skipped
+        if there is no database activity, making this parameter a safe
+        setting for environments which are idle for a long period of time.
+        Note that archived files that are closed early due to a forced switch
+        are still the same length as completely full files.  Therefore, it is
+        unwise to use a very short <varname>archive_timeout</> &mdash; it will
+        bloat your archive storage.  <varname>archive_timeout</> settings of
+        a minute or so are usually reasonable.  You should consider using
+        streaming replication, instead of archiving, if you want data to
+        be copied off the master server more quickly than that.
         This parameter can only be set in the
         <filename>postgresql.conf</> file or on the server command line.
        </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d643216..8463bb3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5234,7 +5234,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 084401d..bbc56b0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,22 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -542,8 +553,9 @@ typedef struct XLogCtlData
 	XLogRecPtr	unloggedLSN;
 	slock_t		ulsn_lck;
 
-	/* Time of last xlog segment switch. Protected by WALWriteLock. */
+	/* Time and LSN of last xlog segment switch. Protected by WALWriteLock. */
 	pg_time_t	lastSegSwitchTime;
+	XLogRecPtr	lastSegSwitchLSN;
 
 	/*
 	 * Protected by info_lck and WALWriteLock (you must hold either lock to
@@ -885,6 +897,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. See
+ * XLogSetFlags() for more details.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -897,7 +912,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -997,6 +1014,23 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the LSN progress positions as at least one WAL insertion lock
+	 * is already taken appropriately before doing that. Progress is set at
+	 * the start position of the tracked record that is being added, making
+	 * checkpoint progress tracking easier as the control file already saves
+	 * the start LSN position of the last checkpoint. If an exclusive lock
+	 * is taken for WAL insertion there is no need to update all the progress
+	 * fields, only the first one.
+	 */
+	if ((flags & XLOG_SKIP_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -2333,6 +2367,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 					XLogArchiveNotifySeg(openLogSegNo);
 
 				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
 
 				/*
 				 * Request a checkpoint if we've consumed too much xlog since
@@ -4720,6 +4755,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7436,8 +7472,9 @@ StartupXLOG(void)
 	 */
 	InRecovery = false;
 
-	/* start the archive_timeout timer running */
+	/* start the archive_timeout timer and LSN running */
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
 	/* also initialize latestCompletedXid, to nextXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -7999,16 +8036,66 @@ GetFlushRecPtr(void)
 }
 
 /*
- * Get the time of the last xlog segment switch
+ * GetProgressRecPtr -- Returns the newest WAL progress position.  WAL
+ * progress is determined by scanning each WAL insertion lock by taking
+ * directly the light-weight lock associated to it.  The result of this
+ * routine can be compared with the last checkpoint LSN to check if
+ * a checkpoint can be skipped or not.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
+ * Get the time and LSN of the last xlog segment switch
  */
 pg_time_t
-GetLastSegSwitchTime(void)
+GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 {
 	pg_time_t	result;
 
 	/* Need WALWriteLock, but shared lock is sufficient */
 	LWLockAcquire(WALWriteLock, LW_SHARED);
 	result = XLogCtl->lastSegSwitchTime;
+	*lastSwitchLSN = XLogCtl->lastSegSwitchLSN;
 	LWLockRelease(WALWriteLock);
 
 	return result;
@@ -8258,7 +8345,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8339,35 +8426,32 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been
+	 * no WAL activity requiring a checkpoint, skip it.  The idea here is to
+	 * avoid inserting duplicate checkpoints when the system is idle. That
+	 * wastes log space, and more importantly it exposes us to possible loss
+	 * of both current and previous checkpoint records if the machine crashes
+	 * just as we're writing the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
+			ereport(DEBUG1, (errmsg("checkpoint skipped")));
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -9133,6 +9217,8 @@ RequestXLogSwitch(void)
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..720c754 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 curinsert_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	curinsert_flags = 0;
 	begininsert_called = false;
 }
 
@@ -384,13 +384,20 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 }
 
 /*
- * Should this record include the replication origin if one is set up?
+ * Set insert status flags for the upcoming WAL record.
+ *
+ * The flags that can be used here are:
+ * - XLOG_INCLUDE_ORIGIN, to determine if the replication origin should be
+ *   included in the record.
+ * - XLOG_SKIP_PROGRESS, to not update the WAL progress trackers when
+ *   inserting the record.
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+
+	curinsert_flags = flags;
 }
 
 /*
@@ -450,7 +457,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +708,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((curinsert_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a31d44e..0ad402e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -79,12 +79,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -310,7 +310,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -319,19 +319,25 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if all of the following conditions are satisfied since
+			 * the last time we came here:
+			 * - timeout has been reached.
+			 * - WAL activity has happened since last checkpoint.
+			 * - New WAL records have been inserted.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 92b0a94..532d3c0 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -582,6 +582,7 @@ CheckArchiveTimeout(void)
 {
 	pg_time_t	now;
 	pg_time_t	last_time;
+	XLogRecPtr	last_switch_lsn;
 
 	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
 		return;
@@ -596,26 +597,37 @@ CheckArchiveTimeout(void)
 	 * Update local state ... note that last_xlog_switch_time is the last time
 	 * a switch was performed *or requested*.
 	 */
-	last_time = GetLastSegSwitchTime();
+	last_time = GetLastSegSwitchData(&last_switch_lsn);
 
 	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
 
 	/* Now we can do the real check */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * Switch segment only when WAL has done some progress since the
+		 * last time a segment has switched because of a timeout or because
+		 * of some WAL activity.  Segment switching because of other
+		 * reasons, like manual triggering of pg_switch_xlog() as well as
+		 * this automatic switch, will not cause any progress in WAL.  Note
+		 * that RequestXLogSwitch() may return the beginning of a segment,
+		 * which is fine to prevent any unnecessary switches to happen.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetProgressRecPtr() > last_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			switchpoint = RequestXLogSwitch();
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 875dcec..04ef7dd 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -964,7 +964,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -988,6 +989,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..aba00e2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the replication origin */
+#define XLOG_SKIP_PROGRESS	0x02	/* skip update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..b2a8d03 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -283,7 +283,7 @@ extern const RmgrData RmgrTable[];
 /*
  * Exported to support xlog switching from checkpointer
  */
-extern pg_time_t GetLastSegSwitchTime(void);
+extern pg_time_t GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN);
 extern XLogRecPtr RequestXLogSwitch(void);
 
 extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);

#49

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Michael Paquier (#48)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Fri, Dec 2, 2016 at 9:50 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 30, 2016 at 7:53 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+ * Switch segment only when WAL has done some progress since the
+ * last time a segment has switched because of a timeout.
+ if (GetProgressRecPtr() > last_switch_lsn)

Either the above comment is wrong or the code after it has a problem.
last_switch_lsn aka XLogCtl->lastSegSwitchLSN is updated not only for
a timeout but also when enough WAL activity makes a WAL write cross
a segment boundary.

Right, this should be reworded a bit better to mention both. Done as attached.

+ * Switch segment only when WAL has done some progress since the
+ * last time a segment has switched because of a timeout or because
+ * of some WAL activity.

I think it could be better written as below, but it is up to you
whether to retain your version or use the one below.

Switch segment only when WAL has done some progress since the last
time a segment has switched due to timeout or WAL activity.

Apart from that, the patch looks good to me.
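
As an illustration of the segment-switch rule in that comment, here is a
minimal standalone sketch in C. It is not code from the patch:
should_switch_segment() and its parameters are hypothetical names, and
XLogRecPtr is a plain uint64_t stand-in for PostgreSQL's type. It only
models the decision made in CheckArchiveTimeout(): a switch is requested
when the timeout has elapsed and the progress LSN has moved past the LSN
recorded at the last segment switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's LSN type */

/*
 * Hypothetical helper modelling the archive_timeout decision.
 * Times are in seconds; archive_timeout <= 0 disables the feature.
 */
bool
should_switch_segment(int64_t now, int64_t last_switch_time,
					  int archive_timeout,
					  XLogRecPtr progress_lsn, XLogRecPtr last_switch_lsn)
{
	/* toy model: the clock never runs backwards */
	assert(now >= last_switch_time);

	if (archive_timeout <= 0)
		return false;
	if (now - last_switch_time < archive_timeout)
		return false;

	/* switch only if tracked WAL was inserted since the last switch */
	return progress_lsn > last_switch_lsn;
}
```

The sketch leaves out the state update that the patch performs whether or
not a switch happens, so it should be read as a model of the condition
only.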

Note to Committer: As discussed above [1], this patch skips updating the
progress LSN for LogAccessExclusiveLocks(), which can be called from
multiple places, so for clarity we should either document it or skip it
only when absolutely necessary.

[1]: /messages/by-id/CAA4eK1KJAXA3PdxH4T1QJKBNOvyUK8UKm_GCvTuT+FC5jpjmjg@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#50

Andres Freund

andres@anarazel.de

about 9 years ago

In reply to: Michael Paquier (#48)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hi,

A mime-type of invalid/octet-stream? That's an, uh, odd choice.

Working on committing this (tomorrow morning, not tonight). There's
some relatively minor things I want to change:

- I don't like the name XLogSetFlags() - it's completely unclear what
those flags refer to - it could just as well be replay
related. XLogSetRecordFlags()?
- Similarly I don't like the name "progress LSN" much. What does
"progress" really mean in that context? Maybe "consistency LSN"?
- It's currently required to keep archive timeouts and checkpoints
from triggering each other, but I'm nervous about marking all xlog
switches as unimportant. I think it'd be better to only mark
timeout-triggered switches as such.
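
To illustrate the last point, here is a small self-contained toy model in
C. It is a sketch under stated assumptions, not the patch's code: ToyWal,
toy_insert() and toy_request_switch() are hypothetical names, the record
length of 24 bytes is arbitrary, and only the two flag values are taken
from the v19 patch. The model shows how a switch record could skip the
progress update only when the switch was triggered by archive_timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;		/* stand-in for PostgreSQL's LSN type */

/* flag values as defined by the v19 patch in xlog.h */
#define XLOG_INCLUDE_ORIGIN	0x01
#define XLOG_SKIP_PROGRESS	0x02

/* Toy WAL state: next insert position and start of last tracked record. */
typedef struct ToyWal
{
	XLogRecPtr	insert_pos;
	XLogRecPtr	progress_at;
} ToyWal;

/* Append a record of 'len' bytes and return its start position. */
XLogRecPtr
toy_insert(ToyWal *wal, uint32_t len, uint8_t flags)
{
	XLogRecPtr	start = wal->insert_pos;

	assert(len > 0);
	wal->insert_pos += len;

	/* records flagged XLOG_SKIP_PROGRESS leave the progress LSN alone */
	if ((flags & XLOG_SKIP_PROGRESS) == 0)
		wal->progress_at = start;
	return start;
}

/*
 * Hypothetical switch request: only switches caused by archive_timeout
 * are excluded from progress tracking.
 */
XLogRecPtr
toy_request_switch(ToyWal *wal, bool timeout_triggered)
{
	uint8_t		flags = timeout_triggered ? XLOG_SKIP_PROGRESS : 0;

	return toy_insert(wal, 24, flags);
}
```

With this shape, a manually requested switch still counts as activity,
while a timeout-triggered one cannot by itself cause the next checkpoint
or segment switch.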

Otherwise this seems to look good.

Regards,

Andres

#51

Robert Haas

robertmhaas@gmail.com

about 9 years ago

In reply to: Andres Freund (#50)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Wed, Dec 21, 2016 at 4:28 PM, Andres Freund <andres@anarazel.de> wrote:

- Similarly I don't like the name "progress LSN" much. What does
"progress" really mean in that context? Maybe "consistency LSN"?

Whoa. -1 from me for "consistency LSN". Consistency has to do with
whether the cluster has recovered up to the minimum recovery point or
whatever -- that is -- questions like "am I going to run into torn
pages?" and "should I expect some heap tuples to maybe be missing
index tuples, or the other way around?". What I think "progress LSN"
is getting at -- actually fairly well -- is whether we're getting
anything *important* done, not whether we are consistent. I don't
mind changing the name, but not to consistency LSN.
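
To make concrete what the value under discussion tracks, here is a
minimal standalone sketch in C of how the v19 patch uses it for the
checkpoint skip decision. This is a toy model, not the patch itself:
toy_get_progress(), toy_can_skip_checkpoint() and TOY_NUM_INSERT_LOCKS
are hypothetical names, locking is omitted, and XLogRecPtr is a plain
uint64_t stand-in. The idea modelled is that the newest tracked record
start is taken across all insertion slots, and a non-forced checkpoint is
skipped when that value still equals the previous checkpoint's LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;		/* stand-in for PostgreSQL's LSN type */

#define TOY_NUM_INSERT_LOCKS 8		/* hypothetical slot count */

/* Newest tracked record start across all slots (no locking in this toy). */
XLogRecPtr
toy_get_progress(const XLogRecPtr progress_at[], size_t nslots)
{
	XLogRecPtr	res = 0;

	assert(nslots > 0);
	for (size_t i = 0; i < nslots; i++)
	{
		if (res < progress_at[i])
			res = progress_at[i];
	}
	return res;
}

/*
 * A non-forced checkpoint can be skipped when the newest tracked record
 * is the previous checkpoint record itself.
 */
bool
toy_can_skip_checkpoint(XLogRecPtr progress_lsn, XLogRecPtr last_ckpt_lsn,
						bool forced)
{
	if (forced)
		return false;
	return progress_lsn == last_ckpt_lsn;
}
```

Records inserted with XLOG_SKIP_PROGRESS, such as standby snapshots,
never change any slot in this model, which is why an otherwise idle
system keeps satisfying the skip condition.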

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#52

Andres Freund

andres@anarazel.de

about 9 years ago

In reply to: Robert Haas (#51)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 2016-12-21 16:35:28 -0500, Robert Haas wrote:

On Wed, Dec 21, 2016 at 4:28 PM, Andres Freund <andres@anarazel.de> wrote:

- Similarly I don't like the name "progress LSN" much. What does
"progress" really mean in that context? Maybe "consistency LSN"?

Whoa. -1 from me for "consistency LSN". Consistency has to do with
whether the cluster has recovered up to the minimum recovery point or
whatever -- that is -- questions like "am I going to run into torn
pages?" and "should I expect some heap tuples to maybe be missing
index tuples, or the other way around?".

That's imo pretty much what progress LSN currently describes: have there
been any records which are important for durability/consistency and
hence need to be archived and such?

What I think "progress LSN"
is getting at -- actually fairly well -- is whether we're getting
anything *important* done, not whether we are consistent. I don't
mind changing the name, but not to consistency LSN.

Well, progress could just as well be replay. Or the actual insertion
point. Or up to where we've written out. Or synced out. Or
replicated....

Open to other suggestions - I'm not really happy with consistency LSN,
but definitely unhappy with progress LSN.

Greetings,

Andres Freund

#53

David Steele

david@pgmasters.net

about 9 years ago

In reply to: Andres Freund (#50)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hi Andres,

On 12/21/16 4:28 PM, Andres Freund wrote:

Working on committing this (tomorrow morning, not tonight). There's
some relatively minor things I want to change:

- I don't like the name XLogSetFlags() - it's completely unclear what
those flags refer to - it could just as well be replay
related. XLogSetRecordFlags()?

That sounds a bit more clear.

- Similarly I don't like the name "progress LSN" much. What does
"progress" really mean in that context? Maybe "consistency LSN"?

Yes, please. I think that really cuts to the core of what the patch is
about. Progress made perfect sense to me, but consistency is always the
goal, and what we are saying here is that this is the last xlog record
that is required to achieve consistency. Anything that happens to be
after it is informational only.

- It's currently required to keep archive timeouts and checkpoints
from triggering each other, but I'm nervous about marking all xlog
switches as unimportant. I think it'd be better to only mark
timeout-triggered switches as such.

That seems fine to me. If the system is truly idle that might trigger
one more xlog switch than is needed, but it seems like a reasonable
compromise.

--
-David
david@pgmasters.net

#54

David Steele

david@pgmasters.net

about 9 years ago

In reply to: Andres Freund (#52)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On 12/21/16 4:40 PM, Andres Freund wrote:

On 2016-12-21 16:35:28 -0500, Robert Haas wrote:

What I think "progress LSN"
is getting at -- actually fairly well -- is whether we're getting
anything *important* done, not whether we are consistent. I don't
mind changing the name, but not to consistency LSN.

Well, progress could just as well be replay. Or the actual insertion
point. Or up to where we've written out. Or synced out. Or
replicated....

Open to other suggestions - I'm not really happy with consistency LSN,
but definitely unhappy with progress LSN.

MinConsistencyLSN?

--
-David
david@pgmasters.net

#55

David G. Johnston

david.g.johnston@gmail.com

about 9 years ago

In reply to: Andres Freund (#52)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Wed, Dec 21, 2016 at 2:40 PM, Andres Freund <andres@anarazel.de> wrote:

That's imo pretty much what progress LSN currently describes: have there
been any records which are important for durability/consistency and
hence need to be archived and such?

The above, to me, describes a "milestone LSN"...

David J.

#56

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Andres Freund (#50)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Thu, Dec 22, 2016 at 6:28 AM, Andres Freund <andres@anarazel.de> wrote:

A mime-type of invalid/octet-stream? That's an, uh, odd choice.

Indeed. I am not sure what kind of accident happened here.
--
Michael

#57

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: David Steele (#53)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Thu, Dec 22, 2016 at 6:41 AM, David Steele <david@pgmasters.net> wrote:

On 12/21/16 4:28 PM, Andres Freund wrote:

Working on committing this (tomorrow morning, not tonight). There's
some relatively minor things I want to change:

Thanks for looking at this patch.

- I don't like the name XLogSetFlags() - it's completely unclear what
those flags refer to - it could just as well be replay
related. XLogSetRecordFlags()?

That sounds a bit more clear.

Fine for me.

- Similarly I don't like the name "progress LSN" much. What does
"progress" really mean in that context? Maybe "consistency LSN"?

Yes, please. I think that really cuts to the core of what the patch is
about. Progress made perfect sense to me, but consistency is always the
goal, and what we are saying here is that this is the last xlog record that
is required to achieve consistency. Anything that happens to be after it is
informational only.

Fine as well.

- It's currently required to avoid triggering archive timeouts and
checkpoints triggering each other, but I'm nervous marking all xlog
switches as unimportant. I think it'd be better to only mark timeout
triggered switches as such.

That seems fine to me. If the system is truly idle that might trigger one
more xlog switch than is needed, but it seems like a reasonable compromise.

On a long-running embedded system the difference won't matter much. So
I guess I'm fine with this bit as well.
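
To make the distinction above concrete, here is a minimal sketch, not code from the patch. It assumes a toy WAL model in which WalModel, model_request_switch() and the tiny SEG_SIZE are all hypothetical; only the idea of a mark_unimportant argument is taken from the version of the patch posted later in this thread. A timeout-triggered switch leaves the "last important" position alone, while an explicit switch advances it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical, tiny segment size to keep the numbers readable. */
#define SEG_SIZE 100

/* Toy model: only the two positions needed for this illustration. */
typedef struct WalModel
{
	XLogRecPtr	insert_pos;		/* where the next record would start */
	XLogRecPtr	last_important; /* start of the last important record */
} WalModel;

/*
 * Model of a segment switch: the switch record starts at the current insert
 * position and the insert position jumps to the start of the next segment.
 * Simplification: the real function does nothing when already at a segment
 * start and returns the end of the record, not its start.
 */
static XLogRecPtr
model_request_switch(WalModel *wal, bool mark_unimportant)
{
	XLogRecPtr	switch_lsn = wal->insert_pos;

	/* Only switches not marked unimportant count as activity. */
	if (!mark_unimportant)
		wal->last_important = switch_lsn;

	wal->insert_pos = (switch_lsn / SEG_SIZE + 1) * SEG_SIZE;
	return switch_lsn;
}
```

With this model, a switch requested by the archive timeout does not by itself give the next checkpoint a reason to run, whereas a switch requested by a user or by a backup does.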
--
Michael


#58

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Andres Freund (#52)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Thu, Dec 22, 2016 at 3:10 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-12-21 16:35:28 -0500, Robert Haas wrote:

On Wed, Dec 21, 2016 at 4:28 PM, Andres Freund <andres@anarazel.de> wrote:

- Similarly I don't like the name "progress LSN" much. What does
"progress" really mean in that context? Maybe "consistency LSN"?

Whoa. -1 from me for "consistency LSN". Consistency has to do with
whether the cluster has recovered up to the minimum recovery point or
whatever -- that is -- questions like "am i going to run into torn
pages?" and "should I expect some heap tuples to maybe be missing
index tuples, or the other way around?".

That's imo pretty much what progress LSN currently describes. Have there
been any records which are important for durability/consistency and
hence need to be archived and such.

What I think "progress LSN"
is getting at -- actually fairly well -- is whether we're getting
anything *important* done, not whether we are consistent. I don't
mind changing the name, but not to consistency LSN.

Well, progress could just as well be replay. Or the actual insertion
point. Or up to where we've written out. Or synced out. Or
replicated....

Open to other suggestions - I'm not really happy with consistency LSN,
but definitely unhappy with progress LSN.

last_essential_LSN?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#59

Andres Freund

andres@anarazel.de

about 9 years ago

In reply to: Andres Freund (#50)

1 attachment(s)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hi,

On 2016-12-21 13:28:54 -0800, Andres Freund wrote:

A mime-type of invalid/octet-stream? That's an, uh, odd choice.

Working on committing this (tomorrow morning, not tonight). There's
some relatively minor things I want to change:

- I don't like the name XLogSetFlags() - it's completely unclear what
those flags refer to - it could just as well be replay
related. XLogSetRecordFlags()?
- Similarly I don't like the name "progress LSN" much. What does
"progress" really mean in that context? Maybe "consistency LSN"?
- It's currently required to avoid triggering archive timeouts and
checkpoints triggering each other, but I'm nervous marking all xlog
switches as unimportant. I think it'd be better to only mark timeout
triggered switches as such.

Here's an updated version of this. Besides the above (with "consistency
LSN" now named "lastImportantAt" instead of the previous
lastProgressAt), I changed how the skipping works in the bgwriter: I
don't see any need to involve the checkpoint location there. This also
allows dropping GetLastCheckpointPtr(). Besides that, I did a fair amount
of comment-smithing.
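
The lastImportantAt bookkeeping can be pictured with a small standalone model. This is only a sketch under simplifying assumptions: the two flag values are the ones defined in the attached patch, but NUM_SLOTS, model_insert() and model_last_important() are hypothetical names, the types are stand-ins, and all locking is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL types, simplified for illustration. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Flag values as defined in the patch. */
#define XLOG_INCLUDE_ORIGIN		0x01
#define XLOG_MARK_UNIMPORTANT	0x02

/* Hypothetical slot count; the real NUM_XLOGINSERT_LOCKS may differ. */
#define NUM_SLOTS 8

/* Models only the lastImportantAt field of each insert lock (no locking). */
static XLogRecPtr last_important_at[NUM_SLOTS];

/* Record an insertion starting at start_pos, made through the given slot. */
static void
model_insert(int slot, XLogRecPtr start_pos, uint8_t flags)
{
	assert(slot >= 0 && slot < NUM_SLOTS);

	/* Unimportant records leave the slot's value untouched. */
	if ((flags & XLOG_MARK_UNIMPORTANT) == 0)
		last_important_at[slot] = start_pos;
}

/* Maximum over all slots, mirroring what GetLastImportantRecPtr() computes. */
static XLogRecPtr
model_last_important(void)
{
	XLogRecPtr	res = InvalidXLogRecPtr;

	for (int i = 0; i < NUM_SLOTS; i++)
	{
		if (res < last_important_at[i])
			res = last_important_at[i];
	}
	return res;
}
```

Because each inserter only writes the slot it already holds, the update needs no extra lock; the reader pays for that by scanning every slot.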

I plan to commit this later today. Hope I got the reviewers roughly right.

Regards,

Andres

Attachments:

0001-Skip-checkpoints-archiving-on-idle-systems.patch (text/x-patch; charset=us-ascii)

From c5c9e9ce114d5c058e171caaa172ccb2ac066f13 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 22 Dec 2016 08:31:32 -0800
Subject: [PATCH] Skip checkpoints, archiving on idle systems.

Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow marking individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
    https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
    https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
---
 doc/src/sgml/config.sgml                  |  10 +--
 src/backend/access/heap/heapam.c          |  10 +--
 src/backend/access/transam/xact.c         |   2 +-
 src/backend/access/transam/xlog.c         | 118 +++++++++++++++++++++++-------
 src/backend/access/transam/xlogfuncs.c    |   2 +-
 src/backend/access/transam/xloginsert.c   |  24 ++++--
 src/backend/postmaster/bgwriter.c         |   8 +-
 src/backend/postmaster/checkpointer.c     |  45 ++++++++----
 src/backend/replication/logical/message.c |   2 +-
 src/backend/storage/ipc/standby.c         |  11 ++-
 src/include/access/xlog.h                 |  12 ++-
 src/include/access/xlog_internal.h        |   4 +-
 src/include/access/xloginsert.h           |   2 +-
 13 files changed, 173 insertions(+), 77 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1b98c416e0..b6b20a368e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2852,12 +2852,10 @@ include_dir 'conf.d'
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
+        including a single checkpoint (checkpoints are skipped if there is
+        no database activity).  Note that archived files that are closed
+        early due to a forced switch are still the same length as completely
+        full files.  Therefore, it is unwise to use a very short
         <varname>archive_timeout</> &mdash; it will bloat your archive
         storage.  <varname>archive_timeout</> settings of a minute or so are
         usually reasonable.  You should consider using streaming replication,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1a0d..ea579a00be 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d6432165f1..e47fd4497e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5234,7 +5234,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index aa9ee5a0dd..f8ffa5c45c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,21 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * lastImportantAt contains the LSN of the last important WAL record inserted
+ * using a given lock. This value is used to detect if there has been
+ * important WAL activity since the last time some action, like a checkpoint,
+ * was performed - allowing to not repeat the action if not. The LSN is
+ * updated for all insertions, unless the XLOG_MARK_UNIMPORTANT flag was
+ * set. lastImportantAt is never cleared, only overwritten by the LSN of newer
+ * records.  Tracking the WAL activity directly in WALInsertLock has the
+ * advantage of not needing any additional locks to update the value.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	lastImportantAt;
 } WALInsertLock;
 
 /*
@@ -541,8 +551,9 @@ typedef struct XLogCtlData
 	XLogRecPtr	unloggedLSN;
 	slock_t		ulsn_lck;
 
-	/* Time of last xlog segment switch. Protected by WALWriteLock. */
+	/* Time and LSN of last xlog segment switch. Protected by WALWriteLock. */
 	pg_time_t	lastSegSwitchTime;
+	XLogRecPtr	lastSegSwitchLSN;
 
 	/*
 	 * Protected by info_lck and WALWriteLock (you must hold either lock to
@@ -884,6 +895,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. See
+ * XLogSetRecordFlags() for details.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -896,7 +910,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -1013,6 +1029,18 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		 */
 		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
 							StartPos, EndPos);
+
+		/*
+		 * Unless record is flagged as not important, update LSN of last
+		 * important record in the current slot. When holding all locks, just
+		 * update the first one.
+		 */
+		if ((flags & XLOG_MARK_UNIMPORTANT) == 0)
+		{
+			int lockno = holdingAllLocks ? 0 : MyLockNo;
+
+			WALInsertLocks[lockno].l.lastImportantAt = StartPos;
+		}
 	}
 	else
 	{
@@ -2332,6 +2360,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 					XLogArchiveNotifySeg(openLogSegNo);
 
 				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
 
 				/*
 				 * Request a checkpoint if we've consumed too much xlog since
@@ -4715,6 +4744,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7431,8 +7461,9 @@ StartupXLOG(void)
 	 */
 	InRecovery = false;
 
-	/* start the archive_timeout timer running */
+	/* start the archive_timeout timer and LSN running */
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
 	/* also initialize latestCompletedXid, to nextXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -7994,16 +8025,51 @@ GetFlushRecPtr(void)
 }
 
 /*
- * Get the time of the last xlog segment switch
+ * GetLastImportantRecPtr -- Returns the LSN of the last important record
+ * inserted. All records not explicitly marked as unimportant are considered
+ * important.
+ *
+ * The LSN is determined by computing the maximum of
+ * WALInsertLocks[i].lastImportantAt.
+ */
+XLogRecPtr
+GetLastImportantRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	last_important;
+
+		/*
+		 * Need to take a lock to prevent torn reads of the LSN, which are
+		 * possible on some of the supported platforms. WAL insert locks only
+		 * support exclusive mode, so we have to use that.
+		 */
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		last_important = WALInsertLocks[i].l.lastImportantAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < last_important)
+			res = last_important;
+	}
+
+	return res;
+}
+
+/*
+ * Get the time and LSN of the last xlog segment switch
  */
 pg_time_t
-GetLastSegSwitchTime(void)
+GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 {
 	pg_time_t	result;
 
 	/* Need WALWriteLock, but shared lock is sufficient */
 	LWLockAcquire(WALWriteLock, LW_SHARED);
 	result = XLogCtl->lastSegSwitchTime;
+	*lastSwitchLSN = XLogCtl->lastSegSwitchLSN;
 	LWLockRelease(WALWriteLock);
 
 	return result;
@@ -8065,7 +8131,7 @@ ShutdownXLOG(int code, Datum arg)
 		 * record will go to the next XLOG file and won't be archived (yet).
 		 */
 		if (XLogArchivingActive() && XLogArchiveCommandSet())
-			RequestXLogSwitch();
+			RequestXLogSwitch(false);
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
@@ -8253,7 +8319,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8334,38 +8400,33 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get location of last important record before acquiring insert locks (as
+	 * GetLastImportantRecPtr() also locks WAL locks).
+	 */
+	last_important_lsn = GetLastImportantRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity requiring a checkpoint, skip it.  The idea here is to
+	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (last_important_lsn == ControlFile->checkPoint)
 		{
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
+			ereport(DEBUG1,
+					(errmsg("checkpoint skipped due to an idle system")));
 			return;
 		}
 	}
@@ -9122,12 +9183,15 @@ XLogPutNextOid(Oid nextOid)
  * write a switch record because we are already at segment start.
  */
 XLogRecPtr
-RequestXLogSwitch(void)
+RequestXLogSwitch(bool mark_unimportant)
 {
 	XLogRecPtr	RecPtr;
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	if (mark_unimportant)
+		XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
@@ -9997,7 +10061,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 		 * recovery case described above.
 		 */
 		if (!backup_started_in_recovery)
-			RequestXLogSwitch();
+			RequestXLogSwitch(false);
 
 		do
 		{
@@ -10582,7 +10646,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 	 * Force a switch to a new xlog segment file, so that the backup is valid
 	 * as soon as archiver moves out the current segment file.
 	 */
-	RequestXLogSwitch();
+	RequestXLogSwitch(false);
 
 	XLByteToPrevSeg(stoppoint, _logSegNo);
 	XLogFileName(stopxlogfilename, ThisTimeLineID, _logSegNo);
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 01cbd90f40..bc7253fc9b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -293,7 +293,7 @@ pg_switch_xlog(PG_FUNCTION_ARGS)
 				 errmsg("recovery is in progress"),
 				 errhint("WAL control functions cannot be executed during recovery.")));
 
-	switchpoint = RequestXLogSwitch();
+	switchpoint = RequestXLogSwitch(false);
 
 	/*
 	 * As a convenience, return the WAL location of the switch record
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b19f..24e35a3845 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* flags for the in-progress insertion */
+static uint8 curinsert_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	curinsert_flags = 0;
 	begininsert_called = false;
 }
 
@@ -384,13 +384,20 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 }
 
 /*
- * Should this record include the replication origin if one is set up?
+ * Set insert status flags for the upcoming WAL record.
+ *
+ * The flags that can be used here are:
+ * - XLOG_INCLUDE_ORIGIN, to determine if the replication origin should be
+ *   included in the record.
+ * - XLOG_MARK_UNIMPORTANT, to signal that the record is not important for
+ *   durability, which allows to avoid triggering WAL archiving and other
+ *   background activity.
  */
 void
-XLogIncludeOrigin(void)
+XLogSetRecordFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+	curinsert_flags = flags;
 }
 
 /*
@@ -450,7 +457,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +708,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((curinsert_flags & XLOG_INCLUDE_ORIGIN) &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a31d44e799..25020ab3b8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -310,7 +310,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -324,11 +324,11 @@ BackgroundWriterMain(void)
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if enough time has passed and interesting records have
+			 * been inserted since the last snapshot.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				last_snapshot_lsn < GetLastImportantRecPtr())
 			{
 				last_snapshot_lsn = LogStandbySnapshot();
 				last_snapshot_ts = now;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 92b0a9416d..c875f40ece 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -573,15 +573,21 @@ CheckpointerMain(void)
 /*
  * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
  *
- * This will switch to a new WAL file and force an archive file write
- * if any activity is recorded in the current WAL file, including just
- * a single checkpoint record.
+ * This will switch to a new WAL file and force an archive file write if
+ * meaningful activity is recorded in the current WAL file. This includes most
+ * writes, including just a single checkpoint record, but excludes WAL records
+ * that were inserted with the XLOG_MARK_UNIMPORTANT flag being set (like
+ * snapshots of running transactions).  Such records, depending on
+ * configuration, occur on regular intervals and don't contain important
+ * information.  This avoids generating archives with a few unimportant
+ * records.
  */
 static void
 CheckArchiveTimeout(void)
 {
 	pg_time_t	now;
 	pg_time_t	last_time;
+	XLogRecPtr	last_switch_lsn;
 
 	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
 		return;
@@ -596,26 +602,33 @@ CheckArchiveTimeout(void)
 	 * Update local state ... note that last_xlog_switch_time is the last time
 	 * a switch was performed *or requested*.
 	 */
-	last_time = GetLastSegSwitchTime();
+	last_time = GetLastSegSwitchData(&last_switch_lsn);
 
 	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
 
-	/* Now we can do the real check */
+	/* Now we can do the real checks */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * Switch segment only when "important" WAL has been logged since the
+		 * last segment switch.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetLastImportantRecPtr() > last_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			/* mark switch as unimportant, avoids triggering checkpoints */
+			switchpoint = RequestXLogSwitch(true);
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f47c..2211a4b223 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 875dcec1eb..112fe07677 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -961,10 +961,11 @@ LogStandbySnapshot(void)
 /*
  * Record an enhanced snapshot of running transactions into WAL.
  *
- * The definitions of RunningTransactionsData and xl_xact_running_xacts
- * are similar. We keep them separate because xl_xact_running_xacts
- * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * The definitions of RunningTransactionsData and xl_xact_running_xacts are
+ * similar. We keep them separate because xl_xact_running_xacts is a
+ * contiguous chunk of memory and never exists fully until it is assembled in
+ * WAL. The inserted records are marked as not being important for durability,
+ * to avoid triggering superfluous checkpoint / archiving activity.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -981,6 +982,7 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 
 	/* Header */
 	XLogBeginInsert();
+	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
 	XLogRegisterData((char *) (&xlrec), MinSizeOfXactRunningXacts);
 
 	/* array of TransactionIds */
@@ -1035,6 +1037,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c908..7d21408c4a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,13 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record being inserted, set using XLogSetRecordFlags().
+ */
+#define XLOG_INCLUDE_ORIGIN		0x01	/* include the replication origin */
+#define XLOG_MARK_UNIMPORTANT	0x02	/* record not important for durability */
+
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +218,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +271,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462098..05f996b127 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -283,8 +283,8 @@ extern const RmgrData RmgrTable[];
 /*
  * Exported to support xlog switching from checkpointer
  */
-extern pg_time_t GetLastSegSwitchTime(void);
-extern XLogRecPtr RequestXLogSwitch(void);
+extern pg_time_t GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN);
+extern XLogRecPtr RequestXLogSwitch(bool mark_unimportant);
 
 extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177ef4e..307cfaaf47 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetRecordFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);
-- 
2.11.0.22.g8d7a455.dirty
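
The three consumers touched by the patch above each compare the last important LSN with a different reference point. The following sketch condenses those comparisons into plain predicates. It assumes the relevant timeout has already elapsed; IdleState and the three function names are hypothetical, and "forced" stands for the shutdown, end-of-recovery and CHECKPOINT_FORCE cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Simplified snapshot of the values each decision looks at. */
typedef struct IdleState
{
	XLogRecPtr	last_important_lsn;		/* GetLastImportantRecPtr() result */
	XLogRecPtr	last_checkpoint_lsn;	/* ControlFile->checkPoint */
	XLogRecPtr	last_switch_lsn;		/* XLogCtl->lastSegSwitchLSN */
	XLogRecPtr	last_snapshot_lsn;		/* bgwriter's last_snapshot_lsn */
} IdleState;

/* Checkpointer: skip if the last important record is the last checkpoint. */
static bool
checkpoint_can_be_skipped(const IdleState *s, bool forced)
{
	return !forced && s->last_important_lsn == s->last_checkpoint_lsn;
}

/* Archive timeout: switch only if something important followed the switch. */
static bool
archive_timeout_should_switch(const IdleState *s)
{
	return s->last_important_lsn > s->last_switch_lsn;
}

/* Bgwriter: log a standby snapshot only after new important activity. */
static bool
bgwriter_should_log_snapshot(const IdleState *s)
{
	return s->last_snapshot_lsn < s->last_important_lsn;
}
```

On an idle system, where the newest important record is the previous checkpoint record and only unimportant records (a standby snapshot, a timeout-triggered switch) follow it, all three predicates choose to do nothing.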

#60

Andres Freund

andres@anarazel.de

about 9 years ago

In reply to: Andres Freund (#59)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

Hi,

On 2016-12-22 08:32:56 -0800, Andres Freund wrote:

I plan to commit this later today. Hope I got the reviewers roughly right.

And pushed. Thanks for the work on this everyone.

Andres


#61

Robert Haas

robertmhaas@gmail.com

about 9 years ago

In reply to: Andres Freund (#60)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Thu, Dec 22, 2016 at 2:34 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-12-22 08:32:56 -0800, Andres Freund wrote:

I plan to commit this later today. Hope I got the reviewers roughly right.

And pushed. Thanks for the work on this everyone.

Cool. Also, +1 for the important/unimportant terminology. I like that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#62

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Robert Haas (#61)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

On Fri, Dec 23, 2016 at 8:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 22, 2016 at 2:34 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-12-22 08:32:56 -0800, Andres Freund wrote:

I plan to commit this later today. Hope I got the reviewers roughly right.

And pushed. Thanks for the work on this everyone.

Cool. Also, +1 for the important/unimportant terminology. I like that.

Thanks for the commit.
--
Michael


#63

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Michael Paquier (#62)

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

At Fri, 23 Dec 2016 11:02:11 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSsBzhOkCXyBh9_ZGUEnr0HCKRcpC9DMk6VVCGBez1pzA@mail.gmail.com>

On Fri, Dec 23, 2016 at 8:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 22, 2016 at 2:34 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-12-22 08:32:56 -0800, Andres Freund wrote:

I plan to commit this later today. Hope I got the reviewers roughly right.

And pushed. Thanks for the work on this everyone.

Cool. Also, +1 for the important/unimportant terminology. I like that.

Thanks for the commit.

Thanks for committing.

By the way, this issue seems to be in the ToDo list.

https://wiki.postgresql.org/wiki/Todo#Point-In-Time_Recovery_.28PITR.29

Consider avoiding WAL switching via archive_timeout if there
has been no database activity
- archive_timeout behavior for no activity
- Re: archive_timeout behavior for no activity

So I marked it as "done".

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
