hint bit cache v5

Started by Merlin Moncure over 14 years ago · 7 messages
#1 Merlin Moncure
mmoncure@gmail.com
1 attachment(s)

Attached is an updated version of the 'hint bit cache'.

What's changed:
*) 'bucket' was changed to 'page' everywhere
*) the rollup array now gets added to during 'set', not 'get' (pretty
dumb the way it was before -- it wasn't really dealing with non-commit
bits yet)
*) more source comments, including a description of the cache in the intro
*) now caching 'invalid' bits.

I went back and forth several times on whether to store invalid bits in
the same cache, a separate cache, or not at all. I finally settled
upon storing them in the same cache, which has some pros and cons. It
makes the cache more or less exactly like the clog cache (so I could
copy/paste some code out from there), but adds some overhead because
2-bit addressing is more expensive than 1-bit addressing -- this is
showing up in profiling... I'm upping the estimate of CPU-bound scan
overhead from 1% to 2%. Still fairly cheap, but I'm running into the
edge of where I can claim the cache is 'free' for most workloads --
any claim is worthless without real world testing though. Of course,
if tuple hint bits are set or PD_ALL_VISIBLE is set, you don't have to
pay that price.
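The 2-bit vs. 1-bit addressing cost is easy to see in isolation. Below is a minimal standalone sketch (the function names are illustrative, not from the patch) contrasting the two schemes: with two status bits per xid a byte holds only 4 transactions and the shift needs a multiply, while the commit-only scheme packs 8 per byte with a plain shift-and-mask.

```c
/* Two-bit addressing (commit + invalid status per xid), mirroring the
 * shape of the patch's HBCache_TransactionIdTo* macros: 4 xids per byte,
 * and the bit shift requires a multiply. */
static int
status_2bit(const unsigned char *page, unsigned int xid)
{
    int byteno = xid / 4;           /* cf. HBCACHE_XACTS_PER_BYTE */
    int bshift = (xid % 4) * 2;     /* cf. HBCACHE_BITS_PER_XACT */

    return (page[byteno] >> bshift) & 3;
}

/* One-bit addressing (commit status only): 8 xids per byte -- the cheaper
 * scheme considered later in the thread as a fallback. */
static int
status_1bit(const unsigned char *page, unsigned int xid)
{
    return (page[xid / 8] >> (xid % 8)) & 1;
}
```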

What's not:
*) Haven't touched any satisfies routine besides
HeapTupleSatisfiesMVCC (should they be?)
*) Haven't pushed the cache data into CacheMemoryContext. I figure
this is the way to go, but it requires an extra 'if' on every cache 'get'.
*) Didn't abstract the clog bit addressing macros. I'm leaning toward
not doing this, but maybe I should. My reasoning is that there is
no requirement that hint bit cache pages be whatever the block
size is, and I'd like to reserve the ability to adjust the cache page
size independently.

I'd like to know if this is a strategy that merits further work...If
anybody has time/interest that is. It's getting close to the point
where I can just post it to the commit fest for review. In
particular, I'm concerned if Tom's earlier objections can be
satisfied. If not, it's back to the drawing board...

merlin

Attachments:

hbache_v5.patch (application/octet-stream)
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
old mode 100644
new mode 100755
index 2fd32e0..d734626
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -7,12 +7,27 @@
  * "hint" status bits if we see that the inserting or deleting transaction
  * has now committed or aborted (and it is safe to set the hint bits).
  * If the hint bits are changed, SetBufferCommitInfoNeedsSave is called on
  * the passed-in buffer.  The caller must hold not only a pin, but at least
  * shared buffer content lock on the buffer containing the tuple.
  *
+ * Hint bits are maintained directly on the tuple itself and on a separate
+ * in-memory cache.  If the hint bit is found from the cache, the hint bit
+ * is set on the tuple but the page is not marked dirty.  This is done to
+ * reduce i/o resulting from hint-bit-only changes.  The cache itself is
+ * structured exactly like a clog page and uses a similar page size, but
+ * does not have to.  Cache maintenance occurs on every Nth (typically
+ * 100th) miss that successfully fetches the transaction status from clog.
+ *
+ * It's tempting to try and expand the simple last transaction status in
+ * transam.c but this check happens too late in the critical visibility
+ * routines to provide much benefit.  For the fastest possible cache
+ * performance with the least amount of impact on scan heavy cases, you have
+ * to maintain the cache here if you want to avoid dirtying the page
+ * based on interaction with the cache.
+ *
  * NOTE: must check TransactionIdIsInProgress (which looks in PGPROC array)
  * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
  * pg_clog).  Otherwise we have a race condition: we might decide that a
  * just-committed transaction crashed, because none of the tests succeed.
  * xact.c is careful to record commit/abort in pg_clog before it unsets
  * MyProc->xid in PGPROC array.  That fixes that problem, but it also
@@ -69,26 +84,297 @@
 /* Static variables representing various special snapshot semantics */
 SnapshotData SnapshotNowData = {HeapTupleSatisfiesNow};
 SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
 SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
 SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
 
+/* hint bit cache definitions and structures */
+
+#define HBCACHE_PAGESZ 8192
+/* 8192 bytes of storage in each hint bit cache page */
+#define HBCACHE_PAGE_COUNT 4		/* how many pages to keep */
+#define HBCACHE_ROLLUPSZ 100
+/* 100 cache misses are stored in the array before re-filling the cache */
+#define HBCACHE_HIT_THRESHOLD 5
+/* cache pages with fewer than 5 interesting transactions per cache re-fill
+ * are not considered interesting
+ */
+
+#define HBCACHE_BITS_PER_XACT	2
+#define HBCACHE_XACTS_PER_BYTE	4
+#define HBCACHE_XACTS_PER_PAGE	(HBCACHE_PAGESZ * HBCACHE_XACTS_PER_BYTE)
+#define HBCACHE_XACT_BITMASK	((1 << HBCACHE_BITS_PER_XACT) - 1)
+
+#define HBCACHE_COMITTED 1
+#define HBCACHE_INVALID  2
+
+#define HBCache_TransactionIdToPage(xid)	((xid) / (TransactionId) HBCACHE_XACTS_PER_PAGE)
+#define HBCache_TransactionIdToPgIndex(xid) ((xid) % (TransactionId) HBCACHE_XACTS_PER_PAGE)
+#define HBCache_TransactionIdToByte(xid)	(HBCache_TransactionIdToPgIndex(xid) / HBCACHE_XACTS_PER_BYTE)
+#define HBCache_TransactionIdToBIndex(xid)	((xid) % (TransactionId) HBCACHE_XACTS_PER_BYTE)
+
+
+/* describes a hint bit page in memory */
+typedef struct
+{
+	int XidPage;  /* this page describes a range of transaction ids */
+	int HitCount;	/* number of cache hits on this page */
+	bool Clear;
+	/* during rollup, signals the page is changing and cache data needs to
+	 * be cleared
+	 */
+} HintBitCachePageHeader;
+
 /* local functions */
 static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 
+static HintBitCachePageHeader HintbitCachePageList[HBCACHE_PAGE_COUNT]
+	= {{-1, 0, false}, {-1, 0, false}, {-1, 0, false}, {-1, 0, false}};
+static char HintBitCachePageList[HBCACHE_PAGE_COUNT][HBCACHE_PAGESZ];
+static int HintBitCachePageRollupList[HBCACHE_ROLLUPSZ];
+static int HintBitCacheMissCount = 0;
+
+/* qsort comparison function */
+static int
+int_cmp(const void *p1, const void *p2)
+{
+	int v1 = *((const int *) p1);
+	int v2 = *((const int *) p2);
+
+	if (v1 < v2)
+		return -1;
+	if (v1 > v2)
+		return 1;
+	return 0;
+}
+
+/*
+ * RollUpHintBitCache()
+ *
+ * Whenever a transaction is checked against the hint bit cache and the status
+ * is not there (a cache miss), the transaction's page is added to the
+ * rollup array. When that array fills, the rollup array is processed into
+ * a new set of cache pages.
+ *
+ * The algorithm is to sort the rollup array and iterate it counting the
+ * number of identical pages as we go.  When we hit a new page (or the
+ * end of the array) and the number of counted pages is over a minimum
+ * threshold, the page is moved into the cache if it has more cache hits
+ * than any page that is already in the cache.
+ */
+static void RollUpHintBitCache()
+{
+	TransactionId LastPage = HintBitCachePageRollupList[0];
+	/* The page we are duplicate tracking. */
+
+	int PageIdx;
+	int HitCount = 0;
+	/* effectively initializes to 1: the first iteration of the loop will
+	 * always match LastPage and increment this value
+	 */
+
+	int Page;
+	int PageMinHits = HBCACHE_ROLLUPSZ + 1;
+	/* The smallest number of hits for a page currently in the cache. It's
+	 * scanned comparing as we go, so initialize it to an impossibly high
+	 * value here */
+
+	int PageMinIdx;
+	/* The cache page index of the page with the smallest number of hits.
+	 * If a new cache page has a higher hit count than this one, it is
+	 * replaced by the new page.
+	 */
+	int XidPage;
+
+	bool AlreadyInCache;
+
+	/* sort the list in order to make counting identical pages easier */
+	qsort(HintBitCachePageRollupList, HBCACHE_ROLLUPSZ, sizeof(int), int_cmp);
+
+	for (PageIdx = 0; PageIdx < HBCACHE_ROLLUPSZ; PageIdx++)
+	{
+		/* If page is the same as the last one, increment the hitcount
+		 * or reset it and do cache management based on the number of hits
+		 */
+		XidPage = HintBitCachePageRollupList[PageIdx];
+
+		if(XidPage == LastPage)
+		        HitCount++;
+		else
+		{
+			/* Now check the page represented by LastPage to see if its
+			 * interesting for cache purposes.  To avoid thrashing cache pages
+			 * in and out, a minimum threshold of hits has to be met before
+			 * putting a page in the cache.
+			 */
+			if (HitCount >= HBCACHE_HIT_THRESHOLD)
+			{
+CheckLastXidBlock:
+				/* Look through the pages, find the one with the smallest
+				 * hit count, and save off the index.
+				 */
+				AlreadyInCache = false;
+
+				for (Page = 0; Page < HBCACHE_PAGE_COUNT; Page++)
+				{
+					if (HintbitCachePageList[Page].HitCount < PageMinHits)
+					{
+						PageMinIdx = Page;
+						PageMinHits = HintbitCachePageList[Page].HitCount;
+					}
+
+					/* If this page is already in the cache, we can leave it
+					 * in place without clearing it.
+					 */
+					if (HintbitCachePageList[Page].XidPage == LastPage)
+						AlreadyInCache = true;
+				}
+
+				/* does the page we are examining have a higher number of
+				 * cache hits than any page currently in the cache?
+				 */
+				if (!AlreadyInCache && HitCount > PageMinHits)
+				{
+					/* mark the cache buffer with the new page, hint count
+					 * and clear flag.
+					 */
+					HintbitCachePageList[PageMinIdx].XidPage = LastPage;
+					HintbitCachePageList[PageMinIdx].HitCount = HitCount;
+					HintbitCachePageList[PageMinIdx].Clear = true;
+				}
+			}
+
+			/* Reset the hit count.  Since we've technically already seen
+			 * this page in this iteration, HitCount is initialized to 1,
+			 * and this page is now the one tracked via LastPage.
+			 */
+			HitCount = 1;
+			LastPage = XidPage;
+		}
+	}
+
+	/* When the loop exits, the last run of identical pages may never have
+	 * been checked via the XidPage == LastPage condition.  Sneakily jump
+	 * back into the loop to check it if necessary.
+	 *
+	 * After the jump, the loop condition fails immediately, so the final
+	 * LastPage assignment never looks outside the array.
+	 */
+	if(HitCount >= HBCACHE_HIT_THRESHOLD)
+	{
+		/* Since HitCount is reset when a page is processed, checking it
+		 * is enough to guarantee the last page in the loop was never
+		 * processed.
+		 */
+		goto CheckLastXidBlock;
+	}
+
+	/* Cache pages that are new have to be cleared.  Look for ones marked as
+	 * such and reset the hit count for all pages.
+	 */
+	for (Page = 0; Page < HBCACHE_PAGE_COUNT; Page++)
+	{
+		if (HintbitCachePageList[Page].Clear)
+			memset(HintBitCachePageList[Page], 0, HBCACHE_PAGESZ);
+
+		HintbitCachePageList[Page].Clear = false;
+		HintbitCachePageList[Page].HitCount = 0;
+	}
+
+	HintBitCacheMissCount = 0;
+	return;
+}
+
+/*
+ * HintBitCacheXidStatus()
+ *
+ * Compute the page for a TransactionId, and compare it against the pages
+ * currently in the cache.  If the page is found, resolve the hint bit
+ * status's position in the page and return it.
+ */
+static inline int
+HintBitCacheXidStatus(TransactionId xid)
+{
+	int XidPage = HBCache_TransactionIdToPage(xid);
+	int Page;
+
+	for(Page = 0; Page < HBCACHE_PAGE_COUNT; Page++)
+	{
+		/* is the transaction status in the range defined by the page? */
+		if(HintbitCachePageList[Page].XidPage == XidPage)
+		{
+			int	byteno = HBCache_TransactionIdToByte(xid);
+			int bshift = HBCache_TransactionIdToBIndex(xid) * HBCACHE_BITS_PER_XACT;
+			char *byteptr = HintBitCachePageList[Page] + byteno;
+
+			return (*byteptr >> bshift) & HBCACHE_XACT_BITMASK;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * SetXidInHintBitCache()
+ *
+ * Compute the page for a TransactionId, and compare it against the pages
+ * currently in the cache.  If the page is found, compute the bit position
+ * and record the given status there.  The transaction's page is also added
+ * to the rollup array; when that array fills, the cache is rebuilt.
+ */
+static inline void
+SetXidInHintBitCache(TransactionId xid, int hbcache_status)
+{
+	int XidPage = HBCache_TransactionIdToPage(xid);
+	int Page;
+
+	for(Page = 0; Page < HBCACHE_PAGE_COUNT; Page++)
+	{
+	    if(XidPage == HintbitCachePageList[Page].XidPage)
+	    {
+			int	byteno = HBCache_TransactionIdToByte(xid);
+			int bshift = HBCache_TransactionIdToBIndex(xid) * HBCACHE_BITS_PER_XACT;
+			char *byteptr = HintBitCachePageList[Page] + byteno;
+			char byteval = *byteptr;
+
+			byteval &= ~(((1 << HBCACHE_BITS_PER_XACT) - 1) << bshift);
+			byteval |= (hbcache_status << bshift);
+			*byteptr = byteval;
+
+			/* there is a minute chance of overflow here, but it doesn't
+			 * seem worthwhile to test for it
+			 */
+			HintbitCachePageList[Page].HitCount++;
+			break;
+		}
+	}
+
+	HintBitCachePageRollupList[HintBitCacheMissCount++] = XidPage;
+
+	/* Enough transactions have fallen through the cache that it's time to
+	 * rebuild it.  Hopefully we don't have to do this too often
+	 */
+	if (HintBitCacheMissCount == HBCACHE_ROLLUPSZ)
+		RollUpHintBitCache();
+
+	return;
+}
+
 
 /*
  * SetHintBits()
  *
  * Set commit/abort hint bits on a tuple, if appropriate at this time.
  *
  * It is only safe to set a transaction-committed hint bit if we know the
  * transaction's commit record has been flushed to disk.  We cannot change
  * the LSN of the page here because we may hold only a share lock on the
  * buffer, so we can't use the LSN to interlock this; we have to just refrain
  * from setting the hint bit until some future re-examination of the tuple.
+ * Whether the hint bit could actually be set is returned to the caller
+ * (the hint bit cache only stores safe bits, to avoid repeated checks).
  *
  * We can always set hint bits when marking a transaction aborted.	(Some
  * code in heapam.c relies on that!)
  *
  * Also, if we are cleaning up HEAP_MOVED_IN or HEAP_MOVED_OFF entries, then
  * we can always set the hint bits, since pre-9.0 VACUUM FULL always used
@@ -101,27 +387,35 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
  * Normal commits may be asynchronous, so for those we need to get the LSN
  * of the transaction and then check whether this is flushed.
  *
  * The caller should pass xid as the XID of the transaction to check, or
  * InvalidTransactionId if no check is needed.
  */
-static inline void
+static inline bool
 SetHintBits(HeapTupleHeader tuple, Buffer buffer,
 			uint16 infomask, TransactionId xid)
 {
 	if (TransactionIdIsValid(xid))
 	{
 		/* NB: xid must be known committed here! */
 		XLogRecPtr	commitLSN = TransactionIdGetCommitLSN(xid);
 
 		if (XLogNeedsFlush(commitLSN))
-			return;				/* not flushed yet, so don't set hint */
+			return false;			/* not flushed yet, so don't set hint */
 	}
 
 	tuple->t_infomask |= infomask;
 	SetBufferCommitInfoNeedsSave(buffer);
+	return true;
+}
+
+static inline void
+SetHintBitsCache(HeapTupleHeader tuple, Buffer buffer,
+				uint16 infomask, TransactionId xid)
+{
+	tuple->t_infomask |= infomask;
 }
 
 /*
  * HeapTupleSetHintBits --- exported version of SetHintBits()
  *
  * This must be separate because of C99's brain-dead notions about how to
@@ -911,91 +1205,127 @@ HeapTupleSatisfiesDirty(HeapTupleHeader tuple, Snapshot snapshot,
 bool
 HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
 					   Buffer buffer)
 {
 	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
 	{
+		int hbcache_status;
+
 		if (tuple->t_infomask & HEAP_XMIN_INVALID)
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
+		hbcache_status = HintBitCacheXidStatus(HeapTupleHeaderGetXmin(tuple));
 
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
+		if (hbcache_status == HBCACHE_COMITTED)
+		{
+			/* The hint bit was not set on the tuple, but it was set in the
+			 * cache.  Since cache bits are only set if the transaction's
+			 * commit record has been flushed to disk, we can call a quicker
+			 * version of SetHintBits that does not check the commit record
+			 * and, because the page is not marked dirty, does not have to
+			 * lock the buffer.
+			 */
+	        SetHintBitsCache(tuple, buffer, HEAP_XMIN_COMMITTED,
+                                HeapTupleHeaderGetXmin(tuple));
 		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
+		else if (hbcache_status == HBCACHE_INVALID)
 		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
+	        SetHintBitsCache(tuple, buffer, HEAP_XMIN_INVALID,
+                                HeapTupleHeaderGetXmin(tuple));
 
-			if (!TransactionIdIsCurrentTransactionId(xvac))
+			return false;
+		}
+		else
+		{
+			/* Used by pre-9.0 binary upgrades */
+			if (tuple->t_infomask & HEAP_MOVED_OFF)
 			{
-				if (TransactionIdIsInProgress(xvac))
+				TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
+
+				if (TransactionIdIsCurrentTransactionId(xvac))
 					return false;
-				if (TransactionIdDidCommit(xvac))
+				if (!TransactionIdIsInProgress(xvac))
+				{
+					if (TransactionIdDidCommit(xvac))
+					{
+						SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
+									InvalidTransactionId);
+						return false;
+					}
 					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
 								InvalidTransactionId);
-				else
+				}
+			}
+			/* Used by pre-9.0 binary upgrades */
+			else if (tuple->t_infomask & HEAP_MOVED_IN)
+			{
+				TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
+
+				if (!TransactionIdIsCurrentTransactionId(xvac))
 				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
+					if (TransactionIdIsInProgress(xvac))
+						return false;
+					if (TransactionIdDidCommit(xvac))
+						SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
+									InvalidTransactionId);
+					else
+					{
+						SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
+									InvalidTransactionId);
+						return false;
+					}
 				}
 			}
-		}
-		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple)))
-		{
-			if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
-				return false;	/* inserted after scan started */
+			else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple)))
+			{
+				if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
+					return false;	/* inserted after scan started */
 
-			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
-				return true;
+				if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
+					return true;
 
-			if (tuple->t_infomask & HEAP_IS_LOCKED)		/* not deleter */
-				return true;
+				if (tuple->t_infomask & HEAP_IS_LOCKED)		/* not deleter */
+					return true;
 
-			Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));
+				Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));
 
-			if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tuple)))
-			{
-				/* deleting subtransaction must have aborted */
-				SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
-							InvalidTransactionId);
-				return true;
+				if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tuple)))
+				{
+					/* deleting subtransaction must have aborted */
+					SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
+								InvalidTransactionId);
+					return true;
+				}
+
+				if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
+					return true;	/* deleted after scan started */
+				else
+					return false;	/* deleted before scan started */
 			}
+			else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple)))
+				return false;
+			else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
+			{
+				/* Set the hint bit on the tuple, and put it in the cache
+				 * if it was actually set.
+				 */
+				if(SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
+							HeapTupleHeaderGetXmin(tuple)))
+					SetXidInHintBitCache(HeapTupleHeaderGetXmin(tuple),
+								HBCACHE_COMITTED);
 
-			if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
-				return true;	/* deleted after scan started */
+			}
 			else
-				return false;	/* deleted before scan started */
-		}
-		else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple)))
-			return false;
-		else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
-			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-						HeapTupleHeaderGetXmin(tuple));
-		else
-		{
-			/* it must have aborted or crashed */
-			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-						InvalidTransactionId);
-			return false;
+			{
+				/* it must have aborted or crashed */
+				if(SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
+							InvalidTransactionId))
+					SetXidInHintBitCache(HeapTupleHeaderGetXmin(tuple),
+								HBCACHE_INVALID);
+				return false;
+			}
 		}
 	}
 
 	/*
 	 * By here, the inserting transaction has committed - have to check
 	 * when...
#2 Simon Riggs
simon@2ndquadrant.com
In reply to: Merlin Moncure (#1)
Re: hint bit cache v5

On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

I'd like to know if this is a strategy that merits further work...If
anybody has time/interest that is.  It's getting close to the point
where I can just post it to the commit fest for review.  In
particular, I'm concerned if Tom's earlier objections can be
satisfied. If not, it's back to the drawing board...

I'm interested in what you're doing here.

From here, there's quite a lot of tuning possibilities. It would be
very useful to be able to define some metrics we are interested in
reducing and working out how to measure them.

That way we can compare all the different variants of this to see
which way of doing things works best.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#3 Merlin Moncure
mmoncure@gmail.com
In reply to: Simon Riggs (#2)
Re: hint bit cache v5

On Tue, May 10, 2011 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

I'd like to know if this is a strategy that merits further work...If
anybody has time/interest that is.  It's getting close to the point
where I can just post it to the commit fest for review.  In
particular, I'm concerned if Tom's earlier objections can be
satisfied. If not, it's back to the drawing board...

I'm interested in what you're doing here.

From here, there's quite a lot of tuning possibilities. It would be
very useful to be able to define some metrics we are interested in
reducing and working out how to measure them.

That way we can compare all the different variants of this to see
which way of doing things works best.

thanks for that! I settled on this approach because the downside
cases should hopefully be pretty limited. The upside is a matter of
debate although fairly trivial to demonstrate synthetically.

I'm looking for some way of benchmarking the benefits in non-simulated
fashion. We desperately need something like a performance farm (as
many many others have mentioned).

merlin

#4 Merlin Moncure
mmoncure@gmail.com
In reply to: Simon Riggs (#2)
2 attachment(s)
Re: hint bit cache v5

On Tue, May 10, 2011 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

I'd like to know if this is a strategy that merits further work...If
anybody has time/interest that is.  It's getting close to the point
where I can just post it to the commit fest for review.  In
particular, I'm concerned if Tom's earlier objections can be
satisfied. If not, it's back to the drawing board...

I'm interested in what you're doing here.

From here, there's quite a lot of tuning possibilities. It would be
very useful to be able to define some metrics we are interested in
reducing and working out how to measure them.

Following are results that are fairly typical of the benefits you
might see when the optimization kicks in. The attached benchmark just
creates a bunch of records in a random table and scans it. This is
more or less the scenario that causes people to gripe about hint bit
i/o, especially in systems that are already under moderate to heavy
i/o stress. I'm gonna call it 20%, although it could be less if
you have an i/o system that spanks the test (try cranking -c and the
number of records created in bench.sql in that case). Anecdotal reports of
extreme duress caused by hint bit i/o suggest problematic or mixed use
(OLTP + OLAP) workloads might see even more benefit. One thing I need
to test is how much benefit you'll see with wider records.

I think I'm gonna revert the change to cache invalid bits. I just
don't see hint bits as a major contributor to dead tuples following
epic rollbacks (really, the solution for that case is simply to try
and not get in that scenario if you can). This will put the code back
into the cheaper and simpler bit-per-transaction addressing. What I
do plan to do, though, is check and set xmax commit bits in the
cache... that way deleted tuples will see cache benefits.

[hbcache]
merlin@mmoncure-ubuntu:~$ time pgbench -c 4 -n -T 200 -f bench.sql
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 200 s
number of transactions actually processed: 8
tps = 0.037167 (including connections establishing)
tps = 0.037171 (excluding connections establishing)

real 3m35.549s
user 0m0.008s
sys 0m0.004s

[HEAD]
merlin@mmoncure-ubuntu:~$ time pgbench -c 4 -n -T 200 -f bench.sql
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 200 s
number of transactions actually processed: 8
tps = 0.030313 (including connections establishing)
tps = 0.030317 (excluding connections establishing)

real 4m24.216s
user 0m0.000s
sys 0m0.012s

Attachments:

bench.sql (application/octet-stream)
bench_setup.sql (application/octet-stream)
#5 Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#4)
Re: hint bit cache v5

On Wed, May 11, 2011 at 10:38 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

One thing I need to test is how much benefit you'll see with wider records.

The results are a bit better, around 25% using a similar methodology
on ~ 1k wide records.

I think I'm gonna revert the change to cache invalid bits. I just
don't see hint bits as a major contributor to dead tuples following
epic rollbacks

what I meant to say here was, I don't see hint bit i/o following
rollbacks as a major issue. Point being, I don't see much use in
optimizing management of INVALID tuple bits beyond what is already
done.

Anyways, demonstrating a 'good' case is obviously not the whole story.
But what are the downsides? There are basically two:

1) tiny cpu penalty on every heap fetch
2) extremely widely dispersed (in terms of transaction id) unhinted
tuples can force the cache to refresh every 100 tuples in the absolute
worst case. A cache refresh is a 100 int sort and a loop.

For '1', the absolute worst case I can come up with, cpu bound scans
of extremely narrow records, the overall cpu usage goes up around 1%.
'2' seems just impossible to see in the real world -- and if it does,
you are also paying for lots of clog lookups all the way through the
slru, and you are having i/o and other issues on top of it. Even if
all the stars align and it does happen, all the tuples get hinted and
dirtied anyways so it will only happen at most once on that particular
set of data.
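The "100 int sort and a loop" in '2' is cheap enough to sketch directly. The following is a hedged standalone approximation of what RollUpHintBitCache does (longest_run is a made-up helper, not in the patch): sort the rollup array, then count runs of identical pages in a single pass.

```c
#include <stdlib.h>

/* qsort comparator, same shape as the patch's int_cmp */
static int
int_cmp(const void *p1, const void *p2)
{
    int v1 = *(const int *) p1;
    int v2 = *(const int *) p2;

    return (v1 > v2) - (v1 < v2);
}

/* Approximation of the per-refresh work: one qsort of the rollup array
 * plus a single pass counting runs of identical page numbers.  Here we
 * just return the longest run; the patch instead compares each run
 * against the hit counts of the pages already in the cache. */
static int
longest_run(int *rollup, int n)
{
    int best = 1, run = 1, i;

    qsort(rollup, n, sizeof(int), int_cmp);
    for (i = 1; i < n; i++)
    {
        run = (rollup[i] == rollup[i - 1]) ? run + 1 : 1;
        if (run > best)
            best = run;
    }
    return best;
}
```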

merlin

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Merlin Moncure (#4)
Re: hint bit cache v5

On Wed, May 11, 2011 at 4:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

Following are results that are fairly typical of the benefits you
might see when the optimization kicks in.  The attached benchmark just

[hbcache]
real    3m35.549s

[HEAD]
real    4m24.216s

These numbers look very good. Thanks for responding to my request.

What people have said historically at this point is "ah, but you've
just deferred the pain from clog lookups".

The best way to show this does what we hope is to run a normal-ish
OLTP access to the table that would normally thrash the clog and show
no ill effects there either. Run that immediately after the above
tests so that the cache and hint bits are both primed.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#7 Merlin Moncure
mmoncure@gmail.com
In reply to: Simon Riggs (#6)
Re: hint bit cache v5

On Wed, May 11, 2011 at 12:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, May 11, 2011 at 4:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

Following are results that are fairly typical of the benefits you
might see when the optimization kicks in.  The attached benchmark just

[hbcache]
real    3m35.549s

[HEAD]
real    4m24.216s

These numbers look very good. Thanks for responding to my request.

What people have said historically at this point is "ah, but you've
just deferred the pain from clog lookups".

Deferred, or eliminated. If any tuple on the page gets updated,
deleted, etc., or the table itself is dropped, then you've
'won'... the page with the hint-bit-only change was never booted out to
disk before another substantive change happened. This is exactly
what happens in certain common workloads -- you insert a bunch of
records, scan them with some batch process, then delete them. Let's
say a million records were inserted under a single transaction and you
are getting bounced out of the transam.c cache, you just made a
million calls to TransactionIdIsCurrentTransactionId and (especially)
TransactionIdIsInProgress for the *exact same* transaction_id, over
and over. That stuff adds up even before looking at the i/o incurred.
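The "exact same transaction_id, over and over" pattern above is what a one-entry status memo (like the last-transaction-status cache in transam.c mentioned in message #1) collapses. A hypothetical sketch, none of these names are from PostgreSQL:

```c
typedef unsigned int TransactionId;

static TransactionId cachedXid;   /* 0 plays the role of InvalidTransactionId */
static int cachedStatus;
static long expensiveLookups;     /* counts trips down the slow path */

/* Stand-in for a clog/slru fetch; in the scenario above this would be
 * the TransactionIdIsInProgress / TransactionIdDidCommit sequence. */
static int
slow_status_lookup(TransactionId xid)
{
    expensiveLookups++;
    return 1;                     /* pretend every xid committed */
}

/* One-entry memo: a million tuples inserted under a single transaction
 * cost one real lookup instead of a million. */
static int
xid_status(TransactionId xid)
{
    if (xid != cachedXid)
    {
        cachedStatus = slow_status_lookup(xid);
        cachedXid = xid;
    }
    return cachedStatus;
}
```

As the patch intro notes, this transam.c-level check happens too late in the critical visibility routines to provide much benefit, which is why the patch caches at the tqual.c level instead.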

Put another way, the tuple hint bits have a lot of usefulness when the
tuples on the page are coming from all kinds of differently aged
transactions. When all the tuples have the same or similar xid, the
information value is quite low, and the i/o price isn't worth it. The
cache neatly haircuts the downside case. If the cache isn't helping
(any tuple fetch on the page faults through it), the page is dirtied
and the next time it's fetched all the bits will be set.

The best way to show this does what we hope is to run a normal-ish
OLTP access to the table that would normally thrash the clog and show
no ill effects there either. Run that immediately after the above
tests so that the cache and hint bits are both primed.

yeah. the only easy way I know of to do this is extremely long pgbench
runs, and getting good results is harder than it sounds... if the tuple
hint bits make it to disk (via vacuum or a cache fault), they stay
there and that tuple is no longer interesting from the cache point of
view.

If you make the scale really large the test will just take forever
just to get the tables primed (like a week). Keep in mind, autovacuum
can roll around at any time and set the bits under you (you can of
course disable it, but who really does than on OLTP?). Small scale
oltp tests are not real world realistic because anybody sane would
just let autovacuum loose on the table. clog thrashing systems are
typically mature, high load oltp databases...not fun to test on your
single 7200 rpm drive.

I'm going to boldly predict that with all the i/o flying around in
cases like that, the paltry cpu cycles spent dealing with the cache
are the least of your problems. Not discounting the need to verify
that though.

merlin