Small computeRegionDelta optimization.

Started by Konstantin Knizhnik almost 6 years ago, 1 message
#1 Konstantin Knizhnik
k.knizhnik@postgrespro.ru
1 attachment(s)

Hi hackers,

While playing with my undo-log storage, I found that its performance is
mostly limited by the generic WAL record mechanism,
and particularly by the computeRegionDelta function, which computes a page
delta for each logged operation.

I noticed that computeRegionDelta also becomes a bottleneck in other cases
where generic WAL records are used, for example in the RUM extension.
This is the profile of inserting records into a table with a RUM index:

  32.99%  postgres  postgres            [.] computeRegionDelta
   6.13%  postgres  rum.so              [.] updateItemIndexes
   4.61%  postgres  postgres            [.] hash_search_with_hash_value
   4.53%  postgres  postgres            [.] GenericXLogRegisterBuffer
   3.74%  postgres  rum.so              [.] rumTraverseLock
   3.33%  postgres  rum.so              [.] rumtuple_get_attrnum
   3.24%  postgres  rum.so              [.] dataPlaceToPage
   3.14%  postgres  postgres            [.] writeFragment
   2.99%  postgres  libc-2.23.so        [.] __memcpy_avx_unaligned
   2.81%  postgres  postgres            [.] nocache_index_getattr
   2.72%  postgres  rum.so              [.] rumPlaceToDataPageLeaf
   1.93%  postgres  postgres            [.] pg_comp_crc32c_sse42
   1.87%  postgres  rum.so              [.] findInLeafPage
   1.77%  postgres  postgres            [.] PinBuffer
   1.52%  postgres  rum.so              [.] compareRumItem
   1.49%  postgres  postgres            [.] FunctionCall2Coll
   1.34%  postgres  rum.so              [.] entryLocateEntry
   1.22%  postgres  libc-2.23.so        [.] __memcmp_sse4_1
   0.97%  postgres  postgres            [.] LWLockAttemptLock

I noticed that computeRegionDelta performs a byte-by-byte comparison of the page.
The obvious optimization is to compare words instead of bytes.
A small patch with this optimization is attached.
It may certainly lead to a small increase in the size of the produced deltas.
It is possible to calculate deltas more precisely: use word comparison
to find the rough location of a region and then locate its precise boundaries
using byte comparisons.
But that complicates the algorithm and so makes it slower.
In practice, taking into account that the header of a record in Postgres is 24
bytes long and fields are usually aligned on 4/8-byte boundaries,
I think that calculating deltas in words is preferable.
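
To make the trade-off above concrete, here is a minimal standalone sketch of
the two granularities. It is not the attached patch and not code from
generic_xlog.c: the helper names first_diff_word and first_diff_byte are
hypothetical, and the buffer length is assumed to be a multiple of 8 (as a
Postgres page size is). memcpy is used for the word loads so that the sketch
does not depend on the buffers being 8-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_SIZE sizeof(uint64_t)

/*
 * Hypothetical helper: return the byte offset of the first 8-byte word in
 * which a and b differ, or nbytes if all words match.  The result is always
 * a multiple of 8, so it may point up to 7 bytes before the real change.
 * Assumes nbytes is a multiple of WORD_SIZE.
 */
static size_t
first_diff_word(const char *a, const char *b, size_t nbytes)
{
	size_t		off;

	for (off = 0; off + WORD_SIZE <= nbytes; off += WORD_SIZE)
	{
		uint64_t	wa;
		uint64_t	wb;

		/* memcpy avoids any alignment assumption about a and b */
		memcpy(&wa, a + off, WORD_SIZE);
		memcpy(&wb, b + off, WORD_SIZE);
		if (wa != wb)
			return off;
	}
	return nbytes;
}

/*
 * Hypothetical helper: locate the rough position with word comparison,
 * then refine to the exact first differing byte with byte comparison.
 */
static size_t
first_diff_byte(const char *a, const char *b, size_t nbytes)
{
	size_t		off = first_diff_word(a, b, nbytes);

	while (off < nbytes && a[off] == b[off])
		off++;
	return off;
}
```

For two 64-byte buffers that differ only at byte 19, first_diff_word reports
offset 16 while first_diff_byte reports 19. A word-granular delta would
therefore log 3 extra leading bytes in this case, which is the small increase
in delta size mentioned above.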

Results of this optimization:
Performance of my UNDAM storage increased from 6500 TPS to 7000 TPS
(vs. 8500 TPS for an unlogged table),
and computeRegionDelta drops from 32.99% to 4.85% in the RUM profile:

   9.37%  postgres  rum.so              [.] updateItemIndexes
   6.57%  postgres  postgres            [.] GenericXLogRegisterBuffer
   5.85%  postgres  postgres            [.] hash_search_with_hash_value
   5.54%  postgres  rum.so              [.] rumTraverseLock
   5.09%  postgres  rum.so              [.] dataPlaceToPage
   4.85%  postgres  postgres            [.] computeRegionDelta
   4.78%  postgres  rum.so              [.] rumtuple_get_attrnum
   4.28%  postgres  postgres            [.] nocache_index_getattr
   4.23%  postgres  rum.so              [.] rumPlaceToDataPageLeaf
   3.39%  postgres  postgres            [.] pg_comp_crc32c_sse42
   3.16%  postgres  libc-2.23.so        [.] __memcpy_avx_unaligned
   2.72%  postgres  rum.so              [.] findInLeafPage
   2.64%  postgres  postgres            [.] PinBuffer
   2.22%  postgres  postgres            [.] FunctionCall2Coll
   2.22%  postgres  rum.so              [.] compareRumItem
   1.91%  postgres  rum.so              [.] entryLocateEntry

But... the RUM insertion time is almost unchanged: 1770 seconds vs. 1881
seconds.
It looks like it was mostly limited by the time spent writing data to disk.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

generic_wal.patch (text/x-patch; name=generic_wal.patch)
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 5164a1c..59d5466 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -117,7 +117,7 @@ writeFragment(PageData *pageData, OffsetNumber offset, OffsetNumber length,
  */
 static void
 computeRegionDelta(PageData *pageData,
-				   const char *curpage, const char *targetpage,
+				   const char *currPage, const char *targetPage,
 				   int targetStart, int targetEnd,
 				   int validStart, int validEnd)
 {
@@ -125,6 +125,13 @@ computeRegionDelta(PageData *pageData,
 				loopEnd,
 				fragmentBegin = -1,
 				fragmentEnd = -1;
+	const int64 *curpage = (const int64 *) currPage;
+	const int64 *targetpage = (const int64 *) targetPage;
+
+	targetStart >>= 3;
+	validStart >>= 3;
+	targetEnd = (targetEnd + 7) >> 3;
+	validEnd = (validEnd + 7) >> 3;
 
 	/* Deal with any invalid start region by including it in first fragment */
 	if (validStart > targetStart)
@@ -189,11 +196,11 @@ computeRegionDelta(PageData *pageData,
 		 * fragmentEnd value, which is why it's OK that we unconditionally
 		 * assign "fragmentEnd = i" above.
 		 */
-		if (fragmentBegin >= 0 && i - fragmentEnd > MATCH_THRESHOLD)
+		if (fragmentBegin >= 0 && (i - fragmentEnd)*8 > MATCH_THRESHOLD)
 		{
-			writeFragment(pageData, fragmentBegin,
-						  fragmentEnd - fragmentBegin,
-						  targetpage + fragmentBegin);
+			writeFragment(pageData, fragmentBegin*8,
+						  (fragmentEnd - fragmentBegin)*8,
+						  targetPage + fragmentBegin*8);
 			fragmentBegin = -1;
 			fragmentEnd = -1;	/* not really necessary */
 		}
@@ -212,9 +219,9 @@ computeRegionDelta(PageData *pageData,
 	{
 		if (fragmentEnd < 0)
 			fragmentEnd = targetEnd;
-		writeFragment(pageData, fragmentBegin,
-					  fragmentEnd - fragmentBegin,
-					  targetpage + fragmentBegin);
+		writeFragment(pageData, fragmentBegin*8,
+					  (fragmentEnd - fragmentBegin)*8,
+					  targetPage + fragmentBegin*8);
 	}
 }