B-tree cache prefetches
Hi hackers!
I've been at a database conference, and everyone here is talking about cache prefetches.
I've tried a simple hack:
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..ffddf553aa 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -401,6 +401,13 @@ _bt_binsrch(Relation rel,
/* We have low <= mid < high, so mid points at a real slot */
+ OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
+ if (x < high)
+ __builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+ x = low + ((mid - low) / 2);
+ if (x > low)
+ __builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+
result = _bt_compare(rel, keysz, scankey, page, mid);
if (result >= cmpval)
The idea is pretty simple: our search is going to thrash the cache anyway, so let's try to get at least some of it back by prefetching the possible next steps of the binary search.
And it seems to me that on a simple query
insert into x select (random()*1000000)::int from generate_series(1,1e7);
it brings something like a 2-4% performance improvement on my laptop.
Is there a reason why we do not use __builtin_prefetch? Has anyone tried to use cache prefetching?
Best regards, Andrey Borodin.
On Thu, Aug 30, 2018 at 10:53 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
The idea is pretty simple: our search is going to thrash the cache anyway, so let's try to get at least some of it back by prefetching the possible next steps of the binary search.
And it seems to me that on a simple query
insert into x select (random()*1000000)::int from generate_series(1,1e7);
it brings something like a 2-4% performance improvement on my laptop.
Is there a reason why we do not use __builtin_prefetch? Has anyone tried to use cache prefetching?
I once wrote a patch that used __builtin_prefetch() when fetching
tuples from a tuplesort. It worked reasonably well on my laptop, but
didn't seem to do much on another machine with another
microarchitecture (presumably the server with the alternative
microarchitecture had superior hardware prefetching). The conclusion
was that it wasn't really worth pursuing.
I'm not dismissing your idea. I'm just pointing out that the burden of
proving that explicit prefetching is a good idea is rather
significant. We especially want to avoid something that needs to be
reassessed every couple of years.
--
Peter Geoghegan
On Fri, Aug 31, 2018 at 5:53 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Hi hackers!
I've been at a database conference, and everyone here is talking about cache prefetches.
I've tried a simple hack:
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..ffddf553aa 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -401,6 +401,13 @@ _bt_binsrch(Relation rel,
/* We have low <= mid < high, so mid points at a real slot */
+ OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
+ if (x < high)
+ __builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+ x = low + ((mid - low) / 2);
+ if (x > low)
+ __builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+
result = _bt_compare(rel, keysz, scankey, page, mid);
if (result >= cmpval)
The idea is pretty simple: our search is going to thrash the cache anyway, so let's try to get at least some of it back by prefetching the possible next steps of the binary search.
And it seems to me that on a simple query
insert into x select (random()*1000000)::int from generate_series(1,1e7);
it brings something like a 2-4% performance improvement on my laptop.
Is there a reason why we do not use __builtin_prefetch? Has anyone tried to use cache prefetching?
A related topic is the cache-unfriendliness of traditional binary
searches of sorted data. Check out "Array Layouts for
Comparison-Based Searching"[1] if you haven't already. It says that
if your array fits in L2 cache, as our btree pages do, then binary
search still wins against Eytzinger et al, which I was a bit
disappointed about when I was (casually) investigating new layouts
based on some comments from a fellow hacker (I can't remember if it
was Andres or Peter G who mentioned this topic to me). However, the
paper is talking about "branch-free binary search with explicit
prefetch", which apparently eats plain old branchy binary search for
breakfast, and I gather they tested on a lot of hardware. That sounds
interesting to look into.
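For reference, the search the paper calls "branch-free binary search with explicit prefetch" looks roughly like this -- a sketch with plain int keys and my own naming, nothing PostgreSQL-specific:

/*
 * Sketch only: branch-free "lower bound" binary search with explicit
 * prefetch, loosely following Listing 2 of the paper.  Assumes n >= 1.
 * Returns the index of the first element >= x, or n if there is none.
 */
static int
branchfree_search_prefetch(const int *a, int n, int x)
{
    const int *base = a;

    while (n > 1)
    {
        int half = n / 2;

        /* prefetch the midpoints of both possible next halves */
        __builtin_prefetch(&base[half / 2], 0, 0);
        __builtin_prefetch(&base[half + half / 2], 0, 0);

        /* branchless step: the compiler should emit a conditional move */
        base = (base[half] < x) ? &base[half] : base;
        n -= half;
    }
    return (int) (base - a) + (*base < x);
}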
[1]: https://arxiv.org/pdf/1509.05053.pdf
--
Thomas Munro
http://www.enterprisedb.com
On Thu, Aug 30, 2018 at 2:40 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
A related topic is the cache-unfriendliness of traditional binary
searches of sorted data. Check out "Array Layouts for
Comparison-Based Searching"[1] if you haven't already. It says that
if your array fits in L2 cache, as our btree pages do, then binary
search still wins against Eytzinger et al, which I was a bit
disappointed about when I was (casually) investigating new layouts
based on some comments from a fellow hacker (I can't remember if it
was Andres or Peter G who mentioned this topic to me).
It might have been both of us, at different times.
However, the
paper is talking about "branch-free binary search with explicit
prefetch", which apparently eats plain old branchy binary search for
breakfast, and I gather they tested on a lot of hardware. That sounds
interesting to look into.
I think that the main win will come from stacking a bunch of different
optimizations on top of each other, including key normalization,
automatic prefix compression on internal pages (something that's a lot
easier than on leaf pages [1]), and abbreviated keys in the ItemId
array of nbtree pages [2] (even 15-bit abbreviated keys can have a lot
of entropy when combined with these other techniques). Once we have
all that in place, I expect drastically fewer iCache misses, making
other optimizations interesting.
I'm bullish on the idea of using something like an unrolled binary
search, or an interpolation search in the long term. In the meantime,
we have too many iCache misses. As a datapoint, iCache misses
reportedly dominate the overhead of TPC-C and TPC-E [3]. Nestloop
joins with inner index scans are dominant within those benchmarks.
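Roughly, the abbreviated-key idea means a binary search step like the sketch below (the names and layout are made up; the point is that only a tie on the abbreviation has to touch the tuple at all):

/*
 * Sketch only: binary search over per-item abbreviated keys, falling back
 * to the full comparison only when the abbreviations tie.  "abbrevs" is a
 * hypothetical order-preserving 15-bit abbreviation stored per line
 * pointer; full_compare() stands for the expensive, cache-missing
 * comparison against the actual index tuple.
 */
static int
search_with_abbrevs(const uint16 *abbrevs, int nitems, uint16 probe_abbrev,
                    int (*full_compare) (int index, void *arg), void *arg)
{
    int low = 0;
    int high = nitems;

    while (high > low)
    {
        int mid = low + (high - low) / 2;
        int cmp;

        if (probe_abbrev != abbrevs[mid])
            cmp = (probe_abbrev < abbrevs[mid]) ? -1 : 1;
        else
            cmp = full_compare(mid, arg);   /* rare: abbreviations tied */

        if (cmp > 0)
            low = mid + 1;
        else
            high = mid;
    }
    return low;
}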
[1]: https://wiki.postgresql.org/wiki/Key_normalization#Using_existing_.22minus_infinity.22_index_tuple_as_a_low_key
[2]: /messages/by-id/CAH2-Wz=mV4dmOaPFicRSyNtv2KinxEOtBwUY5R7fXXOC-OearA@mail.gmail.com
[3]: https://openproceedings.org/2013/conf/edbt/TozunPKJA13.pdf ("6.1.2 Core stalls")
--
Peter Geoghegan
Hello!
On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
A related topic is the cache-unfriendliness of traditional binary
searches of sorted data. Check out "Array Layouts for
Comparison-Based Searching"[1] if you haven't already. It says that
if your array fits in L2 cache, as our btree pages do, then binary
search still wins against Eytzinger et al, which I was a bit
disappointed
I've re-read that paper. Their results are not about our case, here's a quote:
For arrays small enough _to be kept_ in L2 cache, the branch-free binary search code listed in Listing 2 is the fastest algorithm
Surely, we cannot keep all pages in L2. The Eytzinger layout could be a realistic possibility, except for one simple problem: I do not see how we can insert items into this layout.
Also, that research is quite distant from B-tree binsearch: their data items are small and are not references to actual data. Eytzinger exploits the fact that the two possible future keys share the same cache line. But that is hard to guarantee, given that we place tuples at fairly arbitrary positions on the page.
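For reference, the Eytzinger search from the paper is roughly this (plain ints, 1-based indexing, sketch only):

/*
 * Sketch only: lower-bound search over an Eytzinger-ordered array "a",
 * 1-based (a[1] is the root, children of a[k] are a[2k] and a[2k+1]).
 * Because the two possible next keys are adjacent, one cache miss (or one
 * prefetch) covers both of them.  Returns the Eytzinger index of the first
 * element >= x, or 0 if there is none.
 */
static int
eytzinger_lower_bound(const int *a, int n, int x)
{
    int k = 1;

    while (k <= n)
    {
        if (2 * k <= n)
            __builtin_prefetch(&a[2 * k], 0, 0);    /* both children share a line */
        k = 2 * k + (a[k] < x);
    }
    k >>= __builtin_ffs(~k);    /* back up to the last "go left" turn */
    return k;
}

On a B-tree page the search goes through line pointers to tuples scattered over the page, so that adjacency is exactly what we do not get for free.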
Best regards, Andrey Borodin.
On Mon, Oct 15, 2018 at 12:03 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
A related topic is the cache-unfriendliness of traditional binary
searches of sorted data. Check out "Array Layouts for
Comparison-Based Searching"[1] if you haven't already. It says that
if your array fits in L2 cache, as our btree pages do, then binary
search still wins against Eytzinger et al, which I was a bit
disappointed
I've re-read that paper. Their results are not about our case, here's a quote:
For arrays small enough _to be kept_ in L2 cache, the branch-free binary search code listed in Listing 2 is the fastest algorithm
Surely, we cannot keep all pages in L2. The Eytzinger layout could be a realistic possibility, except for one simple problem: I do not see how we can insert items into this layout.
I don't know. Aren't we talking about binary search within one 8KB
page here, anyway?
Thinking about some other ideas for cache prefetching in btree code,
here's an idea to improve pipelining of index scans (a bit like the
two-step pipeline idea from the hash join prefetch thread). We know
the TIDs of heap tuples we'll be looking up in the near future, so can
we prefetch all the way to the tuple? There are three random
accesses that follow soon after reading an index tuple:
BufTableLookup(), then page header, then the tuple itself. At the
logical extreme, we could anticipate the probe of SharedBufHash for
the TID from index tuple i + 3 (perhaps dynahash could expose a
hash_prefetch_for_hash() facility), and anticipate the read of the
page header for the tuple pointed to by index tuple i + 2 (perhaps a
dirty read of SharedBufHash that is only prepared to look at the first
entry in the bucket would be OK for this), and anticipate the read of
the tuple for the TID from index tuple i + 1 (perhaps a dirty read of
the page item table). Or some less ambitious subset, if that pipeline
turns out to be too deep; it seems like the simplest experiment would
be to try prefetching just SharedBufHash lookups. This may all be
completely crazy, I haven't tried any of it.
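For what it's worth, the simplest experiment could be sketched like this; PrefetchBufferMappingEntry() is entirely hypothetical -- it stands for "hash the buffer tag and __builtin_prefetch that SharedBufHash bucket", roughly what a dynahash hash_prefetch_for_hash() would have to do:

/*
 * Sketch only.  "items"/"nitems" stand for the matching index tuples
 * already read from the current leaf page; PrefetchBufferMappingEntry()
 * does not exist.
 */
for (int i = 0; i < nitems; i++)
{
    if (i + 1 < nitems)
        PrefetchBufferMappingEntry(heapRel,
                                   ItemPointerGetBlockNumber(&items[i + 1].heapTid));

    /* ... look up the buffer and fetch the heap tuple for items[i] as usual ... */
}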
Also, that research is quite distant from B-tree binsearch: their data items are small and are not references to actual data. Eytzinger exploits the fact that the two possible future keys share the same cache line. But that is hard to guarantee, given that we place tuples at fairly arbitrary positions on the page.
Based on the above quotes, I'm not suggesting we should use
Eytzinger for search-within-page. But out of curiosity, I checked,
and saw that you can actually get a pair of index tuples holding a
six-character text key or an integer key into a 64 byte cache line.
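(Back-of-the-envelope, assuming MAXALIGN is 8: an index tuple is an 8-byte IndexTupleData header plus the key, so an int4 key and a six-character text key with a 1-byte varlena header both round up to 16-byte tuples, i.e. a pair is only 32 bytes.)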
--
Thomas Munro
http://www.enterprisedb.com
On 15 Oct 2018, at 2:38, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Mon, Oct 15, 2018 at 12:03 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
A related topic is the cache-unfriendliness of traditional binary
searches of sorted data. Check out "Array Layouts for
Comparison-Based Searching"[1] if you haven't already. It says that
if your array fits in L2 cache, as our btree pages do, then binary
search still wins against Eytzinger et al, which I was a bit
disappointed
I've re-read that paper. Their results are not about our case, here's a quote:
For arrays small enough _to be kept_ in L2 cache, the branch-free binary search code listed in Listing 2 is the fastest algorithm
Surely, we cannot keep all pages in L2. The Eytzinger layout could be a realistic possibility, except for one simple problem: I do not see how we can insert items into this layout.
I don't know. Aren't we talking about binary search within one 8KB
page here, anyway?
The paper discusses multiple searches inside one 8KB region, while we do a single search in an entirely uncached page and then move on. That's the difference.
Binary search is optimal if the page is already in L1 before we start the search.
Thinking about some other ideas for cache prefetching in btree code,
here's an idea to improve pipelining of index scans (a bit like the
two-step pipeline idea from the hash join prefetch thread). We know
the TIDs of heap tuples we'll be looking up in the near future, so can
we prefetch all the way to the tuple? There are three random
accesses that follow soon after reading an index tuple:
BufTableLookup(), then page header, then the tuple itself. At the
logical extreme, we could anticipate the probe of SharedBufHash for
the TID from index tuple i + 3 (perhaps dynahash could expose a
hash_prefetch_for_hash() facility), and anticipate the read of the
page header for the tuple pointed to by index tuple i + 2 (perhaps a
dirty read of SharedBufHash that is only prepared to look at the first
entry in the bucket would be OK for this), and anticipate the read of
the tuple for the TID from index tuple i + 1 (perhaps a dirty read of
the page item table). Or some less ambitious subset, if that pipeline
turns out to be too deep; it seems like the simplest experiment would
be to try prefetching just SharedBufHash lookups. This may all be
completely crazy, I haven't tried any of it.
This idea also looks good to me. One thing bothers me: if we do __builtin_prefetch(&a[b[c[d]]], 0, 1), and a, b, c and d are all not in cache, will we incur CPU wait time for fetching a, b and c?
The folks in [0] (http://www.vldb.org/pvldb/vol11/p1702-jonathan.pdf) exploit C++ coroutines for prefetching this kind of structure, but it seems like too much hassle.
Also, that research is quite distant from B-tree binsearch: their data items are small and are not references to actual data. Eytzinger exploits the fact that the two possible future keys share the same cache line. But that is hard to guarantee, given that we place tuples at fairly arbitrary positions on the page.
Based on the above quotes, I'm not suggesting we should use
Eytzinger for search-within-page. But out of curiosity, I checked,
and saw that you can actually get a pair of index tuples holding a
six-character text key or an integer key into a 64 byte cache line.
Well, this shows that in theory Eytzinger is an option.
Best regards, Andrey Borodin.
Hi, Thomas!
On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
I've implemented all of the strategies used in that paper.
On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples are mostly "appended", i.e. they are usually accommodated at the end of the free space.
The idea of the attached patch is to try reordering tuples according to different strategies. This reordering happens during VACUUM and can easily be broken by a single insert.
Strategies are switched by #define:
1. USE_SEQ_ORDER - just order tuples in the order of their line pointers. This strategy has no particular idea behind it and is expected to work similarly to the baseline (no code changes at all)
2. USE_EYZINGER_ORDER - depth-first traversal of the binary-search tree. If you always choose the lower branch, your search is just a sequential read of bytes.
3. USE_VEB_ORDER - van Emde Boas layout: try to pack 3 tuples (mid and the two sub-mids) together. The key idea is that you take a cache miss when reading mid, but the tuples for both possible next steps are already fetched by that miss.
4. USE_BT_ORDER - place the two sub-mids together so that both possible next mids can be fetched with a single cache prefetch.
Cache prefetches of B-tree binsearch are controlled by USE_PREFETCH.
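As a small worked example (my own trace of the recursion): for a page whose data offsets are 1..7, the Eytzinger strategy lays the tuples out physically in the order 4, 2, 1, 3, 6, 5, 7, so a search that always takes the lower branch touches positions 1, 2, 3 of that order, i.e. reads the page sequentially.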
I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, MacOS 10.14.1)
./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres
No changes in configs.
Here are results in tps:
without prefetch
baseline 1448.331199
seq 1433.737204
bt 1463.701527
veb 1457.586089
eyz 1483.654765
with prefetch
baseline 1486.585895
seq 1471.757042
bt 1480.169967
veb 1464.834678
eyz 1460.323392
These numbers are not statistically sound: only one run was done for each configuration, and the experiment needs tuning anyway.
From this I can conclude:
1. It is hard to observe a significant performance impact with pgbench. Maybe I should use some more representative B-tree performance tests?
2. Cache prefetches seem to increase performance even without any layout strategy, but the difference is a hair-splitting 2.6%.
3. For the implemented strategies, my preliminary results correspond to the results in the paper: the Eytzinger layout is the best.
4. Among the layout strategies with prefetch, the bt-order strategy seems to be the winner.
How can I improve experimental evaluation of these layout strategies?
Other thoughts? I'd be happy to hear any comment on this.
Best regards, Andrey Borodin.
Attachments:
0001-Implement-different-B-tree-page-layouts.patch (application/octet-stream)
From ddccbe9247161cc280e2eec058e3618d46fa2e58 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 6 Nov 2018 16:51:35 +0500
Subject: [PATCH] Implement different B-tree page layouts
---
src/backend/access/nbtree/nbtree.c | 212 ++++++++++++++++++++++++++
src/backend/access/nbtree/nbtsearch.c | 19 +++
src/backend/storage/page/bufpage.c | 26 ++++
src/include/storage/bufpage.h | 1 +
4 files changed, 258 insertions(+)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..67271df1a6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1089,6 +1089,210 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*oldestBtpoXact = vstate.oldestBtpoXact;
}
+static void
+_bt_sequential_layout(OffsetNumber* order, int* next, OffsetNumber low, OffsetNumber high)
+{
+ while (high > low)
+ {
+ order[*next] = low;
+ (*next)++;
+ low++;
+ }
+}
+
+static void
+_bt_metabtree_recursion(OffsetNumber* order, int* next, OffsetNumber low, OffsetNumber high)
+{
+ if (high > low)
+ {
+ OffsetNumber low1, high1, mid_low, mid_high;
+
+ OffsetNumber mid = low + ((high - low) / 2);
+ low1 = mid + 1;
+ high1 = mid;
+ mid_low = low + ((high1 - low) / 2);
+ mid_high = low1 + ((high - low1) / 2);
+ /* mid is already packed! */
+
+ if (mid_low != mid)
+ {
+ order[*next] = mid_low;
+ (*next)++;
+ }
+ if (mid_high != mid && mid_high != high)
+ {
+ order[*next] = mid_high;
+ (*next)++;
+ }
+
+ _bt_metabtree_recursion(order, next, low, high1);
+ _bt_metabtree_recursion(order, next, low1, high);
+
+ /* left here for reference
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;*/
+ }
+}
+
+static void
+_bt_metabtree_layout(OffsetNumber* order, int* next, OffsetNumber low, OffsetNumber high)
+{
+ if (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ order[*next] = mid;
+ (*next)++;
+
+ _bt_metabtree_recursion(order, next, low, high);
+
+ /* left here for reference
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;*/
+ }
+}
+
+static void
+_bt_eyzinger_recursion(OffsetNumber* order, int* next, OffsetNumber low, OffsetNumber high)
+{
+ if (high > low)
+ {
+ OffsetNumber mid = low + ((high - low) / 2);
+
+ order[*next] = mid;
+ (*next)++;
+
+ _bt_eyzinger_recursion(order, next, low, mid);
+ _bt_eyzinger_recursion(order, next, mid + 1, high);
+
+ /* left here for reference
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;*/
+ }
+}
+
+static void
+_bt_veb_recursion(OffsetNumber* order, int* next, OffsetNumber low, OffsetNumber high)
+{
+ if (high > low)
+ {
+ OffsetNumber low1, high1, mid_low, mid_high;
+
+ OffsetNumber mid = low + ((high - low) / 2);
+ low1 = mid + 1;
+ high1 = mid;
+ mid_low = low + ((high1 - low) / 2);
+ mid_high = low1 + ((high - low1) / 2);
+
+ order[*next] = mid;
+ (*next)++;
+ if (mid_low != mid)
+ {
+ order[*next] = mid_low;
+ (*next)++;
+ }
+ if (mid_high != high)
+ {
+ order[*next] = mid_high;
+ (*next)++;
+ }
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ _bt_veb_recursion(order, next, low, mid_low);
+ _bt_veb_recursion(order, next, mid_low + 1, mid);
+ _bt_veb_recursion(order, next, mid + 1, mid_high);
+ _bt_veb_recursion(order, next, mid_high + 1, high);
+
+ /* left here for reference
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;*/
+ }
+}
+
+static bool
+_bt_check_layout(Page page, OffsetNumber* order)
+{
+ char busy[MaxOffsetNumber];
+ OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+ for (int i = 0; i <= maxoff; i++)
+ {
+ busy[i] = 0;
+ }
+
+ for (int i = 0; i < maxoff; i++)
+ {
+ OffsetNumber current = order[i];
+ if (current > maxoff || current == InvalidOffsetNumber)
+ {
+ elog(ERROR,"Page layout is broken: incorrect offset number %u at %i", current, i);
+ }
+ if (busy[current])
+ {
+ elog(ERROR,"Page layout is broken: offset number %u is used more than once at %i", current, i);
+ }
+ busy[current] = 1;
+ }
+ for (int i = FirstOffsetNumber; i <= maxoff; i++)
+ {
+ if (!busy[i])
+ {
+ elog(ERROR,"Page layout is broken: offset number %u is not used", i);
+ }
+ }
+ return true;
+}
+
+#define USE_EYZINGER_ORDER
+static bool
+_bt_prepare_layout(Page page, OffsetNumber* order)
+{
+ BTPageOpaque opaque;
+ OffsetNumber low,
+ high;
+ int next = 0;
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ /* check if there is something to defrag*/
+ if (high < low)
+ return false;
+
+ high++; /* establish the loop invariant for high */
+
+#ifdef USE_BT_ORDER
+ _bt_metabtree_layout(order, &next, low, high);
+#elif defined(USE_VEB_ORDER)
+ _bt_veb_recursion(order, &next, low, high);
+#elif defined(USE_EYZINGER_ORDER)
+ _bt_eyzinger_recursion(order, &next, low, high);
+#elif defined(USE_SEQ_ORDER)
+ _bt_sequential_layout(order, &next, low, high);
+#endif
+
+ if (!P_RIGHTMOST(opaque))
+ {
+ order[next] = P_HIKEY;
+ next++;
+ }
+
+ Assert(next == PageGetMaxOffsetNumber(page));
+ Assert(_bt_check_layout(page, order));
+
+ return true;
+}
+
/*
* btvacuumpage --- VACUUM one page
*
@@ -1114,6 +1318,8 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
Page page;
BTPageOpaque opaque = NULL;
+ OffsetNumber order[MaxOffsetNumber];
+
restart:
delete_now = false;
recurse_to = P_NONE;
@@ -1350,7 +1556,13 @@ restart:
/* pagedel released buffer, so we shouldn't */
}
else
+ {
+#if defined(USE_BT_ORDER) || defined(USE_VEB_ORDER) || defined(USE_EYZINGER_ORDER) || defined(USE_SEQ_ORDER)
+ _bt_prepare_layout(page, order);
+ PageMakeSpecialFragmentation(page, order);
+#endif
_bt_relbuf(rel, buf);
+ }
/*
* This is really tail recursion, but if the compiler is too stupid to
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8b2772c154..e7c5308349 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -399,6 +399,25 @@ _bt_binsrch(Relation rel,
{
OffsetNumber mid = low + ((high - low) / 2);
+#define USE_EYZINGER_ORDER
+#define USE_PREFETCH
+
+#ifdef USE_PREFETCH
+#ifdef USE_BT_ORDER
+ /* in this case we only need one prefetch */
+ OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
+ if (x < high)
+ __builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+#else
+ OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
+ if (x < high)
+ __builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+ x = low + ((mid - low) / 2);
+ if (x > low)
+ __builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+#endif
+#endif
+
/* We have low <= mid < high, so mid points at a real slot */
result = _bt_compare(rel, keysz, scankey, page, mid);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dfbda5458f..e3f553478c 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -375,6 +375,32 @@ PageGetTempPageCopy(Page page)
return temp;
}
+void
+PageMakeSpecialFragmentation(Page page, uint16 *order)
+{
+ Page temp = PageGetTempPageCopy(page);
+
+ PageHeader phdr = (PageHeader) page;
+ Offset upper;
+ int i;
+ int nitems = PageGetMaxOffsetNumber(page);
+
+ upper = phdr->pd_special;
+ for (i = nitems - 1; i >= 0; i--)
+ {
+ ItemId lp = PageGetItemId(page, order[i]);
+ upper -= MAXALIGN(ItemIdGetLength(lp));
+ memmove((char *) page + upper,
+ (char *) temp + ItemIdGetOffset(lp),
+ MAXALIGN(ItemIdGetLength(lp)));
+ lp->lp_off = upper;
+ }
+
+ phdr->pd_upper = upper;
+
+ pfree(temp);
+}
+
/*
* PageGetTempPageCopySpecial
* Get a temporary page in local memory for special processing.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 85dd10c45a..c6b31fa3e8 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -434,5 +434,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageMakeSpecialFragmentation(Page page, uint16 *order);
#endif /* BUFPAGE_H */
--
2.17.2 (Apple Git-113)
On Tue, Nov 27, 2018 at 5:43 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
[1] https://arxiv.org/pdf/1509.05053.pdf
I've implemented all of the strategies used in that paper.
On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples are mostly "appended", i.e. they are usually accommodated at the end of the free space.
The idea of the attached patch is to try reordering tuples according to different strategies. This reordering happens during VACUUM and can easily be broken by a single insert.
Strategies are switched by #define:
1. USE_SEQ_ORDER - just order tuples in the order of their line pointers. This strategy has no particular idea behind it and is expected to work similarly to the baseline (no code changes at all)
2. USE_EYZINGER_ORDER - depth-first traversal of the binary-search tree. If you always choose the lower branch, your search is just a sequential read of bytes.
3. USE_VEB_ORDER - van Emde Boas layout: try to pack 3 tuples (mid and the two sub-mids) together. The key idea is that you take a cache miss when reading mid, but the tuples for both possible next steps are already fetched by that miss.
4. USE_BT_ORDER - place the two sub-mids together so that both possible next mids can be fetched with a single cache prefetch.
Cache prefetches of B-tree binsearch are controlled by USE_PREFETCH.
I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, MacOS 10.14.1)
./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres
No changes in configs.
Here are results in tps:
without prefetch
baseline 1448.331199
seq 1433.737204
bt 1463.701527
veb 1457.586089
eyz 1483.654765
with prefetch
baseline 1486.585895
seq 1471.757042
bt 1480.169967
veb 1464.834678
eyz 1460.323392
These numbers are not statistically sound: only one run was done for each configuration, and the experiment needs tuning anyway.
From this I can conclude:
1. It is hard to observe a significant performance impact with pgbench. Maybe I should use some more representative B-tree performance tests?
2. Cache prefetches seem to increase performance even without any layout strategy, but the difference is a hair-splitting 2.6%.
3. For the implemented strategies, my preliminary results correspond to the results in the paper: the Eytzinger layout is the best.
4. Among the layout strategies with prefetch, the bt-order strategy seems to be the winner.
How can I improve experimental evaluation of these layout strategies?
Other thoughts? I'd be happy to hear any comment on this.
What about read-only tests with prewarmed indexes? Those numbers look IO bound.
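Something along these lines, perhaps (pg_prewarm on the pgbench index, then the select-only built-in script):

create extension pg_prewarm;
select pg_prewarm('pgbench_accounts_pkey');
-- then: ./pgbench -S -T 100 postgres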
This is focusing on prefetch within btree code, but I wonder if there
is some lower hanging fruit that doesn't require any layout changes:
heap prefetch during index scans. By analogy to figure 8a
"software-pipelined prefetching" of [1]http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf, I wonder if you could build a
pipeline using dirty (unlocked) memory reads, like this:
1. Prefetch buffer mapping hash bucket for heap_tids[n + 3]
2. Prefetch page header for heap_tids[n + 2] using a dirty read of
the first value in the buffer mapping hash bucket
3. Prefetch heap tuple for tids[n + 1] using a dirty read of the heap page
4. Return heap tid for heap_tids[n] to caller
If you're lucky, by the time the caller uses the tid to the find the
buffer, read the line pointer and access the tuple, all three things
will be in cache and hopefully won't have changed since the 'dirty'
reads. Prefetching the wrong cachelines would be possible obviously,
if there is a buffer mapping hash table collision and the one you
wanted isn't first, or stuff just changed. Maybe there aren't enough
cachelines for this 3-level pipeline to work, considering that in
between random other executor nodes could run, so perhaps just a
2-level or 1-level pipeline would pay off. As I showed in my toy hash
join prefetch experiment[2], I could get 20-30% speedup from
prefetching just hash buckets (equivalent to prefetching buffer
mapping hash buckets), and then a further 5-10% speedup from
prefetching the first item in the hash bucket, but I expect nested
hash joins to interfere with that as they compete for cache lines. I
suppose I could try to do three levels and fetch the next tuple in the
chain, but that seems unlikely to help because we're getting into
lower probabilities that I'd even want that data; in contrast,
the index scan -> heap scan case *definitely* needs to go through a
line pointer to a tuple somewhere on the page so 3-level might be
worth experimenting with. Just a thought, I have not tried any of
this. Maybe there are really 4 levels, if you count the buffer
descriptor/lock.
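In pseudo-C the pipeline might look roughly like the sketch below; every helper in it is hypothetical and just stands for "compute the address cheaply, without locks, and __builtin_prefetch it":

/*
 * Sketch only -- PrefetchBufferMappingBucket(), DirtyBufferLookup() and
 * PrefetchTupleViaDirtyItemId() are hypothetical placeholders, and the
 * dirty reads may well prefetch the wrong cache lines.
 */
for (int n = 0; n < ntids; n++)
{
    if (n + 3 < ntids)
        PrefetchBufferMappingBucket(rel, &tids[n + 3]);     /* step 1 */

    if (n + 2 < ntids)
    {
        Buffer  buf = DirtyBufferLookup(rel, &tids[n + 2]); /* unlocked guess */

        if (BufferIsValid(buf))
            __builtin_prefetch(BufferGetPage(buf), 0, 1);   /* step 2: page header */
    }

    if (n + 1 < ntids)
        PrefetchTupleViaDirtyItemId(rel, &tids[n + 1]);     /* step 3 */

    /* step 4: hand tids[n] to the caller, hopefully with everything warm */
}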
[1]: http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf
[2]: /messages/by-id/CA+hUKGKB=EWq+rj4NK8CX6ZqHZzuQYWSb9iDfYKzGN6-ejShUQ@mail.gmail.com
--
Thomas Munro
https://enterprisedb.com