Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Started by Peter Geoghegan almost 6 years ago. 30 messages.
3 attachment(s)

Andres and I discussed bottlenecks in the nbtree code during the
recent PgDay SF community event. Andres observed that the call to
BTreeTupleGetNAtts() in _bt_compare() can become somewhat of a
bottleneck. I came up with the very simple attached POC-quality
patches, which Andres tested and profiled with his original complaint
in mind yesterday. He reported increased throughput on a memory-bound
simple pgbench SELECT-only workload.

He reported that the problem went away with the patches applied. The
following pgbench SELECT-only result was sent to me privately:

before:
single: tps = 30300.202363 (excluding connections establishing)
all cores: tps = 1026853.447047 (excluding connections establishing)

after:
single: tps = 31120.452919 (excluding connections establishing)
all cores: tps = 1032376.361006 (excluding connections establishing)

(Actually, he tested something slightly different -- he inlined
_bt_compare() in his own way instead of using my 0002-*, and didn't
use the 0003-* optimization at all.)

Apparently this was a large multi-socket machine. Those are hard to
come by.

The main idea here is to make _bt_compare() delay
calling BTreeTupleGetNAtts() until the point after the first attribute
turns out to be equal to the _bt_compare() caller's insertion scankey.
Many or most calls to _bt_compare() won't even need to call
BTreeTupleGetNAtts().

This relies on the assumption that any tuple must have at least one
untruncated attribute in the _bt_compare() loop. It doesn't matter
whether it's a pivot or non-pivot tuple -- the representation of the
first column will be exactly the same.

The negative infinity item on an internal page always has zero
attributes, which might seem like a snag. However, we already treat
that as a special case within _bt_compare(), for historical reasons
(pg_upgrade'd indexes won't have the number of attributes explicitly
set to zero in some cases).
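
For reference, the special case in question is this existing test at
the top of _bt_compare() (the same check that 0003-* moves, below):

    /*
     * Force result ">" if target item is first data item on an internal page
     * --- see NOTE above.
     */
    if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
        return 1;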

Another separate though related idea in 0003-* is to move the
negative infinity check from _bt_compare() to _bt_binsrch(), since it
isn't something that we need to consider more than once per
page -- and only on internal pages. That way, _bt_compare() doesn't
have to look at the page special area to check if it's a leaf page or
an internal page at all. I haven't really profiled this just yet. This is
one of those patches where 95%+ of the work is profiling and
benchmarking.

Andres and I both agree that there is a lot more work to be done in
this area, but that will be a major undertaking. I am quite keen on
the idea of repurposing the redundant-to-nbtree ItemId.lp_len field to
store an abbreviated key. Making that work well is a considerable
undertaking, since you need to use prefix compression to get a high
entropy abbreviated key. It would probably take me the best part of a
whole release cycle to write such a patch. The attached patches get
us a relatively easy win in the short term, though.
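
To sketch the abbreviated key idea (purely hypothetical -- the helper
and its scan_abbrev argument are made up, and it assumes an
order-preserving abbreviation, so that only ties need the full
comparison):

    static inline int32
    _bt_compare_abbrev(Relation rel, BTScanInsert key, uint16 scan_abbrev,
                       Page page, OffsetNumber offnum)
    {
        /* lp_len hypothetically repurposed to hold a 15-bit abbreviated key */
        uint16      tup_abbrev = ItemIdGetLength(PageGetItemId(page, offnum));

        if (scan_abbrev != tup_abbrev)
            return (scan_abbrev < tup_abbrev) ? -1 : 1;

        /* Abbreviated keys tie -- fall back to the full comparison */
        return _bt_compare(rel, key, page, offnum);
    }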

Thoughts?
--
Peter Geoghegan

Attachments:

v1-0001-Avoid-calling-BTreeTupleGetNAtts-in-_bt_compare.patch (application/octet-stream)
From 853c47f71a8c8c52236bea5786048b3ff2b96b66 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 18:00:38 -0800
Subject: [PATCH v1 1/3] Avoid calling BTreeTupleGetNAtts() in _bt_compare().

---
 src/backend/access/nbtree/nbtsearch.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c573814f01..96119949a0 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -560,9 +560,9 @@ _bt_compare(Relation rel,
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
 	ItemPointer heapTid;
+	int			ski;
 	ScanKey		scankey;
-	int			ncmpkey;
-	int			ntupatts;
+	int			ntupatts = 0;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -576,7 +576,6 @@ _bt_compare(Relation rel,
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -588,12 +587,17 @@ _bt_compare(Relation rel,
 	 * We don't test for violation of this condition here, however.  The
 	 * initial setup for the index scan had better have gotten it right (see
 	 * _bt_first).
+	 *
+	 * We rely on the assumption that every tuple has at least one
+	 * non-truncated attribute from here on.  Delaying our call to
+	 * BTreeTupleGetNAtts() till the end of the first loop iteration has been
+	 * shown to increase throughput on memory-bound workloads.
 	 */
 
-	ncmpkey = Min(ntupatts, key->keysz);
-	Assert(key->heapkeyspace || ncmpkey == key->keysz);
+	Assert(BTreeTupleGetNAtts(itup, rel) >= key->keysz);
+	ski = 1;
 	scankey = key->scankeys;
-	for (int i = 1; i <= ncmpkey; i++)
+	for (;;)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -641,7 +645,13 @@ _bt_compare(Relation rel,
 		if (result != 0)
 			return result;
 
+		ski++;
 		scankey++;
+		if (!ntupatts)
+			ntupatts = BTreeTupleGetNAtts(itup, rel);
+
+		if (ski > key->keysz || ski > ntupatts)
+			break;
 	}
 
 	/*
-- 
2.17.1

v1-0003-Remove-negative-infinity-check-from-_bt_compare.patch (application/octet-stream)
From f87a2d751c3ee3a3827336f1fa176d493c2cdd09 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 20:59:57 -0800
Subject: [PATCH v1 3/3] Remove "negative infinity" check from _bt_compare().

---
 src/backend/access/nbtree/nbtsearch.c | 32 ++++++++++++++++++---------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fdc1dab114..cfde241809 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -348,6 +348,8 @@ _bt_binsrch(Relation rel,
 				high;
 	int32		result,
 				cmpval;
+	bool		isleaf;
+	OffsetNumber noff;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -359,6 +361,7 @@ _bt_binsrch(Relation rel,
 
 	low = P_FIRSTDATAKEY(opaque);
 	high = PageGetMaxOffsetNumber(page);
+	isleaf = P_ISLEAF(opaque);
 
 	/*
 	 * If there are no keys on the page, return the first available slot. Note
@@ -386,13 +389,20 @@ _bt_binsrch(Relation rel,
 
 	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
+	if (isleaf)
+		noff = InvalidOffsetNumber;
+	else
+		noff = P_FIRSTDATAKEY(opaque);
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare_inl(rel, key, page, mid);
+		if (mid == noff)
+			result = 1;
+		else
+			result = _bt_compare_inl(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -407,7 +417,7 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (isleaf)
 		return low;
 
 	/*
@@ -536,6 +546,16 @@ _bt_compare(Relation rel,
 			Page page,
 			OffsetNumber offnum)
 {
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * Force result ">" if target item is first data item on an internal page
+	 * --- see NOTE above.
+	 */
+	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+		return 1;
+
+
 	return _bt_compare_inl(rel, key, page, offnum);
 }
 
@@ -568,7 +588,6 @@ _bt_compare_inl(Relation rel,
 				OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
 	ItemPointer heapTid;
 	int			ski;
@@ -579,13 +598,6 @@ _bt_compare_inl(Relation rel,
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 	Assert(key->heapkeyspace || key->scantid == NULL);
 
-	/*
-	 * Force result ">" if target item is first data item on an internal page
-	 * --- see NOTE above.
-	 */
-	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
-		return 1;
-
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
 	/*
-- 
2.17.1

v1-0002-Inline-_bt_compare.patch (application/octet-stream)
From b50fe76d0c1d1bee4fffd6fc74edb4d5dc738139 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 20:35:04 -0800
Subject: [PATCH v1 2/3] Inline _bt_compare().

---
 src/backend/access/nbtree/nbtsearch.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 96119949a0..fdc1dab114 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,6 +26,8 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static inline int32 _bt_compare_inl(Relation rel, BTScanInsert key, Page page,
+									OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -298,7 +300,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare_inl(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -390,7 +392,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_inl(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -499,7 +501,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_inl(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -528,6 +530,15 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+int32
+_bt_compare(Relation rel,
+			BTScanInsert key,
+			Page page,
+			OffsetNumber offnum)
+{
+	return _bt_compare_inl(rel, key, page, offnum);
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -550,11 +561,11 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  * key.  See backend/access/nbtree/README for details.
  *----------
  */
-int32
-_bt_compare(Relation rel,
-			BTScanInsert key,
-			Page page,
-			OffsetNumber offnum)
+static inline int32
+_bt_compare_inl(Relation rel,
+				BTScanInsert key,
+				Page page,
+				OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-- 
2.17.1

#2 Floris Van Nee
florisvannee@Optiver.com
In reply to: Peter Geoghegan (#1)
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

He reported that the problem went away with the patches applied. The
following pgbench SELECT-only result was sent to me privately:

before:
single: tps = 30300.202363 (excluding connections establishing)
all cores: tps = 1026853.447047 (excluding connections establishing)

after:
single: tps = 31120.452919 (excluding connections establishing)
all cores: tps = 1032376.361006 (excluding connections establishing)

(Actually, he tested something slightly different -- he inlined
_bt_compare() in his own way instead of using my 0002-*, and didn't use the
0003-* optimization at all.)

Apparently this was a large multi-socket machine. Those are hard to come by.

I could do some tests with the patch on some larger machines. What exact tests do you propose? Are there some specific postgresql.conf settings and pgbench initialization you recommend for this? And was the test above just running 'pgbench -S' select-only with specific -T, -j and -c parameters?

-Floris

#3 Andres Freund
andres@anarazel.de
In reply to: Floris Van Nee (#2)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Hi,

On 2020-01-27 15:42:06 +0000, Floris Van Nee wrote:

He reported that the problem went away with the patches applied. The
following pgbench SELECT-only result was sent to me privately:

before:
single: tps = 30300.202363 (excluding connections establishing)
all cores: tps = 1026853.447047 (excluding connections establishing)

after:
single: tps = 31120.452919 (excluding connections establishing)
all cores: tps = 1032376.361006 (excluding connections establishing)

(Actually, he tested something slightly different -- he inlined
_bt_compare() in his own way instead of using my 0002-*, and didn't use the
0003-* optimization at all.)

Apparently this was a large multi-socket machine. Those are hard to
come by.

I'd not say "large multi-socket": 2 x Xeon Gold 5215, 192GB RAM.

I could do some tests with the patch on some larger machines. What
exact tests do you propose? Are there some specific postgresql.conf
settings and pgbench initialization you recommend for this? And was
the test above just running 'pgbench -S' select-only with specific -T,
-j and -c parameters?

The above test was IIRC:

PGOPTIONS='-c vacuum_freeze_min_age=0' pgbench -i -q -s 300
with a restart here, and a
SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relkind IN ('r', 'i', 't');
after starting, and then
PGOPTIONS='-c default_transaction_isolation=repeatable\ read' pgbench -n -M prepared -P1 -c100 -j72 -T1000 -S

The freeze, restart & prewarm are to have fairer comparisons between
tests, without needing to recreate the database from scratch.

Greetings,

Andres Freund

#4 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#1)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Hi,

On 2020-01-26 14:49:06 -0800, Peter Geoghegan wrote:

Andres and I discussed bottlenecks in the nbtree code during the
recent PgDay SF community event. Andres observed that the call to
BTreeTupleGetNAtts() in _bt_compare() can become somewhat of a
bottleneck. I came up with the very simple attached POC-quality
patches, which Andres tested and profiled with his original complaint
in mind yesterday. He reported increased throughput on a memory-bound
simple pgbench SELECT-only workload.

Yea - it shows up as a pipeline stall, because the loop depends on
having loaded the tuple, which basically requires two
unlikely-to-be-cached memory loads to complete. Whereas with the
patch a good bit of that latency can be hidden by out-of-order
execution, as e.g. the tupledesc and scankey accesses are not dependent
on the memory load for the tuple having finished.
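
In code terms, the dependency chain on master is:

    itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
                        /* load the ItemId, then the tuple it points to */
    ntupatts = BTreeTupleGetNAtts(itup, rel);
                        /* reads the tuple header -- cannot start until
                         * the (likely) cache miss for itup completes */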

The main idea here is to make _bt_compare() delay
calling BTreeTupleGetNAtts() until the point after the first attribute
turns out to be equal to the _bt_compare() caller's insertion scankey.
Many or most calls to _bt_compare() won't even need to call
BTreeTupleGetNAtts().

FWIW, I still think it might be better to *continue* loading ntupatts
where it's currently done, but keep the change to the loop
termination the way you have it. That way we don't add a branch to check
for ntupatts, and because we don't depend on the result to enter the
loop, we don't have a stall. I'd even keep the Min() bit. I'm a bit
afraid that delaying the load will add one (smaller) stall after the key
comparison, and that the additional branches will be noticeable too.
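
Roughly, the shape I have in mind (a sketch only, with the comparison
logic elided):

    ncmpkey = Min(ntupatts, key->keysz);    /* ntupatts still loaded up front */
    ski = 1;
    scankey = key->scankeys;
    for (;;)
    {
        /* ... compare attribute 'ski' and return if unequal, as now ... */

        ski++;
        scankey++;
        if (ski > ncmpkey)
            break;              /* loop always runs at least one iteration */
    }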

Andres and I both agree that there is a lot more work to be done in
this area, but that will be a major undertaking. I am quite keen on
the idea of repurposing the redundant-to-nbtree ItemId.lp_len field to
store an abbreviated key. Making that work well is a considerable
undertaking, since you need to use prefix compression to get a high
entropy abbreviated key. It would probably take me the best part of a
whole release cycle to write such a patch. The attached patches get
us a relatively easy win in the short term, though.

My intuition is that a binary search optimized layout (next para) will
be a bigger win, and probably easier. There are pretty clear profiling
indicators that even the access to the ItemId array in the binary search
is most of the time a cache miss and causes a stall - and it makes sense
too.

I.e. instead of a plain sorted order, store the n ItemIds on a page
in the order of
[1/2 n, 1/4 n, 3/4 n, 1/8 n, 3/8 n, 5/8 n, 7/8 n, ...]
As binary search looks first at 1/2, then at either 1/4 or 3/4, then
either (1/8 or 3/8) or (5/8 or 7/8), and so on, this layout means that
commonly the first four levels of the ItemId array are on a *single*
cacheline. In contrast, using the normal layout, that is *never* the
case for a page with more than a few entries. And even beyond the
first few levels, the "sub" trees the binary search descends into are
concentrated onto fewer cachelines. It's not just the reduced number of
cachelines touched; the layout also is very prefetchable, because the
tree levels are basically laid out sequentially left to right, which
many CPU prefetchers can recognize.
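
To illustrate with a self-contained sketch (plain ints stand in for
ItemIds, and the helper names are made up):

    /*
     * Permute sorted[0..n-1] into binary-search order: the element probed
     * first lands at layout[0], the two possible second probes at
     * layout[1] and layout[2], and so on, level by level.
     */
    static int
    bsearch_layout(const int *sorted, int *layout, int n, int i, int k)
    {
        if (k <= n)
        {
            i = bsearch_layout(sorted, layout, n, i, 2 * k);     /* left */
            layout[k - 1] = sorted[i++];                         /* this node */
            i = bsearch_layout(sorted, layout, n, i, 2 * k + 1); /* right */
        }
        return i;
    }

    /*
     * Searching is then pure index arithmetic: descending a level always
     * moves rightwards in the array, which prefetchers cope with well.
     */
    static int
    bsearch_layout_find(const int *layout, int n, int needle)
    {
        int         k = 1;

        while (k <= n)
        {
            if (needle == layout[k - 1])
                return k - 1;
            k = (needle > layout[k - 1]) ? 2 * k + 1 : 2 * k;
        }
        return -1;              /* not present */
    }

Calling bsearch_layout(sorted, layout, n, 0, 1) puts the overall median
at layout[0], the two "quartile" entries next, and so on.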

I think this works particularly well for inner btree pages, because we
don't rewrite their itemid lists all that often, so the somewhat higher
cost of that doesn't matter much; similarly, the higher cost of
sequentially iterating isn't significant either.

Now that's only the ItemId array - whereas a larger amount of the cache
misses comes from the index tuple accesses. The nice bit there is that
we can just optimize the order of the index tuples on the page without
any format changes (and even the read access code won't change). I.e. we
can just lay out the tuples in an *approximately* binary search
optimized order, without needing to change anything but the layout
"writing" code, as the ItemId.lp_off indirection will hide that.

I do completely agree that having a small high-entropy abbreviated key
inside the ItemId would be an awesome improvement, as it can entirely
remove many of the hard to predict accesses. My gut feeling is just that
a) it's a pretty darn hard project.
b) it'll be a smaller win as long as there's an unpredictable access to
the abbreviated key

Greetings,

Andres Freund

#5 Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#4)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Mon, Jan 27, 2020 at 9:14 AM Andres Freund <andres@anarazel.de> wrote:

The main idea here is to make _bt_compare() delay
calling BTreeTupleGetNAtts() until the point after the first attribute
turns out to be equal to the _bt_compare() caller's insertion scankey.
Many or most calls to _bt_compare() won't even need to call
BTreeTupleGetNAtts().

FWIW, I still think it might be better to *continue* loading ntupatts
where it's currently done, but keep the change to the loop
termination the way you have it. That way we don't add a branch to check
for ntupatts, and because we don't depend on the result to enter the
loop, we don't have a stall. I'd even keep the Min() bit. I'm a bit
afraid that delaying the load will add one (smaller) stall after the key
comparison, and that the additional branches will be noticeable too.

I can do it that way. I am not attached to the current approach in
0001-* at all.

My intuition is that a binary search optimized layout (next para) will
be a bigger win, and probably easier. There are pretty clear profiling
indicators that even the access to the ItemId array in the binary search
is most of the time a cache miss and causes a stall - and it makes sense
too.

I.e. instead of a plain sorted order, store the n ItemIds on a page
in the order of
[1/2 n, 1/4 n, 3/4 n, 1/8 n, 3/8 n, 5/8 n, 7/8 n, ...]
As binary search looks first at 1/2, then at either 1/4 or 3/4, then
either (1/8 or 3/8) or (5/8 or 7/8), and so on, this layout means that
commonly the first four levels of the ItemId array are on a *single*
cacheline.

You don't really have to convince me of anything here. I see these as
essentially the same project already. I am only really emphasizing the
abbreviated keys thing because it's obviously unbeatable with the
right workload.

Working on B-Tree stuff for v12 really convinced me of the value of an
integrated approach, at least in this area. Everything affects
everything else, so expanding the scope of a project can actually be
really helpful. It's very normal for these optimizations to be worth a
lot more when combined than they are worth individually. I know that
you have had similar experiences in other areas of the code.

I think this works particularly well for inner btree pages, because we
don't rewrite their itemid lists all that often, so the somewhat higher
cost of that doesn't matter much; similarly, the higher cost of
sequentially iterating isn't significant either.

Obviously all of these techniques are only practical because of the
asymmetry between leaf pages and internal pages. Internal pages are
where the large majority of comparisons are performed in most OLTP
workloads, and yet their tuples are often only about one third of one
percent of the total number of tuples in the B-Tree. That is the
specific ratio within the pgbench indexes, IIRC. Having more than one
percent of all tuples come from internal pages is fairly exceptional
-- you only really see it in indexes that are on very long text
strings.

I do completely agree that having a small high-entropy abbreviated key
inside the ItemId would be an awesome improvement, as it can entirely
remove many of the hard to predict accesses. My gut feeling is just that
a) it's a pretty darn hard project.
b) it'll be a smaller win as long as there's an unpredictable access to
the abbreviated key

It will be relatively straightforward to come up with a basic
abbreviated keys prototype that targets one particular data
distribution and index type, though. For example, I can focus on your
pgbench SELECT workload. That way, I won't have to do any of the hard
work of making abbreviated keys work with a variety of workloads,
while still getting a good idea of the benefits in one specific case.
For this prototype, I can either not do prefix compression to get a
high entropy abbreviated key, or do the prefix compression in a way
that is totally inflexible, but still works well enough for this
initial test workload.

My estimate is that it would take me 4 - 6 weeks to write a prototype
along those lines. That isn't so bad.

--
Peter Geoghegan

#6 Floris Van Nee
florisvannee@Optiver.com
In reply to: Peter Geoghegan (#5)
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

I could do some tests with the patch on some larger machines. What exact
tests do you propose? Are there some specific postgresql.conf settings and
pgbench initialization you recommend for this? And was the test above just
running 'pgbench -S' select-only with specific -T, -j and -c parameters?

With Andres' instructions I ran a couple of tests. With your patches I can reproduce a speedup of ~3% on single-core tests reliably on a dual-socket 36-core machine for the pgbench select-only test case. When using the full machine my results are way too noisy even for large runs, unfortunately. I also tried some other queries (for example SELECTs that return 10 or 100 rows instead of just 1), but can't see much of a speed-up there either, although it also doesn't hurt.

So I guess the most noticeable one is the select-only benchmark for 1 core:

<Master>
transaction type: <builtin: select only>
scaling factor: 300
query mode: prepared
number of clients: 1
number of threads: 1
duration: 600 s
number of transactions actually processed: 30255419
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 50425.693234 (including connections establishing)
tps = 50425.841532 (excluding connections establishing)

<Patched>
transaction type: <builtin: select only>
scaling factor: 300
query mode: prepared
number of clients: 1
number of threads: 1
duration: 600 s
number of transactions actually processed: 31363398
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 52272.326597 (including connections establishing)
tps = 52272.476380 (excluding connections establishing)

This is the one with 40 clients, 40 threads. Not really an improvement, and still quite noisy.
<Master>
transaction type: <builtin: select only>
scaling factor: 300
query mode: prepared
number of clients: 40
number of threads: 40
duration: 600 s
number of transactions actually processed: 876846915
latency average = 0.027 ms
latency stddev = 0.015 ms
tps = 1461407.539610 (including connections establishing)
tps = 1461422.084486 (excluding connections establishing)

<Patched>
transaction type: <builtin: select only>
scaling factor: 300
query mode: prepared
number of clients: 40
number of threads: 40
duration: 600 s
number of transactions actually processed: 872633979
latency average = 0.027 ms
latency stddev = 0.038 ms
tps = 1454387.326179 (including connections establishing)
tps = 1454396.879195 (excluding connections establishing)

For tests that don't use the full machine (e.g. 10 clients, 10 threads) I see speed-ups as well, but not as high as the single-core run. It seems there are other bottlenecks (on the machine) coming into play.

-Floris

#7 Peter Geoghegan
pg@bowt.ie
In reply to: Floris Van Nee (#6)
3 attachment(s)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Tue, Jan 28, 2020 at 1:34 PM Floris Van Nee <florisvannee@optiver.com> wrote:

With Andres' instructions I ran a couple of tests. With your patches I can reproduce a speedup of ~3% on single core tests reliably on a dual-socket 36-core machine for the pgbench select-only test case.

Thanks for testing!

Attached is v2 of the patch series, which makes the changes to 0001-*
requested by Andres. I restructured the loop in a way that allows the
compiler to assume that there will always be at least one loop
iteration -- so I'm not quite as aggressive as I was with v1. We don't
actually delay the call to BTreeTupleGetNAtts() as such in v2.

Can you test this version, Floris? The latter two patches are probably
not helping here, so it would be useful if you could just test 0001-*,
and then test all three together. I can toss them if there is no
additional speedup.

If we're lucky, then Andres will have been right to suspect that there
might be a smaller stall caused by the new branch in the loop that
existed in v1. Maybe this will show up at higher client counts.

I should point out that the deduplication patch changes the definition
of BTreeTupleGetNAtts(), making it slightly more complicated. With the
deduplication patch, we have to check that the tuple isn't a posting
list tuple, which uses the INDEX_ALT_TID_MASK/INDEX_AM_RESERVED_BIT
bit to indicate a non-standard tuple header format, just like the
current pivot tuple format (we need to check a BT_RESERVED_OFFSET_MASK
bit to further differentiate posting list tuples from pivot tuples).
This increase in the complexity of BTreeTupleGetNAtts() will probably
further tip things in favor of this patch.

There are no changes in either 0002-* or 0003-* patches for v2. I'm
including the same patches here a second time for completeness.

--
Peter Geoghegan

Attachments:

v2-0003-Remove-negative-infinity-check-from-_bt_compare.patch (application/x-patch)
From 8f55bcedaa9c109543627e9c785dab0ffb0e5c75 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 20:59:57 -0800
Subject: [PATCH v2 3/3] Remove "negative infinity" check from _bt_compare().

---
 src/backend/access/nbtree/nbtsearch.c | 32 ++++++++++++++++++---------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index dd56fdaaea..b2a2605c47 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -348,6 +348,8 @@ _bt_binsrch(Relation rel,
 				high;
 	int32		result,
 				cmpval;
+	bool		isleaf;
+	OffsetNumber noff;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -359,6 +361,7 @@ _bt_binsrch(Relation rel,
 
 	low = P_FIRSTDATAKEY(opaque);
 	high = PageGetMaxOffsetNumber(page);
+	isleaf = P_ISLEAF(opaque);
 
 	/*
 	 * If there are no keys on the page, return the first available slot. Note
@@ -386,13 +389,20 @@ _bt_binsrch(Relation rel,
 
 	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
+	if (isleaf)
+		noff = InvalidOffsetNumber;
+	else
+		noff = P_FIRSTDATAKEY(opaque);
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare_inl(rel, key, page, mid);
+		if (mid == noff)
+			result = 1;
+		else
+			result = _bt_compare_inl(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -407,7 +417,7 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (isleaf)
 		return low;
 
 	/*
@@ -536,6 +546,16 @@ _bt_compare(Relation rel,
 			Page page,
 			OffsetNumber offnum)
 {
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * Force result ">" if target item is first data item on an internal page
+	 * --- see NOTE above.
+	 */
+	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+		return 1;
+
+
 	return _bt_compare_inl(rel, key, page, offnum);
 }
 
@@ -568,7 +588,6 @@ _bt_compare_inl(Relation rel,
 				OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
 	ItemPointer heapTid;
 	int			ski;
@@ -580,13 +599,6 @@ _bt_compare_inl(Relation rel,
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 	Assert(key->heapkeyspace || key->scantid == NULL);
 
-	/*
-	 * Force result ">" if target item is first data item on an internal page
-	 * --- see NOTE above.
-	 */
-	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
-		return 1;
-
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
-- 
2.17.1

v2-0002-Inline-_bt_compare.patch (application/x-patch)
From aa6d3365da2bcd6fdeaaffb7a4ac0f783538b522 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 20:35:04 -0800
Subject: [PATCH v2 2/3] Inline _bt_compare().

---
 src/backend/access/nbtree/nbtsearch.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7e4cd9c5ad..dd56fdaaea 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,6 +26,8 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static inline int32 _bt_compare_inl(Relation rel, BTScanInsert key, Page page,
+									OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -298,7 +300,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare_inl(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -390,7 +392,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_inl(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -499,7 +501,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_inl(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -528,6 +530,15 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+int32
+_bt_compare(Relation rel,
+			BTScanInsert key,
+			Page page,
+			OffsetNumber offnum)
+{
+	return _bt_compare_inl(rel, key, page, offnum);
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -550,11 +561,11 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  * key.  See backend/access/nbtree/README for details.
  *----------
  */
-int32
-_bt_compare(Relation rel,
-			BTScanInsert key,
-			Page page,
-			OffsetNumber offnum)
+static inline int32
+_bt_compare_inl(Relation rel,
+				BTScanInsert key,
+				Page page,
+				OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-- 
2.17.1

v2-0001-Avoid-pipeline-stall-in-_bt_compare.patch (application/x-patch)
From 73058e520cce39a598a1011ec05f84b4f7ee8d97 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 18:00:38 -0800
Subject: [PATCH v2 1/3] Avoid pipeline stall in _bt_compare().

Author: Peter Geoghegan
Reviewed-By: Andres Freund, Floris Van Nee
Discussion: https://postgr.es/m/CAH2-Wzngz5MrkiTaZNny0GzfTxNQE+QWPPaO-C390BgruhgjEw@mail.gmail.com
---
 src/backend/access/nbtree/nbtsearch.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c573814f01..7e4cd9c5ad 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -560,6 +560,7 @@ _bt_compare(Relation rel,
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
 	ItemPointer heapTid;
+	int			ski;
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
@@ -591,9 +592,11 @@ _bt_compare(Relation rel,
 	 */
 
 	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(ntupatts >= 1);
 	Assert(key->heapkeyspace || ncmpkey == key->keysz);
+	ski = 1;
 	scankey = key->scankeys;
-	for (int i = 1; i <= ncmpkey; i++)
+	for (;;)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -641,7 +644,18 @@ _bt_compare(Relation rel,
 		if (result != 0)
 			return result;
 
+		/*
+		 * The loop is deliberately structured in a way that enables the
+		 * compiler to assume that the first iteration always runs.  Testing
+		 * has shown that this avoids a pipeline stall with certain
+		 * memory-bound workloads.  We delay this test, since it depends on
+		 * whether or not caller's tuple is a pivot tuple.  Typically, most
+		 * calls here never reach this far.
+		 */
+		ski++;
 		scankey++;
+		if (ski > ncmpkey)
+			break;
 	}
 
 	/*
-- 
2.17.1

#8 Floris Van Nee
florisvannee@Optiver.com
In reply to: Peter Geoghegan (#7)
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Can you test this version, Floris? The latter two patches are probably not
helping here, so it would be useful if you could just test 0001-*, and then test
all three together. I can toss them if there is no additional
speedup.

Here are the results for runs with 1, 9 and 30 clients respectively, on master, v2-0001, v2-0001+0002+0003, and for completeness also the previous v1 version of the patches.
I ran the tests for 45 minutes each this time, which seems to give more stable results.
I'd say applying just v2-0001 is actually slightly hurtful for single-core performance. Applying all of them gives a good improvement though. It looks like the performance improvement is also more noticeable at higher core counts now.

<master>
number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 139314796
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 51598.071835 (including connections establishing)
tps = 51598.098715 (excluding connections establishing)

<v2-0001>
number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 137257492
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 50836.107076 (including connections establishing)
tps = 50836.133137 (excluding connections establishing)

<v2-0001+2+3>
number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 141721881
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 52489.584928 (including connections establishing)
tps = 52489.611373 (excluding connections establishing)

<v1-0001+2+3>
number of clients: 1
number of threads: 1
duration: 2700 s
number of transactions actually processed: 141663780
latency average = 0.019 ms
latency stddev = 0.001 ms
tps = 52468.065549 (including connections establishing)
tps = 52468.093018 (excluding connections establishing)

<master>
number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1197242115
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 443422.987601 (including connections establishing)
tps = 443423.306495 (excluding connections establishing)

<v2-0001>
number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1187890004
latency average = 0.020 ms
latency stddev = 0.002 ms
tps = 439959.241392 (including connections establishing)
tps = 439959.588125 (excluding connections establishing)

<v2-0001+2+3>
number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1203412941
latency average = 0.020 ms
latency stddev = 0.002 ms
tps = 445708.478915 (including connections establishing)
tps = 445708.798583 (excluding connections establishing)

<v1-0001+2+3>
number of clients: 9
number of threads: 9
duration: 2700 s
number of transactions actually processed: 1195359533
latency average = 0.020 ms
latency stddev = 0.001 ms
tps = 442725.734267 (including connections establishing)
tps = 442726.052676 (excluding connections establishing)

<master>
number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2617037081
latency average = 0.031 ms
latency stddev = 0.011 ms
tps = 969272.811990 (including connections establishing)
tps = 969273.960316 (excluding connections establishing)

<v2-0001>
number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2736881585
latency average = 0.029 ms
latency stddev = 0.011 ms
tps = 1013659.581348 (including connections establishing)
tps = 1013660.819277 (excluding connections establishing)

<v2-0001+2+3>
number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2844199686
latency average = 0.028 ms
latency stddev = 0.011 ms
tps = 1053407.074721 (including connections establishing)
tps = 1053408.220093 (excluding connections establishing)

<v1-0001+2+3>
number of clients: 30
number of threads: 30
duration: 2700 s
number of transactions actually processed: 2693765822
latency average = 0.030 ms
latency stddev = 0.011 ms
tps = 997690.883117 (including connections establishing)
tps = 997692.051005 (excluding connections establishing)

#9 Peter Geoghegan
pg@bowt.ie
In reply to: Floris Van Nee (#8)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Thu, Jan 30, 2020 at 1:19 AM Floris Van Nee <florisvannee@optiver.com> wrote:

I'd say applying just v2-0001 is actually slightly hurtful for single-core performance. Applying all of them gives a good improvement though. It looks like the performance improvement is also more noticeable at higher core counts now.

Many thanks for testing once again!

Your tests show that the overall winner is "<v2-0001+2+3>", which is
strictly better than all other configurations you tested -- it is at
least slightly better than every other configuration at every client
count tested. I was particularly pleased to see that "<v2-0001+2+3>"
is ~8.6% faster than the master branch with 30 clients! That result
greatly exceeded my expectations.

I have been able to independently confirm that you really need the
first two patches together to see the benefits -- that wasn't clear
until recently.

The interesting thing now is the role of the "negative infinity test"
patch (the 0003-* patch) in all of this. I suspect that it may not be
helping us much here. I wonder, could you test the following
configurations to settle this question?

* <master> with 30 clients (i.e. repeat the test that you reported on
most recently)

* <v2-0001+2+3> with 30 clients (i.e. repeat the test that you
reported got us that nice ~8.6% increase in TPS)

* <v2-0001+2> with 30 clients -- a new test, to see if performance is
at all helped by the "negative infinity test" patch (the 0003-*
patch).

It seems like a good idea to repeat the other two tests as part of
performing this third test, out of general paranoia. Intel seem to
roll out a microcode update for a spectre-like security issue about
every other day.

Thanks again
--
Peter Geoghegan

#10 Floris Van Nee
florisvannee@Optiver.com
In reply to: Peter Geoghegan (#9)
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

The interesting thing now is the role of the "negative infinity test"
patch (the 0003-* patch) in all of this. I suspect that it may not be helping us
much here. I wonder, could you test the following configurations to settle
this question?

* <master> with 30 clients (i.e. repeat the test that you reported on most
recently)

* <v2-0001+2+3> with 30 clients (i.e. repeat the test that you reported got us
that nice ~8.6% increase in TPS)

* <v2-0001+2> with 30 clients -- a new test, to see if performance is at all
helped by the "negative infinity test" patch (the 0003-* patch).

It seems like a good idea to repeat the other two tests as part of performing
this third test, out of general paranoia. Intel seem to roll out a microcode
update for a spectre-like security issue about every other day.

I ran all the tests on two different machines, several times for 1 hour each time. I'm still having a hard time getting reliable results for the 30 clients case though. I'm pretty certain the patches bring a performance benefit, but how high exactly is difficult to say. As for applying only patch 1+2 or all three patches - I found no significant difference between these two cases. It looks like all the performance benefit is in the first two patches.

-Floris

#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Floris Van Nee (#10)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Mon, Feb 10, 2020 at 10:05 PM Floris Van Nee
<florisvannee@optiver.com> wrote:

The interesting thing now is the role of the "negative infinity test"
patch (the 0003-* patch) in all of this. I suspect that it may not be helping us
much here. I wonder, could you test the following configurations to settle
this question?

* <master> with 30 clients (i.e. repeat the test that you reported on most
recently)

* <v2-0001+2+3> with 30 clients (i.e. repeat the test that you reported got us
that nice ~8.6% increase in TPS)

* <v2-0001+2> with 30 clients -- a new test, to see if performance is at all
helped by the "negative infinity test" patch (the 0003-* patch).

It seems like a good idea to repeat the other two tests as part of performing
this third test, out of general paranoia. Intel seem to roll out a microcode
update for a spectre-like security issue about every other day.

I ran all the tests on two different machines, several times for 1 hour each time. I'm still having a hard time getting reliable results for the 30 clients case though. I'm pretty certain the patches bring a performance benefit, but how high exactly is difficult to say. As for applying only patch 1+2 or all three patches - I found no significant difference between these two cases. It looks like all the performance benefit is in the first two patches.

The cfbot seems to be showing "pg_regress: initdb failed" on Ubuntu,
with an assertion failure like this:

#2 0x00000000008e594f in ExceptionalCondition
(conditionName=conditionName@entry=0x949098 "BTreeTupleGetNAtts(itup,
rel) >= key->keysz", errorType=errorType@entry=0x938a7d
"FailedAssertion", fileName=fileName@entry=0x949292 "nbtsearch.c",
lineNumber=lineNumber@entry=620) at assert.c:67
No locals.
#3 0x00000000004fdbaa in _bt_compare_inl (offnum=3,
page=0x7ff7904bdf00 "", key=0x7ffde7c9bfa0, rel=0x7ff7a2325c20) at
nbtsearch.c:620
itup = 0x7ff7904bfec8
heapTid = <optimized out>
ski = <optimized out>
itupdesc = 0x7ff7a2325f50
scankey = <optimized out>
ntupatts = 0

https://travis-ci.org/postgresql-cfbot/postgresql/builds/651843143

It's passing on Windows though, so perhaps there is something
uninitialised or otherwise unstable in the patch?

#12 Peter Geoghegan
pg@bowt.ie
In reply to: Thomas Munro (#11)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Tue, Feb 18, 2020 at 12:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:

The cfbot seems to be showing "pg_regress: initdb failed" on Ubuntu,
with an assertion failure like this:

#2 0x00000000008e594f in ExceptionalCondition
(conditionName=conditionName@entry=0x949098 "BTreeTupleGetNAtts(itup,
rel) >= key->keysz", errorType=errorType@entry=0x938a7d
"FailedAssertion", fileName=fileName@entry=0x949292 "nbtsearch.c",

This is a legitimate bug in v1 of the patch, which was written in a
hurry. v2 does not have the problem. Floris inadvertently created a
separate thread for this same patch, which I responded to when posting
v2. I've now updated the CF entry for this patch [1]https://commitfest.postgresql.org/27/2429/# to have both
threads.

BTW, I've noticed that CF Tester is wonky with patches that have
multiple threads with at least one patch file posted to each thread.
The deduplication patch [2] has this problem, for example. It would be
nice if CF Tester knew to prefer one thread over another based on a
simple rule, like "consistently look for patch files on the first
thread connected to a CF app entry, never any other thread".

Maybe you'd rather not go that way -- I guess that it would break
other cases, such as the CF app entry for this patch (which now
technically has one thread that supersedes the other). Perhaps a
compromise is possible. At a minimum, CF Tester should not look for a
patch on the (say) second thread of a CF app entry for a patch just
because somebody posted an e-mail to that thread (an e-mail that did
not contain a new patch). CF Tester will do this even though there is
a more recent patch on the first thread of the CF app entry, that has
already been accepted as passing by CF Tester. I believe that CF Tester
will actually pingpong back and forth between the same two patches on
each thread as e-mail is sent to each thread, without anybody ever
posting a new patch.

Thanks

[1]: https://commitfest.postgresql.org/27/2429/#
[2]: https://commitfest.postgresql.org/27/2202/
--
Peter Geoghegan

#13 Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#12)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Wed, Feb 19, 2020 at 1:35 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Feb 18, 2020 at 12:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:

The cfbot seems to be showing "pg_regress: initdb failed" on Ubuntu,
with an assertion failure like this:

#2 0x00000000008e594f in ExceptionalCondition
(conditionName=conditionName@entry=0x949098 "BTreeTupleGetNAtts(itup,
rel) >= key->keysz", errorType=errorType@entry=0x938a7d
"FailedAssertion", fileName=fileName@entry=0x949292 "nbtsearch.c",

This is a legitimate bug in v1 of the patch, which was written in a
hurry. v2 does not have the problem. Floris inadvertently created a
separate thread for this same patch, which I responded to when posting
v2. I've now updated the CF entry for this patch [1] to have both
threads.

BTW, I've noticed that CF Tester is wonky with patches that have
multiple threads with at least one patch file posted to each thread.
The deduplication patch [2] has this problem, for example. It would be
nice if CF Tester knew to prefer one thread over another based on a
simple rule, like "consistently look for patch files on the first
thread connected to a CF app entry, never any other thread".

Ahh. Well I had that rule early on, and then had the problem that
some discussions moved entirely to a second or third thread, which left
it testing some ancient stale patch.

Maybe you'd rather not go that way -- I guess that it would break
other cases, such as the CF app entry for this patch (which now
technically has one thread that supersedes the other). Perhaps a
compromise is possible. At a minimum, CF Tester should not look for a
patch on the (say) second thread of a CF app entry for a patch just
because somebody posted an e-mail to that thread (an e-mail that did
not contain a new patch). CF Tester will do this even though there is
a more recent patch on the first thread of the CF app entry, that has
already been accepted as passing by CF Tester. I believe that CF Tester
will actually pingpong back and forth between the same two patches on
each thread as e-mail is sent to each thread, without anybody ever
posting a new patch.

Yeah, for CF entries with multiple threads, it currently looks at
whichever thread has the most recent email on it, and then it finds
the most recent patch set on that thread. Perhaps it should look at
all the registered threads and find the message with the newest patch
set across all of them, as you say. I will look into that.

#14 Peter Geoghegan
pg@bowt.ie
In reply to: Thomas Munro (#13)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Tue, Feb 18, 2020 at 4:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Yeah, for CF entries with multiple threads, it currently looks at
whichever thread has the most recent email on it, and then it finds
the most recent patch set on that thread. Perhaps it should look at
all the registered threads and find the message with the newest patch
set across all of them, as you say. I will look into that.

Thanks!

I know that I am a bit unusual in that I use all of the CF app
features at the same time. But the current behavior of CF Tester
disincentivizes using multiple threads. This works against the goal of
having good metadata about patches that are worked on over multiple
releases or years. We have a fair few of those.

--
Peter Geoghegan

#15 Peter Geoghegan
pg@bowt.ie
In reply to: Floris Van Nee (#10)
2 attachment(s)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Mon, Feb 10, 2020 at 1:05 AM Floris Van Nee <florisvannee@optiver.com> wrote:

I ran all the tests on two different machines, several times for 1 hour each time. I'm still having a hard time getting reliable results for the 30 clients case though. I'm pretty certain the patches bring a performance benefit, but how high exactly is difficult to say. As for applying only patch 1+2 or all three patches - I found no significant difference between these two cases. It looks like all the performance benefit is in the first two patches.

Attached is v3, which no longer includes the third patch/optimization.
It also inlines (in the second patch) by marking the _bt_compare
definition as inline, while not changing anything in nbtree.h. I
believe that this is portable C99 -- let's see what CF Tester thinks
of it.

I'm going to test this myself. It would be nice if you could repeat
something like the previous experiments with v3, Floris. master vs v3
(both patches together). With a variable number of clients.

Thanks
--
Peter Geoghegan

Attachments:

v3-0002-Inline-_bt_compare.patch (application/octet-stream)
From 7c59277066875a1353ed23cb95bb67607760ce19 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 20:35:04 -0800
Subject: [PATCH v3 2/2] Inline _bt_compare().

---
 src/backend/access/nbtree/nbtsearch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 521868cf03..9d7be5b2e2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -550,7 +550,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  * key.  See backend/access/nbtree/README for details.
  *----------
  */
-int32
+inline int32
 _bt_compare(Relation rel,
 			BTScanInsert key,
 			Page page,
-- 
2.17.1

v3-0001-Avoid-pipeline-stall-in-_bt_compare.patch (application/octet-stream)
From 7dcad51ecb4b1abb9c74afb8b19e485f743d6df2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 18:00:38 -0800
Subject: [PATCH v3 1/2] Avoid pipeline stall in _bt_compare().

Author: Peter Geoghegan
Reviewed-By: Andres Freund, Floris Van Nee
Discussion: https://postgr.es/m/CAH2-Wzngz5MrkiTaZNny0GzfTxNQE+QWPPaO-C390BgruhgjEw@mail.gmail.com
---
 src/backend/access/nbtree/nbtsearch.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index df065d72f8..521868cf03 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -560,6 +560,7 @@ _bt_compare(Relation rel,
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
 	ItemPointer heapTid;
+	int			ski;
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
@@ -591,9 +592,11 @@ _bt_compare(Relation rel,
 	 */
 
 	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(ntupatts >= 1);
 	Assert(key->heapkeyspace || ncmpkey == key->keysz);
+	ski = 1;
 	scankey = key->scankeys;
-	for (int i = 1; i <= ncmpkey; i++)
+	for (;;)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -640,7 +643,18 @@ _bt_compare(Relation rel,
 		if (result != 0)
 			return result;
 
+		/*
+		 * The loop is deliberately structured in a way that enables the
+		 * compiler to assume that the first iteration always runs.  Testing
+		 * has shown that this avoids a pipeline stall with certain
+		 * memory-bound workloads.  We delay this test, since it depends on
+		 * whether or not caller's tuple is a pivot tuple.  Typically, most
+		 * calls here never reach this far.
+		 */
+		ski++;
 		scankey++;
+		if (ski > ncmpkey)
+			break;
 	}
 
 	/*
-- 
2.17.1

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#15)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Peter Geoghegan <pg@bowt.ie> writes:

It also inlines _bt_compare() (in the second patch) by marking the
definition as inline, while not changing anything in nbtree.h. I
believe that this is portable C99 -- let's see what CF Tester thinks
of it.

Boy, I'd be pretty darn hesitant to go there, even with our new
expectation of C99 compilers. What we found out when we last experimented
with non-static inlines was that the semantics were not very portable nor
desirable. I've forgotten the details unfortunately. But why do we need
this, and what exactly are you hoping the compiler will do with it?

regards, tom lane

#17 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#16)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Wed, Feb 19, 2020 at 12:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Boy, I'd be pretty darn hesitant to go there, even with our new
expectation of C99 compilers. What we found out when we last experimented
with non-static inlines was that the semantics were not very portable nor
desirable. I've forgotten the details unfortunately.

My original approach to inlining was to alter the nbtsearch.c
_bt_compare() callers (the majority) to call _bt_compare_inl(). This
function matches our current _bt_compare() function, except it's a
static inline. A "new" function, _bt_compare(), is also added. That's a
shim function that simply calls _bt_compare_inl().
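
In code, the shim is just what the earlier 0002-* patch shows:

    /* Out-of-line version, retained for callers outside nbtsearch.c */
    int32
    _bt_compare(Relation rel,
                BTScanInsert key,
                Page page,
                OffsetNumber offnum)
    {
        return _bt_compare_inl(rel, key, page, offnum);
    }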

This earlier approach now seems to work better than the approach I took
in v3. In fact, my overnight testing shows that v3 was a minor loss
for performance. I guess we can scrap the non-static inline thing
right away.

But why do we need
this, and what exactly are you hoping the compiler will do with it?

Well, the patch's approach to inlining prior to v3 was kind of ugly,
and it would have been nice to not have to do it that way. That's all.
Some further refinement is probably possible.

More generally, my goal is to realize a pretty tangible performance
benefit from avoiding a pipeline stall -- this work was driven by a
complaint from Andres. I haven't really explained the reason why the
inlining matters here, and there are a few things that need to be
weighed. I'll have to do some kind of microarchitectural analysis
before proceeding with commit. This is still unsettled.

--
Peter Geoghegan

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#17)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Peter Geoghegan <pg@bowt.ie> writes:

On Wed, Feb 19, 2020 at 12:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Boy, I'd be pretty darn hesitant to go there, even with our new
expectation of C99 compilers. What we found out when we last experimented
with non-static inlines was that the semantics were not very portable nor
desirable. I've forgotten the details unfortunately.

My original approach to inlining was to alter the nbtsearch.c
_bt_compare() callers (the majority) to call _bt_compare_inl(). This
function matches our current _bt_compare() function, except it's a
static inline. A "new" function, _bt_compare(), is also added. That's a
shim function that simply calls _bt_compare_inl().

Yeah, that's pretty much the approach we concluded was necessary
for portable results.

regards, tom lane

#19 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#16)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Hi,

On 2020-02-19 15:55:38 -0500, Tom Lane wrote:

Peter Geoghegan <pg@bowt.ie> writes:

It also inlines (in the second patch) by marking the _bt_compare
definition as inline, while not changing anything in nbtree.h. I
believe that this is portable C99 -- let's see what CF Tester thinks
of it.

Boy, I'd be pretty darn hesitant to go there, even with our new
expectation of C99 compilers. What we found out when we last experimented
with non-static inlines was that the semantics were not very portable nor
desirable. I've forgotten the details unfortunately.

I think most of those problems were about putting extern inlines into
headers however - not about putting an inline onto an extern within one
translation unit only. Given that potential fallout should be within a
single file, and can fairly easily be fixed with adding wrappers etc, I
think it's a fairly low risk experiment to see what the buildfarm
thinks.
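
Concretely, the single-translation-unit variant amounts to this
(a sketch; the body is elided):

/* nbtree.h -- declaration stays a plain extern, no "inline" */
extern int32 _bt_compare(Relation rel, BTScanInsert key,
                         Page page, OffsetNumber offnum);

/* nbtsearch.c -- only the definition gains "inline".  Because not all
 * file-scope declarations carry "inline", C99 (6.7.4) makes this an
 * external definition, so callers in other translation units still
 * link; the keyword is merely a hint here. */
inline int32
_bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum)
{
    /* ... body unchanged ... */
}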

Greetings,

Andres Freund

#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#19)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Andres Freund <andres@anarazel.de> writes:

On 2020-02-19 15:55:38 -0500, Tom Lane wrote:

Boy, I'd be pretty darn hesitant to go there, even with our new
expectation of C99 compilers. What we found out when we last experimented
with non-static inlines was that the semantics were not very portable nor
desirable. I've forgotten the details unfortunately.

I think most of those problems were about putting extern inlines into
headers however - not about putting an inline onto an extern within one
translation unit only. Given that potential fallout should be within a
single file, and can fairly easily be fixed with adding wrappers etc, I
think it's a fairly low risk experiment to see what the buildfarm
thinks.

The buildfarm would only tell you if it compiles, not whether the
performance results are what you hoped for.

regards, tom lane

In reply to: Tom Lane (#20)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Wed, Feb 19, 2020 at 2:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The buildfarm would only tell you if it compiles, not whether the
performance results are what you hoped for.

Right. Plus I think that more granular control might make more sense
in this particular instance anyway.

--
Peter Geoghegan

#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#15)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Peter Geoghegan <pg@bowt.ie> writes:

Attached is v3, which no longer includes the third patch/optimization.
It also inlines (in the second patch) by marking the _bt_compare
definition as inline, while not changing anything in nbtree.h. I
believe that this is portable C99 -- let's see what CF Tester thinks
of it.

The cfbot thinks it doesn't even apply anymore --- conflict with the dedup
patch, no doubt?

regards, tom lane

#23 Floris Van Nee
florisvannee@Optiver.com
In reply to: Tom Lane (#22)
2 attachment(s)
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

The cfbot thinks it doesn't even apply anymore --- conflict with the dedup
patch, no doubt?

Minor conflict with that patch indeed. Attached is a rebased version. I'm running some tests now and will post the results here when finished.

-Floris

Attachments:

v4-0001-Avoid-pipeline-stall-in-_bt_compare.patch
From 9c452649e7db8f8be87dcaf3da4bdee8fec588df Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 18:00:38 -0800
Subject: [PATCH 1/2] Avoid pipeline stall in _bt_compare().

Author: Peter Geoghegan
Reviewed-By: Andres Freund, Floris Van Nee
Discussion: https://postgr.es/m/CAH2-Wzngz5MrkiTaZNny0GzfTxNQE+QWPPaO-C390BgruhgjEw@mail.gmail.com
---
 src/backend/access/nbtree/nbtsearch.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7aaa8c17b0..98642cbccb 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -655,6 +655,7 @@ _bt_compare(Relation rel,
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
 	ItemPointer heapTid;
+	int			ski;
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
@@ -687,10 +688,12 @@ _bt_compare(Relation rel,
 	 */
 
 	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(ntupatts >= 1);
 	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+	ski = 1;
 	scankey = key->scankeys;
-	for (int i = 1; i <= ncmpkey; i++)
+	for (;;)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -736,7 +739,18 @@ _bt_compare(Relation rel,
 		if (result != 0)
 			return result;
 
+		/*
+		 * The loop is deliberately structured in a way that enables the
+		 * compiler to assume that the first iteration always runs.  Testing
+		 * has shown that this avoids a pipeline stall with certain
+		 * memory-bound workloads.  We delay this test, since it depends on
+		 * whether or not caller's tuple is a pivot tuple.  Typically, most
+		 * calls here never reach this far.
+		 */
+		ski++;
 		scankey++;
+		if (ski > ncmpkey)
+			break;
 	}
 
 	/*
-- 
2.25.0

v4-0002-Inline-_bt_compare.patch
From d7f2606a3983e1606823921fd6ff68c2f05d74d8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 20:35:04 -0800
Subject: [PATCH 2/2] Inline _bt_compare().

---
 src/backend/access/nbtree/nbtsearch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 98642cbccb..152dac8bb3 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -645,7 +645,7 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
  * key.  See backend/access/nbtree/README for details.
  *----------
  */
-int32
+inline int32
 _bt_compare(Relation rel,
 			BTScanInsert key,
 			Page page,
-- 
2.25.0

In reply to: Floris Van Nee (#23)
2 attachment(s)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Sun, Mar 1, 2020 at 12:15 PM Floris Van Nee <florisvannee@optiver.com> wrote:

Minor conflict with that patch indeed. Attached is a rebased version. I'm running some tests now and will post the results here when finished.

Thanks.

We're going to have to go back to my original approach to inlining. At
least, it seemed to be necessary to do that to get any benefit from
the patch on my comparatively modest workstation (using a similar
pgbench SELECT benchmark to the one that you ran). Tom also had a
concern about the portability of inlining without using "static
inline" -- that is another reason to avoid the approach to inlining
taken by v3 + v4.

It's possible (though not very likely) that performance has been
impacted by the deduplication patch (commit 0d861bbb), since it
updated the definition of BTreeTupleGetNAtts() itself.
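
For reference, after 0d861bbb the macro is roughly the following (a
paraphrase from memory, not the committed text):

/*
 * Rough paraphrase of BTreeTupleGetNAtts(): pivot tuples store their
 * attribute count in the offset-number bits of t_tid, while all other
 * tuples implicitly have every attribute in the index.  Note the
 * data-dependent load and branch, which is what 0001 keeps off the
 * critical path of the first attribute comparison.
 */
#define BTreeTupleGetNAtts(itup, rel) \
    (BTreeTupleIsPivot(itup) ? \
     (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_OFFSET_MASK) : \
     IndexRelationGetNumberOfAttributes(rel))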

Attached is v5, which inlines in a targeted fashion, pretty much in
the same way as the earliest version. This is the same as v4 in every
other way. Perhaps you can test this.

--
Peter Geoghegan

Attachments:

v5-0001-Avoid-pipeline-stall-in-_bt_compare.patch
From 82252e4abdceb2716d4bb3a89634570e826b50a3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 18:00:38 -0800
Subject: [PATCH v5 1/2] Avoid pipeline stall in _bt_compare().

Author: Peter Geoghegan
Reviewed-By: Andres Freund, Floris Van Nee
Discussion: https://postgr.es/m/CAH2-Wzngz5MrkiTaZNny0GzfTxNQE+QWPPaO-C390BgruhgjEw@mail.gmail.com
---
 src/backend/access/nbtree/nbtsearch.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7aaa8c17b0..98642cbccb 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -655,6 +655,7 @@ _bt_compare(Relation rel,
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
 	ItemPointer heapTid;
+	int			ski;
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
@@ -687,10 +688,12 @@ _bt_compare(Relation rel,
 	 */
 
 	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(ntupatts >= 1);
 	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
+	ski = 1;
 	scankey = key->scankeys;
-	for (int i = 1; i <= ncmpkey; i++)
+	for (;;)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -736,7 +739,18 @@ _bt_compare(Relation rel,
 		if (result != 0)
 			return result;
 
+		/*
+		 * The loop is deliberately structured in a way that enables the
+		 * compiler to assume that the first iteration always runs.  Testing
+		 * has shown that this avoids a pipeline stall with certain
+		 * memory-bound workloads.  We delay this test, since it depends on
+		 * whether or not caller's tuple is a pivot tuple.  Typically, most
+		 * calls here never reach this far.
+		 */
+		ski++;
 		scankey++;
+		if (ski > ncmpkey)
+			break;
 	}
 
 	/*
-- 
2.17.1

v5-0002-Inline-_bt_compare.patch
From d584cc33669c036b10b23c67f2ea6a7641abfee7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 22 Jan 2020 20:35:04 -0800
Subject: [PATCH v5 2/2] Inline _bt_compare().

---
 src/backend/access/nbtree/nbtsearch.c | 29 +++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 98642cbccb..869e779b84 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,6 +28,8 @@ static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
+static inline int32 _bt_compare_inl(Relation rel, BTScanInsert key, Page page,
+									OffsetNumber offnum, bool isleaf);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -307,7 +309,8 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare_inl(rel, key, page, P_HIKEY,
+												P_ISLEAF(opaque)) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -355,6 +358,7 @@ _bt_binsrch(Relation rel,
 				high;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -366,6 +370,7 @@ _bt_binsrch(Relation rel,
 
 	low = P_FIRSTDATAKEY(opaque);
 	high = PageGetMaxOffsetNumber(page);
+	isleaf = P_ISLEAF(opaque);
 
 	/*
 	 * If there are no keys on the page, return the first available slot. Note
@@ -399,7 +404,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_inl(rel, key, page, mid, isleaf);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -414,7 +419,7 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (isleaf)
 		return low;
 
 	/*
@@ -512,7 +517,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_inl(rel, key, page, mid, true);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -650,6 +655,18 @@ _bt_compare(Relation rel,
 			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
+{
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	return _bt_compare_inl(rel, key, page, offnum, P_ISLEAF(opaque));
+}
+
+static inline int32
+_bt_compare_inl(Relation rel,
+				BTScanInsert key,
+				Page page,
+				OffsetNumber offnum,
+				bool isleaf)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -667,9 +684,9 @@ _bt_compare(Relation rel,
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
-	 * --- see NOTE above.
+	 * --- see NOTE above _bt_compare().
 	 */
-	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+	if (!isleaf && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-- 
2.17.1

#25 Floris Van Nee
florisvannee@Optiver.com
In reply to: Peter Geoghegan (#24)
RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Attached is v5, which inlines in a targeted fashion, pretty much in the same
way as the earliest version. This is the same as v4 in every other way.
Perhaps you can test this.

Thank you for the new patch. With the new one I am indeed able to reproduce a performance increase. It is very difficult to reproduce reliably when running on a large number of cores, though, due to the NUMA architecture.

For tests with a small number of cores, I pin all of them to the same NUMA node. With that, I see a significant performance increase for v5 compared to master. However, if I pin pgbench to a different node than the one Postgres is pinned to, that alone causes a 20% performance degradation relative to having everything on the same node, and doubles the stddev (regardless of patch). Under those conditions it becomes very difficult to see any performance increase from the patch.

For a large number of pgbench workers, I cannot specifically pin each pgbench worker to the same node as the Postgres backend it is connected to. Leaving placement to the OS gives very unreliable results: when I run the 30 workers / 30 connections test, I sometimes see periods of up to 30 minutes where the scheduling works out 'correctly', but it can also run at the -20% level for a long time. So far my best explanation is the NUMA penalty. I'd like to be able to schedule some Postgres backends on one node and others on a different node, but that isn't straightforward; see the sketch below for the kind of pinning I have in mind.
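
A hypothetical libnuma helper along those lines (an untested sketch,
not something from my actual test setup; assumes linking with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling process (e.g. a harness, before it forks pgbench or
 * connects a backend) to a single NUMA node. */
static void
pin_to_node(int node)
{
    if (numa_available() < 0)
    {
        fprintf(stderr, "NUMA is not available on this system\n");
        exit(EXIT_FAILURE);
    }
    if (numa_run_on_node(node) != 0)    /* restrict CPUs to this node */
    {
        perror("numa_run_on_node");
        exit(EXIT_FAILURE);
    }
    numa_set_preferred(node);           /* prefer node-local memory too */
}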

For now, though, I see no issues with the patch. However, in real-life situations there may be other, more important optimizations for people who use big multi-node machines.

Thoughts?

-Floris

#26 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Geoghegan (#24)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Mar 2, 2020, at 5:29 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Mar 1, 2020 at 12:15 PM Floris Van Nee <florisvannee@optiver.com> wrote:

Minor conflict with that patch indeed. Attached is a rebased version. I'm running some tests now and will post the results here when finished.

Thanks.

We're going to have to go back to my original approach to inlining. At
least, it seemed to be necessary to do that to get any benefit from
the patch on my comparatively modest workstation (using a similar
pgbench SELECT benchmark to the one that you ran). Tom also had a
concern about the portability of inlining without using "static
inline" -- that is another reason to avoid the approach to inlining
taken by v3 + v4.

It's possible (though not very likely) that performance has been
impacted by the deduplication patch (commit 0d861bbb), since it
updated the definition of BTreeTupleGetNAtts() itself.

Attached is v5, which inlines in a targeted fashion, pretty much in
the same way as the earliest version. This is the same as v4 in every
other way. Perhaps you can test this.

Hi Peter, just a quick code review:

The v5 patch files apply and pass the regression tests. I get no errors.

The performance impact is within the noise. The TPS with the patch are higher sometimes and lower other times, looking across the 27 subtests of the "select-only" benchmark. Which subtest is slower or faster changes per run, so that doesn't seem to be relevant. I ran the "select-only" benchmark six times (three with the patch, three without).

The change you made to the loop is clearly visible in the nbtsearch.s output, and the change to inline _bt_compare is even more visible, so there doesn't seem to be any defeating of your change due to the compiler ignoring the "inline" or such. I compiled using gcc -O2:

% gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

2.4 GHz 8-Core Intel Core i9
32 GB 2667 MHz DDR4

Reading this thread, I think the lack of a performance impact on laptop hardware was expected, but perhaps confirmation that it does not make things worse is useful?

Since this patch doesn't seem to do any harm, I would mark it as "ready for committer", except that there doesn't yet seem to be enough evidence that it is a net win.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27 Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Mark Dilger (#26)
Re: RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

Status update for a commitfest entry.

This thread was inactive for a while. The latest review suggests that it is Ready For Committer.
I also took a quick look at the patch and agree that it looks sensible. Maybe add a comment before _bt_compare_inl() to explain the need for this code change.

The new status of this patch is: Ready for Committer

In reply to: Anastasia Lubennikova (#27)
Re: RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Mon, Nov 2, 2020 at 9:46 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

This thread was inactive for a while. The latest review suggests that it is Ready For Committer.
I also took a quick look at the patch and agree that it looks sensible. Maybe add a comment before _bt_compare_inl() to explain the need for this code change.

Actually I am probably going to withdraw this patch soon. The idea is
a plausible way of improving things. But at the same time I cannot
really demonstrate any benefit on hardware that I have access to.

This patch was something I worked on based on a private complaint from
Andres (who is CC'd here now) during an informal conversation at pgDay
SF. If Andres is now in a position to test the patch and can show a
benefit on certain hardware, I may well pick it back up. But as things
stand the evidence in support of the patch is pretty weak. I'm not
going to commit a patch like this unless and until it's crystal clear
what the benefits are.

If Andres cannot spend any time on this in the foreseeable future, then
I'll withdraw the patch. I intend to formally withdraw it on November
9th, provided no new information comes to light.

Thanks
--
Peter Geoghegan

In reply to: Mark Dilger (#26)
Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Thu, May 28, 2020 at 12:35 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Reading this thread, I think the lack of a performance impact on laptop hardware was expected, but perhaps confirmation that it does not make things worse is useful?

Since this patch doesn't seem to do any harm, I would mark it as "ready for committer", except that there doesn't yet seem to be enough evidence that it is a net win.

Thank you for testing my patch. Sorry for the delay in getting back to this.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#28)
Re: RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

On Mon, Nov 2, 2020 at 1:04 PM Peter Geoghegan <pg@bowt.ie> wrote:

If Andres cannot spend any time on this in the foreseeable future, then
I'll withdraw the patch. I intend to formally withdraw it on November
9th, provided no new information comes to light.

I have now formally withdrawn the patch in the CF app.

Thanks
--
Peter Geoghegan