use CLZ instruction in AllocSetFreeIndex()
Hi all,
In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup
table. At the time, using CLZ was rejected because compiler/platform
support was not widespread enough to justify it. For other reasons, we
recently added bitutils.h which uses __builtin_clz() where available,
so it makes sense to revisit this. I modified the test in [1] (C files
attached), using two separate functions to test CLZ versus the
open-coded algorithm of pg_leftmost_one_pos32().
These are typical results on a recent Intel platform:
HEAD 5.55s
clz 4.51s
open-coded 9.67s
CLZ gives a nearly 20% speed boost on this microbenchmark. I suspect
that this micro-benchmark is actually biased towards the lookup table
more than real-world workloads, because it can monopolize the L1
cache. Sparing cache is possibly the more interesting reason to use
CLZ. The open-coded version is nearly twice as slow, so it makes sense
to keep the current implementation as the default one, and not use
pg_leftmost_one_pos32() directly. However, with a small tweak, we can
reuse the lookup table in bitutils.c instead of the bespoke one used
solely by AllocSetFreeIndex(), saving a couple cache lines here also.
This is done in the attached patch.
[1]: /messages/by-id/407d949e0907201811i13c73e18x58295566d27aadcc@mail.gmail.com
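For reference, the non-CLZ fallback inside pg_leftmost_one_pos32() looks
roughly like the sketch below (paraphrased; see pg_bitutils.h for the exact
code, and the helper name here is ours). The byte-search loop is the extra
work that AllocSetFreeIndex() can avoid by assuming the value fits in two
bytes:

#include "postgres.h"

#include "port/pg_bitutils.h"

/* paraphrase of the non-CLZ path of pg_leftmost_one_pos32() */
static inline int
leftmost_one_pos32_fallback(uint32 word)
{
	int			shift = 32 - 8;

	Assert(word != 0);

	/* find the highest non-zero byte -- this loop is the cost in question */
	while ((word >> shift) == 0)
		shift -= 8;

	/* then look up the leftmost one bit within that byte */
	return shift + pg_leftmost_one_pos[word >> shift];
}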
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-Use-the-CLZ-instruction-in-AllocSetFreeIndex.patch (application/octet-stream)
From 077f88bcbff74d29b64459fcdac3096a28d07b72 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Thu, 26 Dec 2019 18:28:50 -0500
Subject: [PATCH] Use the CLZ instruction in AllocSetFreeIndex()
In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup
table. At the time, using CLZ was rejected because compiler/platform
support was not widespread enough to justify it. Since 02a6a54ecd6,
we test for availability of __builtin_clz(), so use that instead. This
is about 20% faster on Intel platforms, but perhaps more importantly
reduces cache pollution caused by the lookup table approach.
In addition, for the open-coded case, use the general-purpose lookup
table added by 02a6a54ecd6, rather than a single-purpose one. This
allows platforms without CLZ to reduce cache pollution as well.
---
src/backend/utils/mmgr/aset.c | 35 ++++++++++++++---------------------
1 file changed, 14 insertions(+), 21 deletions(-)
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index f729d9b6de..137c0b8ee5 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -46,6 +46,7 @@
#include "postgres.h"
+#include "port/pg_bitutils.h"
#include "utils/memdebug.h"
#include "utils/memutils.h"
@@ -297,18 +298,6 @@ static const MemoryContextMethods AllocSetMethods = {
#endif
};
-/*
- * Table for AllocSetFreeIndex
- */
-#define LT16(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
-
-static const unsigned char LogTable256[256] =
-{
- 0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
- LT16(5), LT16(6), LT16(6), LT16(7), LT16(7), LT16(7), LT16(7),
- LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8)
-};
-
/* ----------
* Debug macros
* ----------
@@ -337,8 +326,7 @@ static inline int
AllocSetFreeIndex(Size size)
{
int idx;
- unsigned int t,
- tsize;
+ unsigned int tsize;
if (size > (1 << ALLOC_MINBITS))
{
@@ -346,15 +334,20 @@ AllocSetFreeIndex(Size size)
/*
* At this point we need to obtain log2(tsize)+1, ie, the number of
- * not-all-zero bits at the right. We used to do this with a
- * shift-and-count loop, but this function is enough of a hotspot to
- * justify micro-optimization effort. The best approach seems to be
- * to use a lookup table. Note that this code assumes that
- * ALLOCSET_NUM_FREELISTS <= 17, since we only cope with two bytes of
- * the tsize value.
+ * not-all-zero bits at the right. We don't use the utility function
+ * pg_leftmost_one_pos32() here because if CLZ is not available,
+ * determining the correct shift has a performance penalty.
+ * By assuming that ALLOCSET_NUM_FREELISTS <= 17, we only need to
+ * cope with two bytes of the tsize value.
*/
+#ifdef HAVE__BUILTIN_CLZ
+ idx = 32 - __builtin_clz((uint32) tsize);
+#else
+ unsigned int t;
t = tsize >> 8;
- idx = t ? LogTable256[t] + 8 : LogTable256[tsize];
+ idx = t ? pg_leftmost_one_pos[t] + 8 : pg_leftmost_one_pos[tsize];
+ idx += 1;
+#endif
Assert(idx < ALLOCSET_NUM_FREELISTS);
}
--
2.22.0
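As a sanity check on the substitution in this patch, a small standalone
program (assuming ALLOC_MINBITS == 3 and ALLOCSET_NUM_FREELISTS == 11, the
values in aset.c at the time, and a compiler that provides __builtin_clz)
can confirm that 32 - __builtin_clz(tsize) agrees with the old LogTable256
computation for every chunk size up to ALLOC_CHUNK_LIMIT:

#include <assert.h>

#define ALLOC_MINBITS 3
#define ALLOCSET_NUM_FREELISTS 11
#define ALLOC_CHUNK_LIMIT (1 << (ALLOCSET_NUM_FREELISTS - 1 + ALLOC_MINBITS))

#define LT16(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
static const unsigned char LogTable256[256] =
{
	0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
	LT16(5), LT16(6), LT16(6), LT16(7), LT16(7), LT16(7), LT16(7),
	LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8)
};

int
main(void)
{
	unsigned int size;

	for (size = (1 << ALLOC_MINBITS) + 1; size <= ALLOC_CHUNK_LIMIT; size++)
	{
		unsigned int tsize = (size - 1) >> ALLOC_MINBITS;
		unsigned int t = tsize >> 8;
		int			old_idx = t ? LogTable256[t] + 8 : LogTable256[tsize];
		int			new_idx = 32 - __builtin_clz(tsize);

		assert(old_idx == new_idx);
		assert(new_idx < ALLOCSET_NUM_FREELISTS);
	}
	return 0;
}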
On 2019-Dec-26, John Naylor wrote:
In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup
table. At the time, using CLZ was rejected because compiler/platform
support was not widespread enough to justify it. For other reasons, we
recently added bitutils.h which uses __builtin_clz() where available,
so it makes sense to revisit this. I modified the test in [1] (C files
attached), using two separate functions to test CLZ versus the
open-coded algorithm of pg_leftmost_one_pos32().
These are typical results on a recent Intel platform:
HEAD 5.55s
clz 4.51s
open-coded 9.67s
I can confirm these results on my Intel laptop. I ran it with a
repetition of 20, averages of 4 runs:
clz 1,614
bitutils 3,714
current 2,088
(stddevs are under 0,031).
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
On 2019-Dec-26, John Naylor wrote:
In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup
table. At the time, using CLZ was rejected because compiler/platform
support was not widespread enough to justify it. For other reasons, we
recently added bitutils.h which uses __builtin_clz() where available,
so it makes sense to revisit this. I modified the test in [1] (C files
attached), using two separate functions to test CLZ versus the
open-coded algorithm of pg_leftmost_one_pos32().
I can confirm these results on my Intel laptop. I ran it with a
repetition of 20, averages of 4 runs:
I tried this on a few other architectures --- ppc32, aarch64, and x86
(not 64). The general contours of the result are the same on all,
eg here's the results on aarch64 (Fedora 28):
$ ./a.out 100
...
clz 22.713s
bitutils func 59.462s
current 30.630s
This kind of leads me to wonder if we don't need to expend more
effort on the non-CLZ version of pg_leftmost_one_pos32; it seems
like it shouldn't be losing this badly to the only-slightly-
improved logic that's currently in AllocSetFreeIndex. On the
other hand, the buildfarm thinks that __builtin_clz is essentially
universal these days --- the only active non-MSVC critter that
reports not having it is anole. So maybe it's not worth sweating
over that. Perhaps what we really ought to be working on is
finding MSVC equivalents for __builtin_clz and friends.
Anyway, getting back to the presented patch, I find myself a bit
dissatisfied with it because it seems like it's leaving something
on the table. Specifically, looking at the generated assembly
code on a couple of architectures, the setup logic generated by
tsize = (size - 1) >> ALLOC_MINBITS;
looks like it costs as much or more as the clz proper. I'm not
sure we can get rid of the subtract-one step, but couldn't the
right shift be elided in favor of changing the constant we
subtract clz's result from? Shifting off those bits separately
made sense in the old implementation, but assuming that CLZ is
more or less constant-time, it doesn't with CLZ.
regards, tom lane
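One way to make that suggestion concrete is to count the bits of size - 1
directly and fold ALLOC_MINBITS into the constant instead of shifting first.
A quick standalone check (a sketch only, assuming ALLOC_MINBITS == 3 and a
compiler with __builtin_clz) that the shifted and unshifted expressions agree
over the range of interest:

#include <assert.h>

#define ALLOC_MINBITS 3

int
main(void)
{
	unsigned int size;

	for (size = (1 << ALLOC_MINBITS) + 1; size <= 8192; size++)
	{
		int			with_shift = 32 - __builtin_clz((size - 1) >> ALLOC_MINBITS);
		int			no_shift = 31 - __builtin_clz(size - 1) - ALLOC_MINBITS + 1;

		assert(with_shift == no_shift);
	}
	return 0;
}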
On Fri, Dec 27, 2019 at 9:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Anyway, getting back to the presented patch, I find myself a bit
dissatisfied with it because it seems like it's leaving something
on the table. Specifically, looking at the generated assembly
code on a couple of architectures, the setup logic generated by
tsize = (size - 1) >> ALLOC_MINBITS;
looks like it costs as much or more as the clz proper. I'm not
sure we can get rid of the subtract-one step,
As I understand it, the desired outcome is ceil(log2(size)), which can
be computed by clz(size - 1) + 1.
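Stated with __builtin_clz, that identity is: for size >= 2,
ceil(log2(size)) == 32 - __builtin_clz(size - 1), i.e.
pg_leftmost_one_pos32(size - 1) + 1. A tiny standalone check of that (a
sketch, assuming a compiler with __builtin_clz; not part of the attached
files):

#include <assert.h>

int
main(void)
{
	unsigned int size;

	for (size = 2; size <= 8192; size++)
	{
		int			ceil_log2 = 0;

		while ((1U << ceil_log2) < size)
			ceil_log2++;

		assert(ceil_log2 == 32 - __builtin_clz(size - 1));
	}
	return 0;
}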
but couldn't the
right shift be elided in favor of changing the constant we
subtract clz's result from? Shifting off those bits separately
made sense in the old implementation, but assuming that CLZ is
more or less constant-time, it doesn't with CLZ.
That makes sense -- I'll look into doing that.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
John Naylor <john.naylor@2ndquadrant.com> writes:
On Fri, Dec 27, 2019 at 9:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
... but couldn't the
right shift be elided in favor of changing the constant we
subtract clz's result from? Shifting off those bits separately
made sense in the old implementation, but assuming that CLZ is
more or less constant-time, it doesn't with CLZ.
That makes sense -- I'll look into doing that.
Actually, we could apply that insight to both code paths.
In the existing path, that requires assuming
ALLOCSET_NUM_FREELISTS+ALLOC_MINBITS <= 17, but that's OK.
(Nowadays I'd probably add a StaticAssert about that.)
regards, tom lane
On 2019-Dec-27, Tom Lane wrote:
This kind of leads me to wonder if we don't need to expend more
effort on the non-CLZ version of pg_leftmost_one_pos32; it seems
like it shouldn't be losing this badly to the only-slightly-
improved logic that's currently in AllocSetFreeIndex. On the
other hand, the buildfarm thinks that __builtin_clz is essentially
universal these days --- the only active non-MSVC critter that
reports not having it is anole. So maybe it's not worth sweating
over that. Perhaps what we really ought to be working on is
finding MSVC equivalents for __builtin_clz and friends.
Apparently clz() can be written using _BitScanReverse(), per
https://stackoverflow.com/a/20468180
https://docs.microsoft.com/en-us/cpp/intrinsics/bitscanreverse-bitscanreverse64?view=vs-2015
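A rough sketch of what such a wrapper might look like (the helper name is
hypothetical, this is not something pg_bitutils.h provides, and it is
untested against older MSVC versions):

#ifdef _MSC_VER
#include <intrin.h>
#pragma intrinsic(_BitScanReverse)

/* hypothetical clz() replacement; like __builtin_clz, word must be nonzero */
static inline int
pg_clz32_msvc(unsigned long word)
{
	unsigned long pos;

	/* _BitScanReverse stores the index of the highest set bit */
	_BitScanReverse(&pos, word);
	return 31 - (int) pos;
}
#endif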
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Dec 27, 2019 at 01:29:47PM -0300, Alvaro Herrera wrote:
On 2019-Dec-27, Tom Lane wrote:
This kind of leads me to wonder if we don't need to expend more
effort on the non-CLZ version of pg_leftmost_one_pos32; it seems
like it shouldn't be losing this badly to the only-slightly-
improved logic that's currently in AllocSetFreeIndex. On the
other hand, the buildfarm thinks that __builtin_clz is essentially
universal these days --- the only active non-MSVC critter that
reports not having it is anole. So maybe it's not worth sweating
over that. Perhaps what we really ought to be working on is
finding MSVC equivalents for __builtin_clz and friends.
Apparently clz() can be written using _BitScanReverse(), per
https://stackoverflow.com/a/20468180
https://docs.microsoft.com/en-us/cpp/intrinsics/bitscanreverse-bitscanreverse64?view=vs-2015
There are also various flavors of lzcntN for N in (16,32,64) on
(relatively) modern architectures.
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
On 2019-Dec-27, Tom Lane wrote:
... Perhaps what we really ought to be working on is
finding MSVC equivalents for __builtin_clz and friends.
Apparently clz() can be written using _BitScanReverse(), per
https://stackoverflow.com/a/20468180
https://docs.microsoft.com/en-us/cpp/intrinsics/bitscanreverse-bitscanreverse64?view=vs-2015
Yeah, I found that too. It looks promising, but we need to look into
* portability to different MSVC versions? (I guess the buildfarm would
tell us)
* performance, does it actually generate comparable code?
* intrinsics for the other bit instructions we use?
regards, tom lane
On Fri, Dec 27, 2019 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
John Naylor <john.naylor@2ndquadrant.com> writes:
On Fri, Dec 27, 2019 at 9:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
... but couldn't the
right shift be elided in favor of changing the constant we
subtract clz's result from? Shifting off those bits separately
made sense in the old implementation, but assuming that CLZ is
more or less constant-time, it doesn't with CLZ.
That makes sense -- I'll look into doing that.
Actually, we could apply that insight to both code paths.
In the existing path, that requires assuming
ALLOCSET_NUM_FREELISTS+ALLOC_MINBITS <= 17, but that's OK.
(Nowadays I'd probably add a StaticAssert about that.)
I tried that in the attached files and got these results:
current 6.14s
clz 4.52s
clz no right shift 3.15s
lookup table 5.56s
lookup table no right shift 7.34s
Here, "lookup table" refers to using the pg_leftmost_one_pos[] array
and incrementing the result. Removing the shift operation from the CLZ
case is clearly an improvement, and the main body goes from
movabsq $34359738367, %rax
addq %rax, %rdi
shrq $3, %rdi
bsrl %edi, %eax
xorl $-32, %eax
addl $33, %eax
to
decl %edi
bsrl %edi, %eax
xorl $-32, %eax
addl $30, %eax
The lookup table case is less clear. Removing the shift results in
assembly that looks more like the C code and is slower for me. The
standard lookup table code uses some magic constants and does its own
constant folding by shifting 11 (8 + 3). In the absence of testing on
platforms that will actually exercise this path, it seems the
open-coded path should keep the shift for now. Thoughts?
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Dec 27, 2019 at 07:02:02PM -0500, John Naylor wrote:
On Fri, Dec 27, 2019 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
John Naylor <john.naylor@2ndquadrant.com> writes:
On Fri, Dec 27, 2019 at 9:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
... but couldn't the
right shift be elided in favor of changing the constant we
subtract clz's result from? Shifting off those bits separately
made sense in the old implementation, but assuming that CLZ is
more or less constant-time, it doesn't with CLZ.
That makes sense -- I'll look into doing that.
Actually, we could apply that insight to both code paths.
In the existing path, that requires assuming
ALLOCSET_NUM_FREELISTS+ALLOC_MINBITS <= 17, but that's OK.
(Nowadays I'd probably add a StaticAssert about that.)
I tried that in the attached files and got these results:
current 6.14s
clz 4.52s
clz no right shift 3.15s
lookup table 5.56s
lookup table no right shift 7.34s
Here, "lookup table" refers to using the pg_leftmost_one_pos[] array
and incrementing the result. Removing the shift operation from the CLZ
case is clearly an improvement, and the main body goes from
movabsq $34359738367, %rax
addq %rax, %rdi
shrq $3, %rdi
bsrl %edi, %eax
xorl $-32, %eax
addl $33, %eax
to
decl %edi
bsrl %edi, %eax
xorl $-32, %eax
addl $30, %eax
The lookup table case is less clear. Removing the shift results in
assembly that looks more like the C code and is slower for me. The
standard lookup table code uses some magic constants and does its own
constant folding by shifting 11 (8 + 3). In the absence of testing on
platforms that will actually exercise this path, it seems the
open-coded path should keep the shift for now. Thoughts?
It's probably worth doing the things you've found unambiguous gains
for as a patch, putting it on the next commitfest, and seeing what the
commitfest.cputube.org machinery has to say about it.
Maybe it'd be worth trying out a patch that enables CLZ for Windows,
but that seems like a separate issue.
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Fri, Dec 27, 2019 at 9:16 PM David Fetter <david@fetter.org> wrote:
On Fri, Dec 27, 2019 at 07:02:02PM -0500, John Naylor wrote:
The lookup table case is less clear. Removing the shift results in
assembly that looks more like the C code and is slower for me. The
standard lookup table code uses some magic constants and does its own
constant folding by shifting 11 (8 + 3). In the absence of testing on
platforms that will actually exercise this path, it seems the
open-coded path should keep the shift for now. Thoughts?
It's probably worth doing the things you've found unambiguous gains
for as a patch, putting it on the next commitfest, and seeing what the
commitfest.cputube.org machinery has to say about it.
Done in the attached.
Maybe it'd be worth trying out a patch that enables CLZ for Windows,
but that seems like a separate issue.
Agreed.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v2-0001-Use-the-CLZ-instruction-in-AllocSetFreeIndex.patch (application/octet-stream)
From e3a1be4a356434b10007196e85136ed4f9501304 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Thu, 26 Dec 2019 18:28:50 -0500
Subject: [PATCH v2] Use the CLZ instruction in AllocSetFreeIndex()
In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup
table. At the time, using CLZ was rejected because compiler/platform
support was not widespread enough to justify it. Since 02a6a54ecd6,
we test for availability of __builtin_clz(), so use that instead. This
is about 50% faster on Intel platforms, but perhaps more importantly
eliminates cache pollution caused by the lookup table approach.
In addition, for the open-coded case, use the general-purpose lookup
table added by 02a6a54ecd6, rather than a single-purpose one.
---
src/backend/utils/mmgr/aset.c | 50 ++++++++++++++++++-----------------
1 file changed, 26 insertions(+), 24 deletions(-)
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index f729d9b6de..bc05ea906a 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -46,6 +46,7 @@
#include "postgres.h"
+#include "port/pg_bitutils.h"
#include "utils/memdebug.h"
#include "utils/memutils.h"
@@ -297,18 +298,6 @@ static const MemoryContextMethods AllocSetMethods = {
#endif
};
-/*
- * Table for AllocSetFreeIndex
- */
-#define LT16(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
-
-static const unsigned char LogTable256[256] =
-{
- 0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
- LT16(5), LT16(6), LT16(6), LT16(7), LT16(7), LT16(7), LT16(7),
- LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8)
-};
-
/* ----------
* Debug macros
* ----------
@@ -337,24 +326,37 @@ static inline int
AllocSetFreeIndex(Size size)
{
int idx;
- unsigned int t,
- tsize;
if (size > (1 << ALLOC_MINBITS))
{
- tsize = (size - 1) >> ALLOC_MINBITS;
-
/*
- * At this point we need to obtain log2(tsize)+1, ie, the number of
- * not-all-zero bits at the right. We used to do this with a
- * shift-and-count loop, but this function is enough of a hotspot to
- * justify micro-optimization effort. The best approach seems to be
- * to use a lookup table. Note that this code assumes that
- * ALLOCSET_NUM_FREELISTS <= 17, since we only cope with two bytes of
- * the tsize value.
+ * At this point we compute ceil(log2(size >> ALLOC_MINBITS)).
+ * We can do this quickly via the number of not-all-zero bits at
+ * the right. We could simply use
+ *
+ * pg_leftmost_one_pos32((size - 1) >> ALLOC_MINBITS) + 1
+ *
+ * for this, but duplicating the logic here affords us additional
+ * optimizations:
+ *
+ * 1. The compiler can fold ALLOC_MINBITS into other constants,
+ * rather than right-shifting as a separate step.
+ * 2. In the open-coded case, we only need to cope with two
+ * bytes of the size value.
*/
+#ifdef HAVE__BUILTIN_CLZ
+ idx = 31 - __builtin_clz((uint32) size - 1) - ALLOC_MINBITS + 1;
+#else
+ unsigned int t,
+ tsize;
+
+ StaticAssertStmt(ALLOCSET_NUM_FREELISTS + ALLOC_MINBITS <= 17, "");
+
+ tsize = (size - 1) >> ALLOC_MINBITS;
t = tsize >> 8;
- idx = t ? LogTable256[t] + 8 : LogTable256[tsize];
+ idx = t ? pg_leftmost_one_pos[t] + 8 : pg_leftmost_one_pos[tsize];
+ idx += 1;
+#endif
Assert(idx < ALLOCSET_NUM_FREELISTS);
}
--
2.22.0
v2 had an Assert that was only correct while experimenting with
eliding right shift. Fixed in v3.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v3-0001-Use-the-CLZ-instruction-in-AllocSetFreeIndex.patch (application/octet-stream)
From e0111ed0b75e824f46c3b0bf5322a9bd355d100c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Thu, 26 Dec 2019 18:28:50 -0500
Subject: [PATCH v3] Use the CLZ instruction in AllocSetFreeIndex()
In commit ab5b4e2f9ed, we optimized AllocSetFreeIndex() using a lookup
table. At the time, using CLZ was rejected because compiler/platform
support was not widespread enough to justify it. Since 02a6a54ecd6,
we test for availability of __builtin_clz(), so use that instead. This
is about twice as fast on Intel platforms, but perhaps more importantly
eliminates cache pollution caused by the lookup table approach.
In addition, for the open-coded case, use the general-purpose lookup
table added by 02a6a54ecd6, rather than a single-purpose one.
---
src/backend/utils/mmgr/aset.c | 50 ++++++++++++++++++-----------------
1 file changed, 26 insertions(+), 24 deletions(-)
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index f729d9b6de..a89ef7084b 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -46,6 +46,7 @@
#include "postgres.h"
+#include "port/pg_bitutils.h"
#include "utils/memdebug.h"
#include "utils/memutils.h"
@@ -297,18 +298,6 @@ static const MemoryContextMethods AllocSetMethods = {
#endif
};
-/*
- * Table for AllocSetFreeIndex
- */
-#define LT16(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
-
-static const unsigned char LogTable256[256] =
-{
- 0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
- LT16(5), LT16(6), LT16(6), LT16(7), LT16(7), LT16(7), LT16(7),
- LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8)
-};
-
/* ----------
* Debug macros
* ----------
@@ -337,24 +326,37 @@ static inline int
AllocSetFreeIndex(Size size)
{
int idx;
- unsigned int t,
- tsize;
if (size > (1 << ALLOC_MINBITS))
{
- tsize = (size - 1) >> ALLOC_MINBITS;
-
/*
- * At this point we need to obtain log2(tsize)+1, ie, the number of
- * not-all-zero bits at the right. We used to do this with a
- * shift-and-count loop, but this function is enough of a hotspot to
- * justify micro-optimization effort. The best approach seems to be
- * to use a lookup table. Note that this code assumes that
- * ALLOCSET_NUM_FREELISTS <= 17, since we only cope with two bytes of
- * the tsize value.
+ * At this point we compute ceil(log2(size >> ALLOC_MINBITS)).
+ * We can do this quickly via the number of not-all-zero bits at
+ * the right. We could simply use
+ *
+ * pg_leftmost_one_pos32((size - 1) >> ALLOC_MINBITS) + 1
+ *
+ * for this, but duplicating the logic here affords us additional
+ * optimizations:
+ *
+ * 1. The compiler can fold ALLOC_MINBITS into other constants,
+ * rather than right-shifting as a separate step.
+ * 2. In the open-coded case, we only need to cope with two
+ * bytes of the size value.
*/
+#ifdef HAVE__BUILTIN_CLZ
+ idx = 31 - __builtin_clz((uint32) size - 1) - ALLOC_MINBITS + 1;
+#else
+ unsigned int t,
+ tsize;
+
+ StaticAssertStmt(ALLOCSET_NUM_FREELISTS <= 17, "");
+
+ tsize = (size - 1) >> ALLOC_MINBITS;
t = tsize >> 8;
- idx = t ? LogTable256[t] + 8 : LogTable256[tsize];
+ idx = t ? pg_leftmost_one_pos[t] + 8 : pg_leftmost_one_pos[tsize];
+ idx += 1;
+#endif
Assert(idx < ALLOCSET_NUM_FREELISTS);
}
--
2.22.0
John Naylor <john.naylor@2ndquadrant.com> writes:
v2 had an Assert that was only correct while experimenting with
eliding right shift. Fixed in v3.
I think there must have been something wrong with your test that
said that eliminating the right shift from the non-CLZ code made
it slower. It should be an unconditional win, just as it is for
the CLZ code path. (Maybe some odd cache-line-boundary effect?)
Also, I think it's just weird to account for ALLOC_MINBITS one
way in the CLZ path and the other way in the other path.
I decided that it might be a good idea to do performance testing
in-place rather than in a standalone test program. I whipped up
the attached that just does a bunch of palloc/pfree cycles.
I got the following results on a non-cassert build (medians of
a number of tests; the times are repeatable to ~ 0.1% for me):
HEAD: 2429.431 ms
v3 CLZ: 2131.735 ms
v3 non-CLZ: 2477.835 ms
remove shift: 2266.755 ms
I didn't bother to try this on non-x86_64 architectures, as
previous testing convinces me the outcome should be about the
same.
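Tom's attached test isn't reproduced here; as a rough idea only (the function
name, context name, and loop counts below are made up), an in-place exercise
of AllocSetFreeIndex() via palloc()/pfree() might look something like this:

#include "postgres.h"

#include "utils/memutils.h"

/* hypothetical harness: hammer every freelist size class via palloc/pfree */
static void
palloc_pfree_cycles(void)
{
	MemoryContext cxt;
	MemoryContext old;
	int			i;
	Size		size;

	cxt = AllocSetContextCreate(CurrentMemoryContext,
								"bench context",
								ALLOCSET_DEFAULT_SIZES);
	old = MemoryContextSwitchTo(cxt);

	for (i = 0; i < 1000000; i++)
	{
		/* chunk sizes from 8 bytes up to 8 kB cover every freelist */
		for (size = 8; size <= 8192; size <<= 1)
		{
			void	   *p = palloc(size);

			pfree(p);
		}
	}

	MemoryContextSwitchTo(old);
	MemoryContextDelete(cxt);
}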
Hence, pushed that way, with a bit of additional cosmetic foolery:
the static assertion made more sense to me in relation to the
documented assumption that size <= ALLOC_CHUNK_LIMIT, and I
thought the comment could use some work.
regards, tom lane