our use of popcount
While reviewing the patch to speed up Gist indexes with tsvectors [1]https://commitfest.postgresql.org/32/3023/ by
smarter use of popcount, I was reminded that for hardware popcount, we do
an indirect function call to execute a single CPU instruction, one word at
a time. I went ahead and did some microbenchmarks to see how much that buys
us, and if we could do better.
For the tests below I used the attached C file compiled into a shared
library with gcc 8.4, measuring popcount on an 8kB buffer, median of 3 runs:
select drive_popcount64(1000000, 1024);
select drive_popcount(1000000, 1024);
Note: pg_popcount64() is passed one 8-byte word, and pg_popcount() is
passed a buffer, which if properly aligned allows to call the appropriate
word-at-a-time popcount function.
master -- indirect call to pg_popcount64_asm(), where available. Note
that passing a buffer is faster:
pg_popcount64()
2680ms
pg_popcount()
1640ms
0001 ignores assembly and uses direct calls to popcount64_slow(). It is
quite a bit slower, but notice that passing a buffer to pg_popcont() with
the slow implementation is not all that much slower than calling the
indirect assembly one word at a time (2680ms vs 3190ms):
pg_popcount64()
4930ms
pg_popcount()
3190ms
It's also worth noting that currently all platforms use an indirect
function call, regardless of instruction support, so the direct call here
is a small win for non-x86-64 platforms.
0002 restores indirection through a function pointer, but chooses the
pass-a-buffer function rather than the retail function, and if assembly is
available, calls inlined popcount64_asm(). This in itself is not much
faster than master (1640ms vs 1430ms):
pg_popcount()
1430ms
However, 0003 takes the advice of [2]https://danluu.com/assembly-intrinsics/ and [3]https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati and uses hand-written
unrolled assembly, and now we actually have some real improvement, over 3x
faster than master:
pg_popcount()
494ms
0004 shows how it would look to use the buffer-passing version for
bitmapsets and the visibility map. Bitmapsets would benefit from this even
on master, going by the test results above. If we went with something like
0003, bitmapsets would benefit further automatically. However, counting of
visibility map bits is a bit more tricky, since we have to mask off the
visible bits and frozen bits to count them separately. 0004 has a simple
unoptimized function to demonstrate. As I recall, the VM code did not
benefit much from the popcount infrastructure to begin with, so a small
regression might not be noticeable. If it is, it's just a SMOP to offer an
assembly version here also.
The motivation for benchmarking this was to have some concrete numbers in
mind while reviewing gist index creation. If the gist patch can benefit
from any of the above, it's worth considering, as a more holistic approach.
Since it affects other parts of the system, I wanted to share this in a new
thread first.
[1]: https://commitfest.postgresql.org/32/3023/
[2]: https://danluu.com/assembly-intrinsics/
[3]: https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati
https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v1-0001-Use-direct-function-calls-for-pg_popcount-32-64.patchapplication/octet-stream; name=v1-0001-Use-direct-function-calls-for-pg_popcount-32-64.patchDownload
From ad5b6a5555f5d2ffa5c717a72da383781a777960 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Tue, 2 Mar 2021 12:03:06 -0400
Subject: [PATCH v1 1/4] Use direct function calls for pg_popcount{32,64}().
Rather than indirect the call to different implementations, Use the
former popcount{32,64}_slow() for all cases.
---
src/include/port/pg_bitutils.h | 4 +-
src/port/pg_bitutils.c | 69 +++-------------------------------
2 files changed, 8 insertions(+), 65 deletions(-)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index f9b77ec278..2c40784830 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -208,8 +208,8 @@ pg_ceil_log2_64(uint64 num)
}
/* Count the number of one-bits in a uint32 or uint64 */
-extern int (*pg_popcount32) (uint32 word);
-extern int (*pg_popcount64) (uint64 word);
+extern int pg_popcount32 (uint32 word);
+extern int pg_popcount64 (uint64 word);
/* Count the number of one-bits in a byte array */
extern uint64 pg_popcount(const char *buf, int bytes);
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 2252021854..5dab793e49 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -116,23 +116,6 @@ const uint8 pg_number_of_ones[256] = {
#endif
#endif
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#ifdef USE_POPCNT_ASM
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount32_asm(uint32 word);
-static int pg_popcount64_asm(uint64 word);
-
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-#else
-int (*pg_popcount32) (uint32 word) = pg_popcount32_slow;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_slow;
-#endif /* USE_POPCNT_ASM */
-
#ifdef USE_POPCNT_ASM
/*
@@ -154,46 +137,6 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_asm;
- pg_popcount64 = pg_popcount64_asm;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_asm;
- pg_popcount64 = pg_popcount64_asm;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount64(word);
-}
-
/*
* pg_popcount32_asm
* Return the number of 1 bits set in word
@@ -224,11 +167,11 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
/*
- * pg_popcount32_slow
+ * pg_popcount32
* Return the number of 1 bits set in word
*/
-static int
-pg_popcount32_slow(uint32 word)
+int
+pg_popcount32(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
return __builtin_popcount(word);
@@ -246,11 +189,11 @@ pg_popcount32_slow(uint32 word)
}
/*
- * pg_popcount64_slow
+ * pg_popcount64
* Return the number of 1 bits set in word
*/
-static int
-pg_popcount64_slow(uint64 word)
+int
+pg_popcount64(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
#if defined(HAVE_LONG_INT_64)
--
2.22.0
v1-0002-Use-platform-specific-implementations-of-pg_popco.patchapplication/octet-stream; name=v1-0002-Use-platform-specific-implementations-of-pg_popco.patchDownload
From f0ce8884c0588a89dff01ba8a9e989f6284ba495 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Tue, 2 Mar 2021 12:24:49 -0400
Subject: [PATCH v1 2/4] Use platform-specific implementations of pg_popcount.
Since this function takes a buffer that's likely much longer than
the word size, we can better amortize the cost of an indirect
function call. For the asm version, use the existing
pg_popcount{32,64}_asm functions, but inline them.
---
src/include/port/pg_bitutils.h | 2 +-
src/port/pg_bitutils.c | 87 ++++++++++++++++++++++++++++++++--
2 files changed, 84 insertions(+), 5 deletions(-)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 2c40784830..708b9a6a67 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -212,7 +212,7 @@ extern int pg_popcount32 (uint32 word);
extern int pg_popcount64 (uint64 word);
/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 (*pg_popcount) (const char *buf, int bytes);
/*
* Rotate the bits of "word" to the right by n bits.
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5dab793e49..9be8b78cff 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -141,7 +141,7 @@ pg_popcount_available(void)
* pg_popcount32_asm
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount32_asm(uint32 word)
{
uint32 res;
@@ -154,7 +154,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount64_asm
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount64_asm(uint64 word)
{
uint64 res;
@@ -165,6 +165,18 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
#endif /* USE_POPCNT_ASM */
+static uint64 pg_popcount_slow(const char *buf, int bytes);
+
+#ifdef USE_POPCNT_ASM
+static bool pg_popcount_available(void);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_asm(const char *buf, int bytes);
+
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+#else
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_slow;
+#endif /* USE_POPCNT_ASM */
+
/*
* pg_popcount32
@@ -216,13 +228,30 @@ pg_popcount64(uint64 word)
#endif /* HAVE__BUILTIN_POPCOUNT */
}
+/*
+ * This function gets called on the first call to pg_popcount.
+ * It detects whether we can use the asm implementation, and replace
+ * the function pointer so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ if (pg_popcount_available())
+ pg_popcount = pg_popcount_asm;
+ else
+ pg_popcount = pg_popcount_slow;
+
+ return pg_popcount(buf, bytes);
+}
+
/*
- * pg_popcount
+ * pg_popcount_slow
* Returns the number of 1-bits in buf
*/
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -262,3 +291,53 @@ pg_popcount(const char *buf, int bytes)
return popcnt;
}
+
+#ifdef USE_POPCNT_ASM
+
+/*
+ * pg_popcount_asm
+ * Returns the number of 1-bits in buf using POPCNT
+ */
+uint64
+pg_popcount_asm(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_asm(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_asm(*words++);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
+#endif /* USE_POPCNT_ASM */
--
2.22.0
v1-0003-Manually-unroll-the-loop-with-hand-written-assemb.patchapplication/octet-stream; name=v1-0003-Manually-unroll-the-loop-with-hand-written-assemb.patchDownload
From 2a1fd5be06be64dcce6a3701c37337c2570a9661 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Tue, 2 Mar 2021 14:17:34 -0400
Subject: [PATCH v1 3/4] Manually unroll the loop with hand-written assembly.
Note: For demonstration, only done for 64-bit platforms.
---
src/port/pg_bitutils.c | 59 ++++++++++++++++++++++--------------------
1 file changed, 31 insertions(+), 28 deletions(-)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 9be8b78cff..33882fe175 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -137,32 +137,6 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
-/*
- * pg_popcount32_asm
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount32_asm(uint32 word)
-{
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-}
-
-/*
- * pg_popcount64_asm
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount64_asm(uint64 word)
-{
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-}
-
#endif /* USE_POPCNT_ASM */
static uint64 pg_popcount_slow(const char *buf, int bytes);
@@ -308,10 +282,39 @@ pg_popcount_asm(const char *buf, int bytes)
if (buf == (const char *) TYPEALIGN(8, buf))
{
const uint64 *words = (const uint64 *) buf;
+ uint64 cum[4] = {0};
+
+ /*
+ * Manually unroll the loop, since the compiler won't do it for us.
+ * This is in hand-written assembly to prevent some compilers from
+ * inserting useless instructions to work around a False Data
+ * Dependency in the POPCNT instruction.
+ *
+ * Based on code found at http://danluu.com/assembly-intrinsics/
+ */
+ while (bytes >= 4 * 8)
+ {
+__asm__ __volatile__(
+ "popcnt %4, %4 \n\t"
+ "add %4, %0 \n\t"
+ "popcnt %5, %5 \n\t"
+ "add %5, %1 \n\t"
+ "popcnt %6, %6 \n\t"
+ "add %6, %2 \n\t"
+ "popcnt %7, %7 \n\t"
+ "add %7, %3 \n\t"
+ : "+r" (cum[0]), "+r" (cum[1]), "+r" (cum[2]), "+r" (cum[3])
+ : "r" (words[0]), "r" (words[1]), "r" (words[2]), "r" (words[3]));
+
+ bytes -= 4 * 8;
+ words += 4;
+ }
+
+ popcnt += cum[0] + cum[1] + cum[2] + cum[3];
while (bytes >= 8)
{
- popcnt += pg_popcount64_asm(*words++);
+ popcnt += pg_popcount64(*words++);
bytes -= 8;
}
@@ -325,7 +328,7 @@ pg_popcount_asm(const char *buf, int bytes)
while (bytes >= 4)
{
- popcnt += pg_popcount32_asm(*words++);
+ popcnt += pg_popcount32(*words++);
bytes -= 4;
}
--
2.22.0
v1-0004-Use-pg_popcount-rather-than-pg_popcount-32-64-whe.patchapplication/octet-stream; name=v1-0004-Use-pg_popcount-rather-than-pg_popcount-32-64-whe.patchDownload
From 5fdee6317bbd156a2ac48739562be109a618ccfe Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@2ndquadrant.com>
Date: Wed, 3 Mar 2021 12:12:51 -0400
Subject: [PATCH v1 4/4] Use pg_popcount(), rather than pg_popcount{32,64}()
where possible.
Passing a buffer is faster than passing individual words.
For the visibility map, this requires inventing a new function,
which is for demonstration and not optimized.
---
src/backend/access/heap/visibilitymap.c | 12 +++---------
src/backend/nodes/bitmapset.c | 23 ++++-------------------
src/port/pg_bitutils.c | 22 ++++++++++++++++++++++
3 files changed, 29 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d8..8747e69a2c 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -409,17 +409,11 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
"unsupported MAPSIZE");
if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
+ nvisible = pg_popcount_mask64(map, MAPSIZE, VISIBLE_MASK64);
else
{
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
+ nvisible = pg_popcount_mask64(map, MAPSIZE, VISIBLE_MASK64);
+ nfrozen = pg_popcount_mask64(map, MAPSIZE, FROZEN_MASK64);
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 649478b0d4..5368c72255 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -452,7 +452,6 @@ bms_is_member(int x, const Bitmapset *a)
int
bms_member_index(Bitmapset *a, int x)
{
- int i;
int bitnum;
int wordnum;
int result = 0;
@@ -466,14 +465,8 @@ bms_member_index(Bitmapset *a, int x)
bitnum = BITNUM(x);
/* count bits in preceding words */
- for (i = 0; i < wordnum; i++)
- {
- bitmapword w = a->words[i];
-
- /* No need to count the bits in a zero word */
- if (w != 0)
- result += bmw_popcount(w);
- }
+ result = pg_popcount((const char *) a->words,
+ wordnum * BITS_PER_BITMAPWORD / BITS_PER_BYTE);
/*
* Now add bits of the last word, but only those before the item. We can
@@ -645,22 +638,14 @@ bms_get_singleton_member(const Bitmapset *a, int *member)
int
bms_num_members(const Bitmapset *a)
{
- int result = 0;
int nwords;
- int wordnum;
if (a == NULL)
return 0;
nwords = a->nwords;
- for (wordnum = 0; wordnum < nwords; wordnum++)
- {
- bitmapword w = a->words[wordnum];
- /* No need to count the bits in a zero word */
- if (w != 0)
- result += bmw_popcount(w);
- }
- return result;
+ return pg_popcount((const char *) a->words,
+ nwords * BITS_PER_BITMAPWORD / BITS_PER_BYTE);
}
/*
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 33882fe175..9e3251adc8 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -344,3 +344,25 @@ __asm__ __volatile__(
}
#endif /* USE_POPCNT_ASM */
+
+/*
+ * pg_popcount_mask64
+ * Returns the number of 1-bits in buf after applying the mask
+ *
+ * Note: We currently only need a 64-bit version for the visibility map.
+ */
+uint64
+pg_popcount_mask64(const uint64 *buf, int bytes, uint64 mask)
+{
+ uint64 popcnt = 0;
+
+ Assert(bytes % sizeof(uint64) == 0);
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64(*words++ & mask);
+ bytes -= 8;
+ }
+
+ return popcnt;
+}
--
2.22.0