Popcount optimization for the slow-path lookups

Started by Andrew Pogrebnoi, about 1 month ago, 6 messages
#1 Andrew Pogrebnoi
andrew.pogrebnoi@percona.com
1 attachment(s)

Hello hackers,

I want to propose an optimization for pg_popcount32_slow() and
pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
branchless. It shows a speedup of around 58% for uint64 and 35% for uint32
words compared to the current, looped version on x86. The improvement is much
more significant on Arm64 (Apple M1): ~2.6x for uint64 and ~2.25x for uint32
words.
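
For reference, this is what the change looks like for the 32-bit case (taken
from the attached patch; the 64-bit version does the same over eight bytes):
```
/* before: loop with a data-dependent branch per byte */
int			result = 0;

while (word != 0)
{
	result += pg_number_of_ones[word & 255];
	word >>= 8;
}
return result;

/* after: four unconditional table lookups */
return pg_number_of_ones[word >> 24] +
	   pg_number_of_ones[(word >> 16) & 0xFF] +
	   pg_number_of_ones[(word >> 8) & 0xFF] +
	   pg_number_of_ones[word & 0xFF];
```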

I probably have to guard the lookup in pg_popcount64_slow() with `#if
SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same
function. What do you think?

For benchmarks, I used functions with copies of the lookup loop from
pg_popcount32/64_slow(), measuring them against branchless counterparts.
Then in C++ I pre-generated two static vectors with random uint64's and
uint32's, each of 5M size, and ran the lookups using google/benchmark [0]. I've
run it on M1 and x86 Macs.
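
In plain C, the comparison boils down to roughly the following sketch
(simplified: clock()-based timing instead of google/benchmark, and a locally
built table standing in for pg_number_of_ones[]):
```
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NWORDS 5000000

static uint8_t number_of_ones[256];	/* stand-in for pg_number_of_ones[] */

/* current looped slow path */
static int
popcount64_looped(uint64_t word)
{
	int			result = 0;

	while (word != 0)
	{
		result += number_of_ones[word & 255];
		word >>= 8;
	}
	return result;
}

/* proposed branchless lookup */
static int
popcount64_branchless(uint64_t word)
{
	return number_of_ones[word >> 56] +
		number_of_ones[(word >> 48) & 0xFF] +
		number_of_ones[(word >> 40) & 0xFF] +
		number_of_ones[(word >> 32) & 0xFF] +
		number_of_ones[(word >> 24) & 0xFF] +
		number_of_ones[(word >> 16) & 0xFF] +
		number_of_ones[(word >> 8) & 0xFF] +
		number_of_ones[word & 0xFF];
}

int
main(void)
{
	static uint64_t words[NWORDS];
	uint64_t	sum = 0;
	clock_t		start;

	/* build the 256-entry table, then fill the input with random-ish words */
	for (int i = 0; i < 256; i++)
		number_of_ones[i] = (i & 1) + number_of_ones[i >> 1];
	for (int i = 0; i < NWORDS; i++)
		words[i] = ((uint64_t) rand() << 32) ^ (uint64_t) rand();

	start = clock();
	for (int i = 0; i < NWORDS; i++)
		sum += popcount64_looped(words[i]);
	printf("looped:     %.3f s (sum %llu)\n",
		   (double) (clock() - start) / CLOCKS_PER_SEC,
		   (unsigned long long) sum);

	sum = 0;
	start = clock();
	for (int i = 0; i < NWORDS; i++)
		sum += popcount64_branchless(words[i]);
	printf("branchless: %.3f s (sum %llu)\n",
		   (double) (clock() - start) / CLOCKS_PER_SEC,
		   (unsigned long long) sum);
	return 0;
}
```
The uint32 case is the same idea with four lookups instead of eight.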

Results:

X86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src
-lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 17.2 ns 17.2 ns 39247568
Popcnt64_PG 27.2 ns 27.0 ns 25415636
Popcnt32_BranchlessLookup 14.5 ns 14.4 ns 48441566
Popcnt32_PG 19.5 ns 19.4 ns 36125676
```

Apple M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include
-L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark &&
./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 6.52 ns 6.52 ns 113974186
Popcnt64_PG 17.1 ns 17.1 ns 43015334
Popcnt32_BranchlessLookup 4.34 ns 4.34 ns 145674483
Popcnt32_PG 9.79 ns 9.79 ns 75893374
```

I also tried the parallel summing of adjacent bits (see Hacker's Delight
[1], Chapter 5).
This is the code for uint64:
```
#define POPCOUNT_M0 0x5555555555555555 // 01010101 ...
#define POPCOUNT_M1 0x3333333333333333 // 00110011 ...
#define POPCOUNT_M2 0x0F0F0F0F0F0F0F0F // 00001111 ...

static inline int
pg_popcount_word64(uint64 word)
{
	word = word - ((word >> 1) & POPCOUNT_M0);
	word = ((word >> 2) & POPCOUNT_M1) + (word & POPCOUNT_M1);
	word = ((word >> 4) + word) & POPCOUNT_M2;
	word += word >> 8;
	word += word >> 16;
	word += word >> 32; // this step is omitted for the uint32 version (see below)

	return word & 0x7F; // the count can reach 64, so mask 7 bits rather than 6
}
```
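
For comparison, the uint32 counterpart (the same masks truncated to 32 bits and
without the final 32-bit shift) would look like this:
```
static inline int
pg_popcount_word32(uint32 word)
{
	word = word - ((word >> 1) & 0x55555555);
	word = ((word >> 2) & 0x33333333) + (word & 0x33333333);
	word = ((word >> 4) + word) & 0x0F0F0F0F;
	word += word >> 8;
	word += word >> 16;

	return word & 0x3F; // the count can reach at most 32, so 6 bits suffice
}
```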

However, it doesn't show any improvement over the branchless lookup on M1,
and only a slight gain for uint64 on x86.

The same tests with the summing of adjacent bits:
x86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src
-lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 17.2 ns 17.2 ns 39247568
Popcnt64_AdjacentBits 16.8 ns 16.7 ns 41481236
Popcnt64_PG 27.2 ns 27.0 ns 25415636
Popcnt32_BranchlessLookup 14.5 ns 14.4 ns 48441566
Popcnt32_AdjacentBits 15.5 ns 15.4 ns 44454323
Popcnt32_PG 19.5 ns 19.4 ns 36125676
```

M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include
-L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark &&
./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 6.52 ns 6.52 ns 113974186
Popcnt64_AdjacentBits 8.05 ns 8.05 ns 86988490
Popcnt64_PG 17.1 ns 17.1 ns 43015334
Popcnt32_BranchlessLookup 4.34 ns 4.34 ns 145674483
Popcnt32_AdjacentBits 7.27 ns 7.27 ns 96050714
Popcnt32_PG 9.79 ns 9.79 ns 75893374
```

[0]: https://github.com/google/benchmark
[1]: https://dl.acm.org/doi/book/10.5555/2462741

-----
Cheers,
Andy

Attachments:

0001-Optimize-pg_popcount32-64_slow-lookups.patch (application/octet-stream)
From a8d1105d90ce4d1e04626d7680f7338f04704c14 Mon Sep 17 00:00:00 2001
From: Andrew Pogrebnoi <absourd.noise@gmail.com>
Date: Fri, 5 Dec 2025 10:45:52 +0200
Subject: [PATCH] Optimize pg_popcount32/64_slow() lookups

This commit makes the lookups into pg_number_of_ones[] branchless, which
optimizes the slow path of the per-word popcount functions.
---
 src/port/pg_bitutils.c     | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5677525693d..4c482c5f218 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -350,15 +350,10 @@ pg_popcount32_slow(uint32 word)
 #ifdef HAVE__BUILTIN_POPCOUNT
 	return __builtin_popcount(word);
 #else							/* !HAVE__BUILTIN_POPCOUNT */
-	int			result = 0;
-
-	while (word != 0)
-	{
-		result += pg_number_of_ones[word & 255];
-		word >>= 8;
-	}
-
-	return result;
+	return pg_number_of_ones[word >> 24] +
+		   pg_number_of_ones[(word >> 16) & 0xFF] +
+		   pg_number_of_ones[(word >> 8) & 0xFF] +
+		   pg_number_of_ones[word & 0xFF];
 #endif							/* HAVE__BUILTIN_POPCOUNT */
 }
 
@@ -378,15 +373,14 @@ pg_popcount64_slow(uint64 word)
 #error "cannot find integer of the same size as uint64_t"
 #endif
 #else							/* !HAVE__BUILTIN_POPCOUNT */
-	int			result = 0;
-
-	while (word != 0)
-	{
-		result += pg_number_of_ones[word & 255];
-		word >>= 8;
-	}
-
-	return result;
+	return pg_number_of_ones[word >> 56] +
+		   pg_number_of_ones[(word >> 48) & 0xFF] +
+		   pg_number_of_ones[(word >> 40) & 0xFF] +
+		   pg_number_of_ones[(word >> 32) & 0xFF] +
+		   pg_number_of_ones[(word >> 24) & 0xFF] +
+		   pg_number_of_ones[(word >> 16) & 0xFF] +
+		   pg_number_of_ones[(word >> 8) & 0xFF] +
+		   pg_number_of_ones[word & 0xFF];
 #endif							/* HAVE__BUILTIN_POPCOUNT */
 }
 
-- 
2.43.0

#2 David Geier
geidav.pg@gmail.com
In reply to: Andrew Pogrebnoi (#1)
Re: Popcount optimization for the slow-path lookups

Hi Andy!

On 05.12.2025 14:07, Andrew Pogrebnoi wrote:

Hello hackers,

I want to propose an optimization for pg_popcount32_slow() and
pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
branchless. It shows a speedup of around 58% for uint64 and 35% for uint32
words compared to the current, looped version on x86. The improvement is much
more significant on Arm64 (Apple M1): ~2.6x for uint64 and ~2.25x for uint32
words.

I probably have to guard the lookup in pg_popcount64_slow() with `#if
SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same
function. What do you think?

For benchmarks, I used functions with copies of the lookup loop from
pg_popcount32/64_slow(), measuring them against branchless counterparts.
Then in C++ I pre-generated two static vectors with random uint64's and
uint32's, each of 5M size, and ran the lookups using google/benchmark [0]. I've
run it on M1 and x86 Macs.

I would like to test if I can reproduce your results. Could you share
your test program?

You also don't specify an optimization level. That means the default
level -O0 is used. Does it make a difference if you run e.g. with -O3?

--
David Geier

#3 Andrew Pogrebnoi
andrew.pogrebnoi@percona.com
In reply to: David Geier (#2)
Re: Popcount optimization for the slow-path lookups

Hi David,

Thanks for looking at it!

I would like to test if I can reproduce your results. Could you share
your test program?

Here you go:
https://gist.github.com/dAdAbird/1480ff15764f5a6301174806d8512a3a

You also don't specify an optimization level. That means the default
level -O0 is used. Does it make a difference if you run e.g. with -O3?

Yes, good point. I got carried away with experiments and totally forgot
about that.

I've re-run the tests with -O3, and it seems like the summing of adjacent bits
might make sense for uint64:

X86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src
-lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark
2025-12-05T16:22:43+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 2.67, 4.72, 2.90
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 5.92 ns 5.90 ns 113871130
Popcnt64_AdjacentBits 5.30 ns 5.27 ns 115084258
Popcnt64_PG 8.34 ns 8.32 ns 80622869
Popcnt32_BranchlessLookup 3.62 ns 3.61 ns 197816110
Popcnt32_AdjacentBits 4.56 ns 4.55 ns 154932383
Popcnt32_PG 5.32 ns 5.31 ns 121845083
```

M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include
-L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark &&
./mybenchmark
2025-12-05T14:29:18+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.01, 0.01, 0.00
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 2.13 ns 2.13 ns 342101937
Popcnt64_AdjacentBits 1.89 ns 1.89 ns 370316571
Popcnt64_PG 3.83 ns 3.83 ns 190990088
Popcnt32_BranchlessLookup 1.19 ns 1.19 ns 591797267
Popcnt32_AdjacentBits 1.73 ns 1.73 ns 409381266
Popcnt32_PG 2.01 ns 2.01 ns 344969625
```

---
Cheers,
Andy

#4 Nathan Bossart
nathandbossart@gmail.com
In reply to: Andrew Pogrebnoi (#1)
Re: Popcount optimization for the slow-path lookups

On Fri, Dec 05, 2025 at 03:07:07PM +0200, Andrew Pogrebnoi wrote:

I want to propose an optimization for pg_popcount32_slow() and
pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
branchless. It shows a speedup of around 58% for uint64 and 35% for uint32
words compared to the current, looped version on x86. The improvement is much
more significant on Arm64 (Apple M1): ~2.6x for uint64 and ~2.25x for uint32
words.

I probably have to guard the lookup in pg_popcount64_slow() with `#if
SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same
function. What do you think?

For benchmarks, I used functions with copies of the lookup loop from
pg_popcount32/64_slow(), measuring them against branchless counterparts.
Then in C++ I pre-generated two static vectors with random uint64's and
uint32's, each of 5M size, and ran the lookups using google/benchmark [0]. I've
run it on M1 and x86 Macs.

I don't think the proposed improvements are relevant for either of the
machines you used for your benchmarks. For x86, we've optimized our
popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
to use Neon or SVE. And for other systems, we still try to use
__builtin_popcount() and friends in the fallback paths, which IIUC are
available on both gcc and clang (and maybe elsewhere). IMHO we need to run
the benchmarks on a compiler/architecture combination where it would
actually be used in practice.
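
Roughly, the selection works like this (a simplified illustration of the
runtime-dispatch idea only; this is not the actual pg_bitutils.c code, and the
names are made up):
```
#include <stdint.h>

/* stand-in for a CPUID-style check for the POPCNT instruction */
static int
cpu_has_popcnt(void)
{
	return 0;
}

static int	popcount64_choose(uint64_t word);
static int	popcount64_fast(uint64_t word);
static int	popcount64_slow(uint64_t word);

/* starts out pointing at the chooser; the first call swaps in the real one */
static int	(*popcount64) (uint64_t word) = popcount64_choose;

static int
popcount64_choose(uint64_t word)
{
	popcount64 = cpu_has_popcnt() ? popcount64_fast : popcount64_slow;
	return popcount64(word);
}

static int
popcount64_fast(uint64_t word)
{
	/* stand-in for the hardware-accelerated implementation */
	return __builtin_popcountll(word);
}

static int
popcount64_slow(uint64_t word)
{
	int			result = 0;

	while (word != 0)
	{
		result += word & 1;
		word >>= 1;
	}
	return result;
}
```
The point is that the slow functions are only ever reached when none of the
faster options apply.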

--
nathan

#5 Andrew Pogrebnoi
andrew.pogrebnoi@percona.com
In reply to: Nathan Bossart (#4)
Re: Popcount optimization for the slow-path lookups

On Fri, Dec 5, 2025 at 5:40 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

I don't think the proposed improvements are relevant for either of the
machines you used for your benchmarks. For x86, we've optimized our
popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
to use Neon or SVE. And for other systems, we still try to use
__builtin_popcount() and friends in the fallback paths, which IIUC are
available on both gcc and clang (and maybe elsewhere). IMHO we need to run
the benchmarks on a compiler/architecture combination where it would
actually be used in practice.

Yes, I saw that the code is on a rather obscure path, but those machines
were my only options for quick benchmarks. I reasoned that since the code
path still exists, eliminating the branching there would most probably be
beneficial anyway. But you are right, we need to test it on the target
architectures/compilers. I'll try to do that.

---
Cheers,
Andy

#6 John Naylor
johncnaylorls@gmail.com
In reply to: Nathan Bossart (#4)
Re: Popcount optimization for the slow-path lookups

On Fri, Dec 5, 2025 at 10:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

I don't think the proposed improvements are relevant for either of the
machines you used for your benchmarks. For x86, we've optimized our
popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
to use Neon or SVE. And for other systems, we still try to use
__builtin_popcount() and friends in the fallback paths, which IIUC are
available on both gcc and clang (and maybe elsewhere). IMHO we need to run
the benchmarks on a compiler/architecture combination where it would
actually be used in practice.

Yeah, if we did anything here, I'd rather arrange things so that
architectures that have unconditional hardware support can inline it
at compile time. I believe ppc64le and aarch64 can do that
unconditionally. For x86, we might be able to detect some symbol
defined by the compiler to do the same thing for OSes that require
such support.
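
Something along these lines, just to illustrate the idea (the macro name is
made up and the exact conditions would need checking; __POPCNT__ is what
gcc/clang define when the target already has the x86 POPCNT instruction, and
the aarch64/ppc64 tests are only rough stand-ins):
```
#include <stdint.h>

/*
 * Sketch only: if the compiler is already targeting a CPU with a popcount
 * instruction, let callers inline the builtin directly and skip the
 * runtime dispatch.
 */
#if defined(__POPCNT__) || defined(__aarch64__) || defined(__powerpc64__)
#define PG_POPCOUNT_INLINE 1

static inline int
pg_popcount64_direct(uint64_t word)
{
	return __builtin_popcountll(word);
}
#endif
```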

--
John Naylor
Amazon Web Services