refactor architecture-specific popcount code

Started by Nathan Bossart4 months ago40 messageshackers

nathandbossart@gmail.com

4 months ago

Right now, the organization of this code is weird. All AArch64-specific
implementations live in an AArch64-specific file, the AVX-512
implementations live in their own file, and the architecture-agnostic and
SSE4.2 implementations live in pg_bitutils.c. The attached patches move
the SSE4.2 implementations to the AVX-512 file (which is renamed
appropriately), and they update some function names to be more descriptive,
i.e., "fast" is replaced with "sse42" and "slow" is replaced with
"generic".

I probably should've done this a while ago...

--
nathan

John Naylor

john.naylor@enterprisedb.com

4 months ago

In reply to: Nathan Bossart (#1)

Re: refactor architecture-specific popcount code

On Thu, Jan 15, 2026 at 3:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Right now, the organization of this code is weird. All AArch64-specific
implementations live in an AArch64-specific file, the AVX-512
implementations live in their own file, and the architecture-agnostic and
SSE4.2 implementations live in pg_bitutils.c. The attached patches move
the SSE4.2 implementations to the AVX-512 file (which is renamed
appropriately), and they update some function names to be more descriptive,
i.e., "fast" is replaced with "sse42" and "slow" is replaced with
"generic".

Thanks for taking on some technical debt!

0001

--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_x86_64.c

Can we get away with just "x86" for brevity? We generally don't target
32-bit CPUs for this kind of work, so there's no chance of confusion.

0002

```
-#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_X86_64

#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
#include <cpuid.h>
#endif
```

With the above in the x86 .c file, I wonder we can get rid of this
stanza and the "try" symbol and gate only on HAVE_X86_64_POPCNTQ:

#ifdef HAVE_X86_64_POPCNTQ
#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
#define TRY_POPCNT_X86_64 1
#endif
#endif

If we have to be cautious, we could just turn the #error on no CPUID
symbol into "return false".

0003

s/fast/sse42/:

Seems okay in this file, but this isn't the best name, either. Maybe a
comment to head off future "corrections", something like:
"Technically, POPCNT is not part of SSE 4.2, and is not even a vector
operation, but many compilers emit the popcnt instruction with
-msse4.2 anyway."

s/slow/generic/:

I'm ambivalent about this. The "slow" designation is flat-out wrong
since at least Power and aarch64 can emit a single instruction here
without prodding the compiler. On the other hand, "generic" seems
wrong too, since e.g. pg_popcount64_slow() has three configure symbols
and two compiler builtins. :-D

A possible future project would be to have a truly generic simple
fallback in pure C and put all the fancy stuff in the header for
architectures that have unconditional hardware support. It would make
more sense to revisit the name then.

--
John Naylor
Amazon Web Services

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

4 months ago

In reply to: John Naylor (#2)

Re: refactor architecture-specific popcount code

On 15/01/2026 11:07, John Naylor wrote:

0003

s/fast/sse42/:

Seems okay in this file, but this isn't the best name, either. Maybe a
comment to head off future "corrections", something like:
"Technically, POPCNT is not part of SSE 4.2, and is not even a vector
operation, but many compilers emit the popcnt instruction with
-msse4.2 anyway."

s/slow/generic/:

I'm ambivalent about this. The "slow" designation is flat-out wrong
since at least Power and aarch64 can emit a single instruction here
without prodding the compiler. On the other hand, "generic" seems
wrong too, since e.g. pg_popcount64_slow() has three configure symbols
and two compiler builtins. :-D

"fallback", or "portable" ?

A possible future project would be to have a truly generic simple
fallback in pure C and put all the fancy stuff in the header for
architectures that have unconditional hardware support. It would make
more sense to revisit the name then.

Yeah, I noticed that on x86_64, pg_popcount_optimized is always a
function pointer with runtime check, even if you use compiler flags to
target a CPU where the special instructions are available unconditionally.

- Heikki

Nathan Bossart

nathandbossart@gmail.com

4 months ago

In reply to: John Naylor (#2)

Re: refactor architecture-specific popcount code

On Thu, Jan 15, 2026 at 04:07:51PM +0700, John Naylor wrote:

Thanks for taking on some technical debt!

Thanks for reviewing.

--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_x86_64.c
Can we get away with just "x86" for brevity? We generally don't target
32-bit CPUs for this kind of work, so there's no chance of confusion.

WFM.

```
-#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_X86_64
#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
#include <cpuid.h>
#endif
```

With the above in the x86 .c file, I wonder we can get rid of this
stanza and the "try" symbol and gate only on HAVE_X86_64_POPCNTQ:

#ifdef HAVE_X86_64_POPCNTQ
#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
#define TRY_POPCNT_X86_64 1
#endif
#endif

If we have to be cautious, we could just turn the #error on no CPUID
symbol into "return false".

Yeah, the CPUID macro checks do seem overly cautious to me, especially
since we've just #error'd when the CPUID intrinsics are missing in
pg_crc32c_sse42_choose.c since 2015. That seems to suggest that nobody is
trying to build Postgres with a compiler that knows about SSE4.2/POPCNT but
not CPUID. For reference, CPUID was introduced in 1993.

I bet we could also convert this bit into a configuration-time check:

#if defined(_MSC_VER) && defined(_M_AMD64)
#define HAVE_X86_64_POPCNTQ
#endif

s/fast/sse42/:

Seems okay in this file, but this isn't the best name, either. Maybe a
comment to head off future "corrections", something like:
"Technically, POPCNT is not part of SSE 4.2, and is not even a vector
operation, but many compilers emit the popcnt instruction with
-msse4.2 anyway."

Makes sense.

--
nathan

Nathan Bossart

nathandbossart@gmail.com

4 months ago

In reply to: Heikki Linnakangas (#3)

Re: refactor architecture-specific popcount code

On Thu, Jan 15, 2026 at 11:42:14AM +0200, Heikki Linnakangas wrote:

On 15/01/2026 11:07, John Naylor wrote:

s/slow/generic/:

I'm ambivalent about this. The "slow" designation is flat-out wrong
since at least Power and aarch64 can emit a single instruction here
without prodding the compiler. On the other hand, "generic" seems
wrong too, since e.g. pg_popcount64_slow() has three configure symbols
and two compiler builtins. :-D

"fallback", or "portable" ?

I've no strong opinions, but "portable" seems reasonable to me.

Yeah, I noticed that on x86_64, pg_popcount_optimized is always a function
pointer with runtime check, even if you use compiler flags to target a CPU
where the special instructions are available unconditionally.

I wonder how close we are to being able to just require SSE4.2/POPCNT for
x86-64 builds. I suppose there's always a chance that someone will try to
run Postgres 19 on a CPU from the aughts... In any case, avoiding the
function pointer when possible seems like a good follow-up.

--
nathan

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

3 months ago

In reply to: Nathan Bossart (#5)

Re: refactor architecture-specific popcount code

On 15/01/2026 18:08, Nathan Bossart wrote:

On Thu, Jan 15, 2026 at 11:42:14AM +0200, Heikki Linnakangas wrote:

Yeah, I noticed that on x86_64, pg_popcount_optimized is always a function
pointer with runtime check, even if you use compiler flags to target a CPU
where the special instructions are available unconditionally.

I wonder how close we are to being able to just require SSE4.2/POPCNT for
x86-64 builds. I suppose there's always a chance that someone will try to
run Postgres 19 on a CPU from the aughts... In any case, avoiding the
function pointer when possible seems like a good follow-up.

It's not really our decision. Packagers choose which architecture to
target and which compiler options to build with. We ought to just
respect those choices.

- Heikki

Nathan Bossart

nathandbossart@gmail.com

3 months ago

In reply to: Heikki Linnakangas (#6)

Re: refactor architecture-specific popcount code

Here is a new patch set. Notably, I've added a 0004 that does the
following:

* Removes TRY_POPCNT_X86_64. We now assume that the required CPUID
intrinsics are available, as we have long done in some of the CRC-32C code.

* Moves the MSVC check for HAVE_X86_64_POPCNTQ to configuration-time. This
way, we set it for all relevant platforms in one place.

* Moves the #defines for USE_SSE2 and USE_NEON to c.h so that they can be
used elsewhere without simd.h. Consequently, we can remove POPCNT_AARCH64.

* Moves the #includes for pg_bitutils.h to below the system headers in
pg_popcount_{aarch64,x86}.c (since we no longer depend on macros defined in
pg_bitutils.h).

--
nathan

John Naylor

john.naylor@enterprisedb.com

3 months ago

In reply to: Nathan Bossart (#7)

Re: refactor architecture-specific popcount code

On Fri, Jan 16, 2026 at 2:07 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Here is a new patch set. Notably, I've added a 0004 that does the
following:

* Removes TRY_POPCNT_X86_64. We now assume that the required CPUID
intrinsics are available, as we have long done in some of the CRC-32C code.

* Moves the MSVC check for HAVE_X86_64_POPCNTQ to configuration-time. This
way, we set it for all relevant platforms in one place.

LGTM.

* Moves the #defines for USE_SSE2 and USE_NEON to c.h so that they can be
used elsewhere without simd.h. Consequently, we can remove POPCNT_AARCH64.

Seems reasonable.

* Moves the #includes for pg_bitutils.h to below the system headers in
pg_popcount_{aarch64,x86}.c (since we no longer depend on macros defined in
pg_bitutils.h).

Good.

v2-0003

+ * XXX Technically, POPCNT is not part of SSE4.2, and isn't even a vector
+ * operation, but in practice this is close enough, and "sse42" seems easier to
+ * follow than "popcnt" for these names.

Quibble -- XXX is a bit loud for a side note.

On Thu, Jan 15, 2026 at 11:08 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Thu, Jan 15, 2026 at 11:42:14AM +0200, Heikki Linnakangas wrote:

"fallback", or "portable" ?

I've no strong opinions, but "portable" seems reasonable to me.

I have a mild preference for "fallback" since it's a noun and we
already have comments in src/(include/)port referring to fallbacks. No
objection to "portable", though.

--
John Naylor
Amazon Web Services

Nathan Bossart

nathandbossart@gmail.com

3 months ago

In reply to: John Naylor (#8)

Re: refactor architecture-specific popcount code

Committed, thanks for reviewing.

--
nathan

#10

John Naylor

john.naylor@enterprisedb.com

3 months ago

In reply to: Nathan Bossart (#9)

Re: refactor architecture-specific popcount code

On Thu, Jan 22, 2026 at 3:23 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Committed, thanks for reviewing.

Sure. Now that that's in place, I wanted to brainstorm more
refactoring/rationalization ideas that seem on-topic for the thread
but have less clear payoff:

1) Nowadays, the only global call sites of the word-sized functions
are select_best_grantor() and in bitmapsets. The latter calls the
word-sized functions in a loop (could be just one word). It may be
more efficient to calculate the size in bytes and call pg_popcount().
Then we could get rid of all the pointer indirection for the
word-sized functions.

2) The x86 byte buffer variants expend a lot of effort to detect
whether the buffer is aligned on both 64- and 32-bit platforms, with
an optimized path for each. At least 64-bit doesn't care about
alignment, and 32-bit doesn't warrant anything fancier than pure C.
Simultaneously, the aarch64 equivalent doesn't seem to take care about
alignment. (I think Nathan mentioned he didn't see a difference during
testing, but I wonder how universal that is).

3) There is repeated code for the <8 bytes case, and the tail of the
"optimized" functions. I'm also not sure why the small case is inlined
everywhere.

--
John Naylor
Amazon Web Services

#11

Nathan Bossart

nathandbossart@gmail.com

3 months ago

In reply to: John Naylor (#10)

Re: refactor architecture-specific popcount code

On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote:

1) Nowadays, the only global call sites of the word-sized functions
are select_best_grantor() and in bitmapsets. The latter calls the
word-sized functions in a loop (could be just one word). It may be
more efficient to calculate the size in bytes and call pg_popcount().

Yeah, these seem like obvious places to use pg_popcount(). Note that
bms_member_index() does a final popcount on a masked version of the last
word. We could swap that with pg_popcount(), too, but it might be slower
than just calling the word-sized function. However, it could be hard to
tell the difference, as we'd be trading a function or function pointer call
with an inlined loop over pg_number_of_ones. And even if it is slower, I'm
not sure it matters all that much in the grand scheme of things.

In any case, 0001 gets the easy ones out of the way.

Then we could get rid of all the pointer indirection for the
word-sized functions.

Do you mean that we'd just keep the portable ones around? I see some code
in pgvector that might be negatively impacted by that, but if I understand
correctly it would require an unusual setup.

2) The x86 byte buffer variants expend a lot of effort to detect
whether the buffer is aligned on both 64- and 32-bit platforms, with
an optimized path for each. At least 64-bit doesn't care about
alignment, and 32-bit doesn't warrant anything fancier than pure C.
Simultaneously, the aarch64 equivalent doesn't seem to take care about
alignment. (I think Nathan mentioned he didn't see a difference during
testing, but I wonder how universal that is).

0002 makes these changes for pg_popcount_sse42() and
pg_popcount_masked_sse42(). It does seem strange to prefer a loop over
pg_number_of_ones instead of using POPCNTQ when unaligned, but perhaps it's
worth testing. I do recall the alignment stuff in the AVX-512 code showing
benefits in tests because it avoids double-load overhead, so I think we
should keep that for now.

3) There is repeated code for the <8 bytes case, and the tail of the
"optimized" functions. I'm also not sure why the small case is inlined
everywhere.

This is intended to help avoid function call and SIMD instruction overhead
when it doesn't make sense to take it. I recall this showing a rather big
difference in benchmarks when we were working on the AVX-512 versions.
Regarding the duplicated code, sure, we could add some static inline
functions or something. I think the only reason I haven't done so is
because it's ~2 lines of code.

--
nathan

#12

Nathan Bossart

nathandbossart@gmail.com

3 months ago

In reply to: Nathan Bossart (#11)

Re: refactor architecture-specific popcount code

On Thu, Jan 22, 2026 at 11:50:38AM -0600, Nathan Bossart wrote:

On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote:

1) Nowadays, the only global call sites of the word-sized functions
are select_best_grantor() and in bitmapsets. The latter calls the
word-sized functions in a loop (could be just one word). It may be
more efficient to calculate the size in bytes and call pg_popcount().

Yeah, these seem like obvious places to use pg_popcount(). Note that
bms_member_index() does a final popcount on a masked version of the last
word. We could swap that with pg_popcount(), too, but it might be slower
than just calling the word-sized function. However, it could be hard to
tell the difference, as we'd be trading a function or function pointer call
with an inlined loop over pg_number_of_ones. And even if it is slower, I'm
not sure it matters all that much in the grand scheme of things.

I added a 0003 that swaps that final popcount with pg_popcount().

Then we could get rid of all the pointer indirection for the
word-sized functions.

Do you mean that we'd just keep the portable ones around? I see some code
in pgvector that might be negatively impacted by that, but if I understand
correctly it would require an unusual setup.

I added a 0004 that removes the architecture-specific word-sized functions.

--
nathan

#13

John Naylor

john.naylor@enterprisedb.com

3 months ago

In reply to: Nathan Bossart (#12)

Re: refactor architecture-specific popcount code

On Mon, Jan 26, 2026 at 10:41 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Thu, Jan 22, 2026 at 11:50:38AM -0600, Nathan Bossart wrote:

On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote:

1) Nowadays, the only global call sites of the word-sized functions
are select_best_grantor() and in bitmapsets. The latter calls the
word-sized functions in a loop (could be just one word). It may be
more efficient to calculate the size in bytes and call pg_popcount().

Yeah, these seem like obvious places to use pg_popcount(). Note that
bms_member_index() does a final popcount on a masked version of the last
word. We could swap that with pg_popcount(), too, but it might be slower
than just calling the word-sized function. However, it could be hard to
tell the difference, as we'd be trading a function or function pointer call
with an inlined loop over pg_number_of_ones. And even if it is slower, I'm
not sure it matters all that much in the grand scheme of things.

I added a 0003 that swaps that final popcount with pg_popcount().

I'm not sure either if this part matters much, but it makes more sense
to me to continue using single word functions for that last part.
Since they have very few call sites anymore, we can make them inline
without bloating the binary on x86.

Then we could get rid of all the pointer indirection for the
word-sized functions.

Do you mean that we'd just keep the portable ones around? I see some code
in pgvector that might be negatively impacted by that, but if I understand
correctly it would require an unusual setup.

I added a 0004 that removes the architecture-specific word-sized functions.

Right, just the portable ones. Here, too, inlining them everywhere
would mitigate any impact. Further,

-int
-pg_popcount64(uint64 word)
+static inline int
+pg_popcount64_neon(uint64 word)

...if they were inlined from the header, I think we wouldn't need this
separate neon function in this file at all. Currently, we rely on
__builtin_popcountl for the portable function outside this file. We
could either keep using that or switch to neon if there's a
portability difference.

0001, 2, and 4 look good to me otherwise. Looking at commit
a8c09daa8bb1, there are some planner benchmarks in the discussion
thread that stress bitmapsets, but I'm not sure if they hit the paths
affected by 0001 at all.

--
John Naylor
Amazon Web Services

#14

Nathan Bossart

nathandbossart@gmail.com

3 months ago

In reply to: John Naylor (#13)

Re: refactor architecture-specific popcount code

On Thu, Jan 29, 2026 at 06:31:53PM +0700, John Naylor wrote:

On Mon, Jan 26, 2026 at 10:41 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:

I added a 0003 that swaps that final popcount with pg_popcount().

I'm not sure either if this part matters much, but it makes more sense
to me to continue using single word functions for that last part.
Since they have very few call sites anymore, we can make them inline
without bloating the binary on x86.

Okay, I abandoned that patch.

Right, just the portable ones. Here, too, inlining them everywhere
would mitigate any impact.

Done.

+static inline int
+pg_popcount64_neon(uint64 word)

...if they were inlined from the header, I think we wouldn't need this
separate neon function in this file at all. Currently, we rely on
__builtin_popcountl for the portable function outside this file. We
could either keep using that or switch to neon if there's a
portability difference.

Done.

--
nathan

#15

John Naylor

john.naylor@enterprisedb.com

3 months ago

In reply to: Nathan Bossart (#14)

Re: refactor architecture-specific popcount code

On Fri, Jan 30, 2026 at 12:06 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

[v5]

0001 - I'm pretty sure this is comparable to HEAD if the optimized
function is pg_popcount_sse42(). Has the AVX512 version been tested
with 8-byte inputs? That seems to have a lot of pre- and
post-processing involved. The inline wrapper only bypasses for 7 or
less bytes.

0002
- I tried running this on x86-64 with alignment sanitizer and no
alarms went off during "make check", but adding
pg_attribute_no_sanitize_alignment() would prevent surprises in the
future.
- I imagine that the old SIZEOF_VOID_P check is superfluous now, since
the whole file is gated by HAVE_X86_64_POPCNTQ.
- Maybe we can remove the aligned 32-bit path in
pg_popcount_(masked_)portable(), since that's on-topic for this patch
and would simplify things further.

--
John Naylor
Amazon Web Services

#16

Nathan Bossart

nathandbossart@gmail.com

3 months ago

In reply to: John Naylor (#15)

Re: refactor architecture-specific popcount code

On Fri, Jan 30, 2026 at 03:22:45PM +0700, John Naylor wrote:

0001 - I'm pretty sure this is comparable to HEAD if the optimized
function is pg_popcount_sse42(). Has the AVX512 version been tested
with 8-byte inputs? That seems to have a lot of pre- and
post-processing involved. The inline wrapper only bypasses for 7 or
less bytes.

Here [0]/messages/by-id/20240404171828.GA3866970@nathanxps13 is the latest perf data I see for the AVX-512 popcount patch,
although that's comparing to v16, which IIRC lacks a few other inlining
tricks. There's a chance the SSE4.2 version is faster at that particular
length. I'm not sure we need to worry about that, but I can do a bit of
testing if you'd like.

0002
- I tried running this on x86-64 with alignment sanitizer and no
alarms went off during "make check", but adding
pg_attribute_no_sanitize_alignment() would prevent surprises in the
future.

Done.

- I imagine that the old SIZEOF_VOID_P check is superfluous now, since
the whole file is gated by HAVE_X86_64_POPCNTQ.

I think you're right. There was some concern about this when I was first
adding the SSE4.2-specific pg_popcount() [1]/messages/by-id/CAApHDvojPyh6dLKooqjXSZE=0Ed590Lq1BxF7WQ9knSggyuJEA@mail.gmail.com, but all the configure-time
checks for HAVE_X86_64_POPCNTQ are restricted to 64-bit x86, so I bet we
could safely assume SIZEOF_VOID_P == 8 in that file.

- Maybe we can remove the aligned 32-bit path in
pg_popcount_(masked_)portable(), since that's on-topic for this patch
and would simplify things further.

IMHO that's a reasonable thing for us to do.

[0]: /messages/by-id/20240404171828.GA3866970@nathanxps13
[1]: /messages/by-id/CAApHDvojPyh6dLKooqjXSZE=0Ed590Lq1BxF7WQ9knSggyuJEA@mail.gmail.com

--
nathan

#17

John Naylor

john.naylor@enterprisedb.com

3 months ago

In reply to: Nathan Bossart (#16)

Re: refactor architecture-specific popcount code

On Sat, Jan 31, 2026 at 4:33 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Fri, Jan 30, 2026 at 03:22:45PM +0700, John Naylor wrote:

0001 - I'm pretty sure this is comparable to HEAD if the optimized
function is pg_popcount_sse42(). Has the AVX512 version been tested
with 8-byte inputs? That seems to have a lot of pre- and
post-processing involved. The inline wrapper only bypasses for 7 or
less bytes.

Here [0] is the latest perf data I see for the AVX-512 popcount patch,
although that's comparing to v16, which IIRC lacks a few other inlining
tricks. There's a chance the SSE4.2 version is faster at that particular
length. I'm not sure we need to worry about that, but I can do a bit of
testing if you'd like.

It might be a good idea to do a little new testing, and I see a use
for a special 8-byte path independent of AVX512: v6 seems to regress a
little for single-words. But, it turns out that when gcc turns
__builtin_popcountl into a single instruction, it's inline, but if it
emits portable bitwise ops, it does so in a function called
__popcountdi2(). That can be avoided by hand-coding in C for normal
builds (and for 32-bit looks cleaner anyway), as in the attached 0005.

My laptop here is really too old to make decisions that are
micro-architecture dependent, but with that caveat, I dusted off the
popcount benchmark and added a test for counting bitmapsets (v7-0004,
applies on top of v6):

select drive_bms_num_members(10000000, 1);

master: 13.2 ticks per call
v6: 15.3
v6+v7-0005 10.8

Again, take this with a grain of salt, but 0005 seems worth looking at.

--
John Naylor
Amazon Web Services

#18

Nathan Bossart

nathandbossart@gmail.com

3 months ago

In reply to: John Naylor (#17)

Re: refactor architecture-specific popcount code

On Mon, Feb 02, 2026 at 09:16:42PM +0700, John Naylor wrote:

It might be a good idea to do a little new testing, and I see a use
for a special 8-byte path independent of AVX512: v6 seems to regress a
little for single-words. But, it turns out that when gcc turns
__builtin_popcountl into a single instruction, it's inline, but if it
emits portable bitwise ops, it does so in a function called
__popcountdi2(). That can be avoided by hand-coding in C for normal
builds (and for 32-bit looks cleaner anyway), as in the attached 0005.

Oh, interesting. I looked into this a little more [0]https://godbolt.org/z/he3WozG3E. Both gcc and clang
generate cnt instructions for aarch64, so we're good there. However, clang
on x86-64 generates the bit-twiddling version, and gcc on x86-64 generates
a call to __popcountdi2() (which I imagine does something similar). It's
not until you provide a compiler flag like -march=x86-64-v2 that gcc/clang
start generating popcnt instructions for x86-64, which makes sense. 0005
seems like the correct move to me...

[0]: https://godbolt.org/z/he3WozG3E

--
nathan

#19

Greg Burd

greg@burd.me

3 months ago

In reply to: Nathan Bossart (#18)

Re: refactor architecture-specific popcount code

On Mon, Feb 2, 2026, at 5:51 PM, Nathan Bossart wrote:

On Mon, Feb 02, 2026 at 09:16:42PM +0700, John Naylor wrote:

It might be a good idea to do a little new testing, and I see a use
for a special 8-byte path independent of AVX512: v6 seems to regress a
little for single-words. But, it turns out that when gcc turns
__builtin_popcountl into a single instruction, it's inline, but if it
emits portable bitwise ops, it does so in a function called
__popcountdi2(). That can be avoided by hand-coding in C for normal
builds (and for 32-bit looks cleaner anyway), as in the attached 0005.

Oh, interesting. I looked into this a little more [0]. Both gcc and clang
generate cnt instructions for aarch64, so we're good there. However, clang
on x86-64 generates the bit-twiddling version, and gcc on x86-64 generates
a call to __popcountdi2() (which I imagine does something similar). It's
not until you provide a compiler flag like -march=x86-64-v2 that gcc/clang
start generating popcnt instructions for x86-64, which makes sense. 0005
seems like the correct move to me...

[0] https://godbolt.org/z/he3WozG3E

--
nathan

Nathan, John,

Thanks for the focus on this area of the code. I've been looking into what to do with popcnt when building Win11/ARM64/MSVC. I know that when _MSC_VER and _M_ARM64 are defined we can make use of the __popcnt(unsigned int) and __popcnt64(unsigned __int64) intrinsics which have been available since VS 2022 17.11+. I thought I'd check that combo out and it turns out that it is identical to clang/gcc on that platform [0]https://godbolt.org/z/TrxjzcPGE.

I'll wait for your work to land before proposing a patch to add these unless it is really easy to fit it and you feel like giving it a go. :)

best.

-greg

[0]: https://godbolt.org/z/TrxjzcPGE

#20

Nathan Bossart

nathandbossart@gmail.com

3 months ago

In reply to: Greg Burd (#19)

Re: refactor architecture-specific popcount code

On Tue, Feb 03, 2026 at 12:19:48PM -0500, Greg Burd wrote:

Thanks for the focus on this area of the code. I've been looking into
what to do with popcnt when building Win11/ARM64/MSVC. I know that when
_MSC_VER and _M_ARM64 are defined we can make use of the
__popcnt(unsigned int) and __popcnt64(unsigned __int64) intrinsics which
have been available since VS 2022 17.11+. I thought I'd check that combo
out and it turns out that it is identical to clang/gcc on that platform
[0].

I'll wait for your work to land before proposing a patch to add these
unless it is really easy to fit it and you feel like giving it a go. :)

We should probably just add something like

#ifdef _MSC_VER
return __popcnt(word);

for the new inlined versions of pg_popcount{32,64}. We're already doing
that today for x86-64, and the AArch64-specific versions use intrinsics
that in theory compile to the same thing. Plus, popcnt is required for
Windows these days.

I'm working on polishing/benchmarking these patches at the moment, so I
will work this change in. Thanks!

--
nathan

#21

Greg Burd

greg@burd.me

3 months ago

In reply to: Nathan Bossart (#20)

#22

Nathan Bossart

nathandbossart@gmail.com

3 months ago

In reply to: Greg Burd (#21)