Use POPCNT on MSVC

Started by David Rowleyover 4 years ago8 messages
#1David Rowley
dgrowleyml@gmail.com
1 attachment(s)

Going by [1]https://docs.microsoft.com/en-us/cpp/intrinsics/popcnt16-popcnt-popcnt64?view=msvc-140, it looks like we can use the __popcnt and __popcnt64
intrinsic functions on MSVC if the CPU supports POPCNT. We already
have code to check for that, we just need to enable it on MSVC.

The attached patch seems to be all that's needed.

David

[1]: https://docs.microsoft.com/en-us/cpp/intrinsics/popcnt16-popcnt-popcnt64?view=msvc-140

Attachments:

use_popcnt_on_msvc.patchapplication/octet-stream; name=use_popcnt_on_msvc.patchDownload
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 2252021854..dcbd270b71 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -110,7 +110,7 @@ const uint8 pg_number_of_ones[256] = {
  * Otherwise, we fall back to __builtin_popcount if the compiler has that,
  * or a hand-rolled implementation if not.
  */
-#ifdef HAVE_X86_64_POPCNTQ
+#if defined(HAVE_X86_64_POPCNTQ) || (defined(_MSC_VER) && defined(_WIN64))
 #if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
 #define USE_POPCNT_ASM 1
 #endif
@@ -201,10 +201,14 @@ pg_popcount64_choose(uint64 word)
 static int
 pg_popcount32_asm(uint32 word)
 {
+#ifdef _MSC_VER
+	return __popcnt(word);
+#else
 	uint32		res;
 
 __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
 	return (int) res;
+#endif
 }
 
 /*
@@ -214,10 +218,14 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
 static int
 pg_popcount64_asm(uint64 word)
 {
+#ifdef _MSC_VER
+	return __popcnt64(word);
+#else
 	uint64		res;
 
 __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
 	return (int) res;
+#endif
 }
 
 #endif							/* USE_POPCNT_ASM */
#2John Naylor
john.naylor@enterprisedb.com
In reply to: David Rowley (#1)
Re: Use POPCNT on MSVC

On Tue, Aug 3, 2021 at 5:03 AM David Rowley <dgrowleyml@gmail.com> wrote:

Going by [1], it looks like we can use the __popcnt and __popcnt64
intrinsic functions on MSVC if the CPU supports POPCNT. We already
have code to check for that, we just need to enable it on MSVC.

The attached patch seems to be all that's needed.

+1 for the concept, but the coding is a bit strange. Granted, the way we
handle popcnt is a bit strange, but this makes it stranger:

1. the __popcnt64() intrinsic is put inside pg_popcount64_asm(), which is a
bit of a misnomer since it's not assembly. Renaming s/_asm/_fast/ would
help it look better. But then looking around, other platforms have
intrinsics coded, but for some reason they're put in pg_popcount64_slow(),
where the compiler will emit instructions we could easily write ourselves
in C (and without #ifdefs) since without the right CFLAGS these intrinsics
won't emit a popcnt instruction. Is MSVC different in that regard? If so,
it might be worth a comment.

2. (defined(_MSC_VER) && defined(_WIN64) lead to a runtime check for the
CPUID, which is fine, but now next to it HAVE_X86_64_POPCNTQ looks strange
because the latter symbol comes from a specific configure test -- the two
don't seem equivalent, but maybe they are because of what MSVC does? That
would be nice to spell out here.

I know the present state of affairs works a bit different than what you
proposed a couple years ago and that something had to change for
portability reasons I didn't quite understand, but I think if we change
things here we should at least try to have things fit together a bit more
nicely.

(Side note, but sort of related to #1 above: non-x86 platforms have to
indirect through a function pointer even though they have no fast
implementation to make it worth their while. It would be better for them if
the "slow" implementation was called static inline or at least a direct
function call, but that's a separate thread.)

--
John Naylor
EDB: http://www.enterprisedb.com

#3Thomas Munro
thomas.munro@gmail.com
In reply to: John Naylor (#2)
Re: Use POPCNT on MSVC

On Tue, Aug 3, 2021 at 10:43 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

(Side note, but sort of related to #1 above: non-x86 platforms have to indirect through a function pointer even though they have no fast implementation to make it worth their while. It would be better for them if the "slow" implementation was called static inline or at least a direct function call, but that's a separate thread.)

+1

I haven't looked into whether we could benefit from it in real use
cases, but it seems like it'd also be nice if pg_popcount() were a
candidate for auto-vectorisation and inlining. For example, NEON has
vector popcount, and for Intel/AMD there is a shuffle-based AVX2 trick
that at least Clang produces automatically[1]https://arxiv.org/abs/1611.07612. We're obstructing that
by doing function dispatch at individual word level, and using inline
assembler instead of builtins.

[1]: https://arxiv.org/abs/1611.07612

#4David Rowley
dgrowleyml@gmail.com
In reply to: John Naylor (#2)
Re: Use POPCNT on MSVC

On Tue, 3 Aug 2021 at 22:43, John Naylor <john.naylor@enterprisedb.com> wrote:

1. the __popcnt64() intrinsic is put inside pg_popcount64_asm(), which is a bit of a misnomer since it's not assembly. Renaming s/_asm/_fast/ would help it look better. But then looking around, other platforms have intrinsics coded, but for some reason they're put in pg_popcount64_slow(), where the compiler will emit instructions we could easily write ourselves in C (and without #ifdefs) since without the right CFLAGS these intrinsics won't emit a popcnt instruction. Is MSVC different in that regard? If so, it might be worth a comment.

Yeah, the names no longer fit so well after stuffing the MSVC
intrinsic into the asm function. The reason I did it that way is down
to what I read in the docs. Namely:

"If you run code that uses these intrinsics on hardware that doesn't
support the popcnt instruction, the results are unpredictable."

So, it has to go somewhere where we're sure the CPU supports POPCNT
and that seemed like the correct place.

From what I understand of GCC and __builtin_popcountl(), the code
it'll output will depend on the -mpopcnt flag. So having
__builtin_popcountl() does not mean the popcnt instruction will be
used. The Microsoft documentation indicates that's not the case for
__popcnt().

2. (defined(_MSC_VER) && defined(_WIN64) lead to a runtime check for the CPUID, which is fine, but now next to it HAVE_X86_64_POPCNTQ looks strange because the latter symbol comes from a specific configure test -- the two don't seem equivalent, but maybe they are because of what MSVC does? That would be nice to spell out here.

The problem there is that we define HAVE_X86_64_POPCNTQ based on the
outcome of configure so it does not get set for MSVC. Maybe it's
worth coming up with some other constant that can be defined or we
could just do:

#if defined(_MSC_VER) && defined(_WIN64)
#define HAVE_X86_64_POPCNTQ
#endif

I know the present state of affairs works a bit different than what you proposed a couple years ago and that something had to change for portability reasons I didn't quite understand, but I think if we change things here we should at least try to have things fit together a bit more nicely.

I think the reason for the asm is that __builtin_popcountl does not
mean popcnt will be used. Maybe we could have done something like
compile pg_bitutils.c with -mpopcnt, and have kept the run-time check,
but the problem there is that means that the compiler might end up
using that instruction in some other function in that file that we
don't want it to. It looks like my patch in [1]/messages/by-id/CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com did pass the -mpopcnt
flag which Tom fixed.

(Side note, but sort of related to #1 above: non-x86 platforms have to indirect through a function pointer even though they have no fast implementation to make it worth their while. It would be better for them if the "slow" implementation was called static inline or at least a direct function call, but that's a separate thread.)

hmm yeah. I see there's a few more usages of pg_popcount() than when I
looked last. It would be fairly easy to get rid of the
pg_popcount64/pg_popcount32 function call in that. A separate patch
though.

David

[1]: /messages/by-id/CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com

#5John Naylor
john.naylor@enterprisedb.com
In reply to: David Rowley (#4)
Re: Use POPCNT on MSVC

On Tue, Aug 3, 2021 at 11:36 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 3 Aug 2021 at 22:43, John Naylor <john.naylor@enterprisedb.com>

wrote:

1. the __popcnt64() intrinsic is put inside pg_popcount64_asm(), which

is a bit of a misnomer since it's not assembly. Renaming s/_asm/_fast/
would help it look better. But then looking around, other platforms have
intrinsics coded, but for some reason they're put in pg_popcount64_slow(),
where the compiler will emit instructions we could easily write ourselves
in C (and without #ifdefs) since without the right CFLAGS these intrinsics
won't emit a popcnt instruction. Is MSVC different in that regard? If so,
it might be worth a comment.

Yeah, the names no longer fit so well after stuffing the MSVC
intrinsic into the asm function. The reason I did it that way is down
to what I read in the docs. Namely:

"If you run code that uses these intrinsics on hardware that doesn't
support the popcnt instruction, the results are unpredictable."

So, it has to go somewhere where we're sure the CPU supports POPCNT
and that seemed like the correct place.

From what I understand of GCC and __builtin_popcountl(), the code
it'll output will depend on the -mpopcnt flag. So having
__builtin_popcountl() does not mean the popcnt instruction will be
used. The Microsoft documentation indicates that's not the case for
__popcnt().

Okay, "unpredictable" sounds bad.

2. (defined(_MSC_VER) && defined(_WIN64) lead to a runtime check for

the CPUID, which is fine, but now next to it HAVE_X86_64_POPCNTQ looks
strange because the latter symbol comes from a specific configure test --
the two don't seem equivalent, but maybe they are because of what MSVC
does? That would be nice to spell out here.

The problem there is that we define HAVE_X86_64_POPCNTQ based on the
outcome of configure so it does not get set for MSVC. Maybe it's
worth coming up with some other constant that can be defined or we
could just do:

#if defined(_MSC_VER) && defined(_WIN64)
#define HAVE_X86_64_POPCNTQ
#endif

That seems fine. I don't know PG can build with Arm on Windows, but for the
cpuid to work, it seems safer to also check for __x86_64?

I think the reason for the asm is that __builtin_popcountl does not
mean popcnt will be used. Maybe we could have done something like
compile pg_bitutils.c with -mpopcnt, and have kept the run-time check,
but the problem there is that means that the compiler might end up
using that instruction in some other function in that file that we
don't want it to. It looks like my patch in [1] did pass the -mpopcnt
flag which Tom fixed.

Ah, that makes sense. (If we someday offer a configure option for
x86-64-v2, we can use intrinsics in the asm functions, and call them
directly. Yet another different thread.)

--
John Naylor
EDB: http://www.enterprisedb.com

#6David Rowley
dgrowleyml@gmail.com
In reply to: John Naylor (#5)
1 attachment(s)
Re: Use POPCNT on MSVC

On Thu, 5 Aug 2021 at 07:02, John Naylor <john.naylor@enterprisedb.com> wrote:

#if defined(_MSC_VER) && defined(_WIN64)
#define HAVE_X86_64_POPCNTQ
#endif

That seems fine. I don't know PG can build with Arm on Windows, but for the cpuid to work, it seems safer to also check for __x86_64?

That's a good point. I've adjusted it to do #if defined(_MSC_VER) &&
defined(_M_AMD64).

I've attached a v2 patch which I think is more along the lines of what
you had in mind.

David

Attachments:

use_popcnt_on_msvc_v2.patchapplication/octet-stream; name=use_popcnt_on_msvc_v2.patchDownload
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index dcbd270b71..d4d65abeef 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,6 +103,15 @@ const uint8 pg_number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
+/*
+ * With MSVC on x86_64 builds, try using native popcnt instructions via the
+ * __popcnt and __popcnt64 intrinsics.  These don't work the same as GCC's
+ * __builtin_popcount* macros as they always emit popcnt instructions.
+ */
+#if defined(_MSC_VER) && defined(_M_AMD64)
+#define HAVE_X86_64_POPCNTQ
+#endif
+
 /*
  * On x86_64, we can use the hardware popcount instruction, but only if
  * we can verify that the CPU supports it via the cpuid instruction.
@@ -110,30 +119,30 @@ const uint8 pg_number_of_ones[256] = {
  * Otherwise, we fall back to __builtin_popcount if the compiler has that,
  * or a hand-rolled implementation if not.
  */
-#if defined(HAVE_X86_64_POPCNTQ) || (defined(_MSC_VER) && defined(_WIN64))
+#ifdef HAVE_X86_64_POPCNTQ
 #if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
-#define USE_POPCNT_ASM 1
+#define TRY_POPCNT_FAST 1
 #endif
 #endif
 
 static int	pg_popcount32_slow(uint32 word);
 static int	pg_popcount64_slow(uint64 word);
 
-#ifdef USE_POPCNT_ASM
+#ifdef TRY_POPCNT_FAST
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
-static int	pg_popcount32_asm(uint32 word);
-static int	pg_popcount64_asm(uint64 word);
+static int	pg_popcount32_fast(uint32 word);
+static int	pg_popcount64_fast(uint64 word);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 #else
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_slow;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_slow;
-#endif							/* USE_POPCNT_ASM */
+#endif							/* TRY_POPCNT_FAST */
 
-#ifdef USE_POPCNT_ASM
+#ifdef TRY_POPCNT_FAST
 
 /*
  * Return true if CPUID indicates that the POPCNT instruction is available.
@@ -165,8 +174,8 @@ pg_popcount32_choose(uint32 word)
 {
 	if (pg_popcount_available())
 	{
-		pg_popcount32 = pg_popcount32_asm;
-		pg_popcount64 = pg_popcount64_asm;
+		pg_popcount32 = pg_popcount32_fast;
+		pg_popcount64 = pg_popcount64_fast;
 	}
 	else
 	{
@@ -182,8 +191,8 @@ pg_popcount64_choose(uint64 word)
 {
 	if (pg_popcount_available())
 	{
-		pg_popcount32 = pg_popcount32_asm;
-		pg_popcount64 = pg_popcount64_asm;
+		pg_popcount32 = pg_popcount32_fast;
+		pg_popcount64 = pg_popcount64_fast;
 	}
 	else
 	{
@@ -195,11 +204,11 @@ pg_popcount64_choose(uint64 word)
 }
 
 /*
- * pg_popcount32_asm
+ * pg_popcount32_fast
  *		Return the number of 1 bits set in word
  */
 static int
-pg_popcount32_asm(uint32 word)
+pg_popcount32_fast(uint32 word)
 {
 #ifdef _MSC_VER
 	return __popcnt(word);
@@ -212,11 +221,11 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
 }
 
 /*
- * pg_popcount64_asm
+ * pg_popcount64_fast
  *		Return the number of 1 bits set in word
  */
 static int
-pg_popcount64_asm(uint64 word)
+pg_popcount64_fast(uint64 word)
 {
 #ifdef _MSC_VER
 	return __popcnt64(word);
@@ -228,7 +237,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
 #endif
 }
 
-#endif							/* USE_POPCNT_ASM */
+#endif							/* TRY_POPCNT_FAST */
 
 
 /*
#7John Naylor
john.naylor@enterprisedb.com
In reply to: David Rowley (#6)
Re: Use POPCNT on MSVC

On Sun, Aug 8, 2021 at 8:31 PM David Rowley <dgrowleyml@gmail.com> wrote:

I've attached a v2 patch which I think is more along the lines of what
you had in mind.

LGTM

--
John Naylor
EDB: http://www.enterprisedb.com

#8David Rowley
dgrowleyml@gmail.com
In reply to: John Naylor (#7)
Re: Use POPCNT on MSVC

On Mon, 9 Aug 2021 at 12:58, John Naylor <john.naylor@enterprisedb.com> wrote:

On Sun, Aug 8, 2021 at 8:31 PM David Rowley <dgrowleyml@gmail.com> wrote:

I've attached a v2 patch which I think is more along the lines of what
you had in mind.

LGTM

Thanks for the review.

Pushed.

David