Using POPCNT and other advanced bit manipulation instructions

Started by David Rowley about 7 years ago, 49 messages
#1 David Rowley
david.rowley@2ndquadrant.com
1 attachment(s)

Back in 2016 [1] there was some discussion about using the POPCNT
instruction to improve the performance of counting the number of bits
set in a word. Improving this helps various cases, such as
bms_num_members and also things like counting the allvisible and
frozen pages in the visibility map.

Thomas Munro did some work to make this happen but didn't go as far as
adding the required run-time test to allow builds which were built on
a machine with these instructions to work on a machine without them.
We've now got other places in the code which have similar run-time
tests (for example CRC calculation), so I think we should be able to
do the same for the ABM instructions.

Thomas mentions in [1] that, to get GCC to use the POPCNT instruction,
we must pass -mpopcnt in the build flags. After doing a bit of
research, I found [2], which mentions that some compilers have some
pattern matching code to allow the popcnt instruction to be used even
without a builtin such as __builtin_popcount(). I believe I've
correctly written the run-time test to skip using the new popcnt
function, but if there's any code around that might cause the compiler
to use the popcnt instruction from pattern matching, then that might
cause problems. Remember, that's not limited to core code since
pg_config will cause extensions to be compiled with -mpopcnt too.
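
For illustration, the kind of loop discussed in [2] looks roughly like the
sketch below. The function name count_set_bits is made up for this example
and is not part of the patch. Some compilers are reported to recognise this
clear-the-lowest-set-bit idiom and, when -mpopcnt is in effect, may replace
the whole loop with a single popcnt instruction even though no builtin is
used. Whether a given compiler version actually does so is something to
check in the generated assembly rather than assume.

```c
#include <stdint.h>

/*
 * Hypothetical example, not part of the patch: count the set bits in a
 * word by repeatedly clearing the lowest set bit.
 */
int
count_set_bits(uint64_t x)
{
	int			n = 0;

	while (x != 0)
	{
		x &= x - 1;				/* clears the lowest set bit */
		n++;
	}
	return n;
}
```

If code like this ends up compiled to popcnt, it bypasses any run-time
check, which is the concern raised above.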

I've put together a very rough patch to implement using POPCNT and the
leading and trailing 0-bit instructions to improve the performance of
bms_next_member() and bms_prev_member(). The correct function should
be determined on the first call to each function by way of setting a
function pointer variable to the most suitable supported
implementation. I've not yet gone through and removed all the
number_of_ones[] arrays to replace with a pg_popcount*() call. That
seems to have mostly been done in Thomas' patch [3], part of which
I've used for the visibilitymap.c code changes. If this patch proves
to be possible, then I'll look at including the other changes Thomas
made in his patch too.
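
As a rough sketch of the first-call dispatch described above: the function
pointer starts out pointing at a "choose" routine, which picks an
implementation, overwrites the pointer, and then forwards the call. All
names here are hypothetical and differ from those in the patch. The sketch
also uses GCC's __builtin_cpu_supports("popcnt") as a stand-in for the
patch's own cpuid test, and a simple bit-clearing loop as the fallback
where the patch uses a lookup table.

```c
#include <stdint.h>

/* Hypothetical names throughout; the patch's own functions differ. */
static int	popcount32_choose(uint32_t word);

/* Starts out pointing at the chooser; overwritten on the first call. */
int			(*popcount32) (uint32_t word) = popcount32_choose;

/* Portable fallback: clear the lowest set bit until none remain. */
static int
popcount32_portable(uint32_t word)
{
	int			n = 0;

	while (word != 0)
	{
		word &= word - 1;
		n++;
	}
	return n;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Fast path: only reached once the CPU is known to support POPCNT. */
static int
popcount32_fast(uint32_t word)
{
	return __builtin_popcount(word);
}
#endif

/* Runs once: select an implementation, then forward this first call. */
static int
popcount32_choose(uint32_t word)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	if (__builtin_cpu_supports("popcnt"))
		popcount32 = popcount32_fast;
	else
#endif
		popcount32 = popcount32_portable;

	return popcount32(word);
}
```

Callers simply write popcount32(w). After the first call the pointer no
longer refers to popcount32_choose, so the CPU test is not repeated.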

What I'm really looking for by posting now are reasons why we can't do
this. I'm also interested in getting some testing done on older
machines, particularly machines with processors that are from before
2007, both AMD and Intel. 2007-2008 seems to be around the time both
AMD and Intel added support for POPCNT and LZCNT, going by [4].

I'm also a little uncertain of my cpuid bit tests. POPCNT appears to
use bit 5 in EAX=80000001h, but also bit 23 in EAX=1 [5]. This
appears to be a variation between Intel and AMD. AMD always implement
either both POPCNT and LZCNT or neither, whereas Intel use AMD's cpuid
bit flag just for LZCNT and have reserved their own flag for POPCNT
(they didn't implement both at once, as AMD did). I'm a bit uncertain
if AMD will set the Intel POPCNT flag or not, and if they do now, then
I'm not sure if they always did. Intel were 2nd in that race, so I
assume at least the earliest AMD processors would just set only the
AMD flag. Testing might help reveal if I have this right.
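
To make the two flags concrete, here is a minimal sketch of how both could
be read, assuming GCC or Clang on x86 and the __get_cpuid() helper from
<cpuid.h>. The helper and function names are hypothetical. Going by [5],
both flags are reported in ECX: bit 23 of leaf 1 is the POPCNT flag, and
bit 5 of leaf 80000001h is the ABM/LZCNT flag. On other platforms the
sketch just reports false.

```c
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Hypothetical helper: test bit 'bit' of ECX for CPUID leaf 'leaf'. */
static bool
cpuid_ecx_bit(unsigned int leaf, int bit)
{
#ifdef HAVE_X86_CPUID
	unsigned int eax = 0;
	unsigned int ebx = 0;
	unsigned int ecx = 0;
	unsigned int edx = 0;

	/* __get_cpuid() returns 0 when the requested leaf is unsupported */
	if (!__get_cpuid(leaf, &eax, &ebx, &ecx, &edx))
		return false;

	return (ecx & (1u << bit)) != 0;
#else
	(void) leaf;
	(void) bit;
	return false;
#endif
}

/* Leaf 1, ECX bit 23: the POPCNT flag. */
bool
cpu_has_popcnt_flag(void)
{
	return cpuid_ecx_bit(1, 23);
}

/* Leaf 80000001h, ECX bit 5: the ABM (LZCNT) flag. */
bool
cpu_has_abm_flag(void)
{
	return cpuid_ecx_bit(0x80000001u, 5);
}
```

Printing both results on a range of older AMD and Intel machines would be
one way to answer the question of which flag each vendor sets.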

I am able to measure performance gains from the patch. In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

Patched:

postgres=# analyze t1;
Time: 680.833 ms
Time: 699.976 ms
Time: 695.608 ms
Time: 676.007 ms
Time: 693.487 ms
Time: 726.982 ms
Time: 677.835 ms
Time: 688.426 ms

Master:

postgres=# analyze t1;
Time: 721.837 ms
Time: 756.035 ms
Time: 734.545 ms
Time: 751.969 ms
Time: 730.140 ms
Time: 724.266 ms
Time: 713.625 ms

(+3.66% avg)

This should be down to the improved performance of
visibilitymap_count(), but it may not be entirely from the faster bit
counting, as I also couldn't resist tightening up the innermost loop.
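
The idea behind the visibilitymap_count() change can be sketched in
isolation as follows. The visibility map stores two bits per heap block,
so masking each 64-bit word with 0x5555... keeps only the lower bit of
each pair (all-visible) and 0xaaaa... keeps only the upper bit
(all-frozen); a popcount of each masked word then gives the per-word
totals. The mask names match the patch, but count_vm_bits and
popcount64_portable are hypothetical stand-ins, and the portable loop
replaces the patch's pg_popcount64().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VISIBLE_MASK64 UINT64_C(0x5555555555555555)	/* lower bit of each pair */
#define FROZEN_MASK64  UINT64_C(0xaaaaaaaaaaaaaaaa)	/* upper bit of each pair */

/* Hypothetical stand-in for pg_popcount64(). */
static int
popcount64_portable(uint64_t word)
{
	int			n = 0;

	while (word != 0)
	{
		word &= word - 1;		/* clears the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Count all-visible and (optionally) all-frozen bits in 'nwords' words.
 * all_frozen may be NULL, in which case only visible bits are counted.
 */
void
count_vm_bits(const uint64_t *map, size_t nwords,
			  uint64_t *all_visible, uint64_t *all_frozen)
{
	uint64_t	nvisible = 0;
	uint64_t	nfrozen = 0;
	size_t		i;

	assert(all_visible != NULL);

	if (all_frozen == NULL)
	{
		/* The NULL test is hoisted so the inner loop has no branch */
		for (i = 0; i < nwords; i++)
			nvisible += popcount64_portable(map[i] & VISIBLE_MASK64);
	}
	else
	{
		for (i = 0; i < nwords; i++)
		{
			nvisible += popcount64_portable(map[i] & VISIBLE_MASK64);
			nfrozen += popcount64_portable(map[i] & FROZEN_MASK64);
		}
	}

	*all_visible = nvisible;
	if (all_frozen != NULL)
		*all_frozen = nfrozen;
}
```

Accumulating into local variables and writing the results once at the end
mirrors the loop tightening mentioned above.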

[1]: /messages/by-id/CAEepm=3k++Ytf2LNQCvpP6m1=gY9zZHP_cfnn47=WTsoCrLCvA@mail.gmail.com
[2]: https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
[3]: /messages/by-id/CAEepm=3g1_fjJGp38QGv+38BC2HHVkzUq6s69nk3mWLgPHqC3A@mail.gmail.com
[4]: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT
[5]: https://en.wikipedia.org/wiki/CPUID#CPUID_usage_from_high-level_languages

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v1-0001-Add-basic-support-for-using-the-POPCNT-and-SSE4.2.patch (application/octet-stream)
From d2081e48423d4c1703c30f9f375730de4a53cbbf Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 20 Dec 2018 17:46:35 +1300
Subject: [PATCH v1] Add basic support for using the POPCNT and SSE4.2s LZCNT
 opcodes

These opcodes have been around in the AMD world since 2007, and 2008 in
the case of Intel. They're supported in GCC and Clang via some __builtin
macros.  The opcodes may be unavailable during runtime, in which case we
fall back on a C-based implementation of the code.  In order to get the
POPCNT instruction we must pass the -mpopcnt option to the compiler, when
supported.

David Rowley and Thomas Munro
---
 config/c-compiler.m4                    | 108 ++++++++
 configure                               | 236 ++++++++++++++++++
 configure.in                            |   9 +
 src/backend/access/heap/visibilitymap.c |  72 ++----
 src/backend/nodes/bitmapset.c           | 161 ++++--------
 src/backend/utils/adt/Makefile          |   2 +-
 src/backend/utils/adt/bitutils.c        | 424 ++++++++++++++++++++++++++++++++
 src/include/pg_config.h.in              |  18 ++
 src/include/pg_config.h.win32           |  18 ++
 src/include/utils/bitutils.h            |  52 ++++
 10 files changed, 937 insertions(+), 163 deletions(-)
 create mode 100644 src/backend/utils/adt/bitutils.c
 create mode 100644 src/include/utils/bitutils.h

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index af2dea1c2a..ac73416dd1 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -378,6 +378,114 @@ fi])# PGAC_C_BUILTIN_OP_OVERFLOW
 
 
 
+# PGAC_C_BUILTIN_POPCOUNT
+# -------------------------
+# Check if the C compiler understands __builtin_popcount(),
+# and define HAVE__BUILTIN_POPCOUNT if so.
+AC_DEFUN([PGAC_C_BUILTIN_POPCOUNT],
+[AC_CACHE_CHECK(for __builtin_popcount, pgac_cv__builtin_popcount,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_popcount(255);]
+)],
+[pgac_cv__builtin_popcount=yes],
+[pgac_cv__builtin_popcount=no])])
+if test x"$pgac_cv__builtin_popcount" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_POPCOUNT, 1,
+          [Define to 1 if your compiler understands __builtin_popcount.])
+fi])# PGAC_C_BUILTIN_POPCOUNT
+
+
+
+# PGAC_C_BUILTIN_POPCOUNTL
+# -------------------------
+# Check if the C compiler understands __builtin_popcountl(),
+# and define HAVE__BUILTIN_POPCOUNTL if so.
+AC_DEFUN([PGAC_C_BUILTIN_POPCOUNTL],
+[AC_CACHE_CHECK(for __builtin_popcountl, pgac_cv__builtin_popcountl,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_popcountl(255);]
+)],
+[pgac_cv__builtin_popcountl=yes],
+[pgac_cv__builtin_popcountl=no])])
+if test x"$pgac_cv__builtin_popcountl" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_POPCOUNTL, 1,
+          [Define to 1 if your compiler understands __builtin_popcountl.])
+fi])# PGAC_C_BUILTIN_POPCOUNTL
+
+
+
+# PGAC_C_BUILTIN_CTZ
+# -------------------------
+# Check if the C compiler understands __builtin_ctz(),
+# and define HAVE__BUILTIN_CTZ if so.
+AC_DEFUN([PGAC_C_BUILTIN_CTZ],
+[AC_CACHE_CHECK(for __builtin_ctz, pgac_cv__builtin_ctz,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_ctz(256);]
+)],
+[pgac_cv__builtin_ctz=yes],
+[pgac_cv__builtin_ctz=no])])
+if test x"$pgac_cv__builtin_ctz" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CTZ, 1,
+          [Define to 1 if your compiler understands __builtin_ctz.])
+fi])# PGAC_C_BUILTIN_CTZ
+
+
+
+# PGAC_C_BUILTIN_CTZL
+# -------------------------
+# Check if the C compiler understands __builtin_ctzl(),
+# and define HAVE__BUILTIN_CTZL if so.
+AC_DEFUN([PGAC_C_BUILTIN_CTZL],
+[AC_CACHE_CHECK(for __builtin_ctzl, pgac_cv__builtin_ctzl,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_ctzl(256);]
+)],
+[pgac_cv__builtin_ctzl=yes],
+[pgac_cv__builtin_ctzl=no])])
+if test x"$pgac_cv__builtin_ctzl" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CTZL, 1,
+          [Define to 1 if your compiler understands __builtin_ctzl.])
+fi])# PGAC_C_BUILTIN_CTZL
+
+
+
+# PGAC_C_BUILTIN_CLZ
+# -------------------------
+# Check if the C compiler understands __builtin_clz(),
+# and define HAVE__BUILTIN_CLZ if so.
+AC_DEFUN([PGAC_C_BUILTIN_CLZ],
+[AC_CACHE_CHECK(for __builtin_clz, pgac_cv__builtin_clz,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_clz(256);]
+)],
+[pgac_cv__builtin_clz=yes],
+[pgac_cv__builtin_clz=no])])
+if test x"$pgac_cv__builtin_clz" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CLZ, 1,
+          [Define to 1 if your compiler understands __builtin_clz.])
+fi])# PGAC_C_BUILTIN_CLZ
+
+
+
+# PGAC_C_BUILTIN_CLZL
+# -------------------------
+# Check if the C compiler understands __builtin_clzl(),
+# and define HAVE__BUILTIN_CLZL if so.
+AC_DEFUN([PGAC_C_BUILTIN_CLZL],
+[AC_CACHE_CHECK(for __builtin_clzl, pgac_cv__builtin_clzl,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_clzl(256);]
+)],
+[pgac_cv__builtin_clzl=yes],
+[pgac_cv__builtin_clzl=no])])
+if test x"$pgac_cv__builtin_clzl" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CLZL, 1,
+          [Define to 1 if your compiler understands __builtin_clzl.])
+fi])# PGAC_C_BUILTIN_CLZL
+
+
+
 # PGAC_C_BUILTIN_UNREACHABLE
 # --------------------------
 # Check if the C compiler understands __builtin_unreachable(),
diff --git a/configure b/configure
index ea40f5f03d..3e400fe3e2 100755
--- a/configure
+++ b/configure
@@ -5916,6 +5916,98 @@ if test x"$pgac_cv_prog_CXX_cxxflags__fexcess_precision_standard" = x"yes"; then
 fi
 
 
+  # Enable use of the POPCNT instruction
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -mpopcnt, for CFLAGS" >&5
+$as_echo_n "checking whether ${CC} supports -mpopcnt, for CFLAGS... " >&6; }
+if ${pgac_cv_prog_CC_cflags__mpopcnt+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+pgac_save_CC=$CC
+CC=${CC}
+CFLAGS="${CFLAGS} -mpopcnt"
+ac_save_c_werror_flag=$ac_c_werror_flag
+ac_c_werror_flag=yes
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_prog_CC_cflags__mpopcnt=yes
+else
+  pgac_cv_prog_CC_cflags__mpopcnt=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_c_werror_flag=$ac_save_c_werror_flag
+CFLAGS="$pgac_save_CFLAGS"
+CC="$pgac_save_CC"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__mpopcnt" >&5
+$as_echo "$pgac_cv_prog_CC_cflags__mpopcnt" >&6; }
+if test x"$pgac_cv_prog_CC_cflags__mpopcnt" = x"yes"; then
+  CFLAGS="${CFLAGS} -mpopcnt"
+fi
+
+
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports -mpopcnt, for CXXFLAGS" >&5
+$as_echo_n "checking whether ${CXX} supports -mpopcnt, for CXXFLAGS... " >&6; }
+if ${pgac_cv_prog_CXX_cxxflags__mpopcnt+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CXXFLAGS=$CXXFLAGS
+pgac_save_CXX=$CXX
+CXX=${CXX}
+CXXFLAGS="${CXXFLAGS} -mpopcnt"
+ac_save_cxx_werror_flag=$ac_cxx_werror_flag
+ac_cxx_werror_flag=yes
+ac_ext=cpp
+ac_cpp='$CXXCPP $CPPFLAGS'
+ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5'
+ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
+ac_compiler_gnu=$ac_cv_cxx_compiler_gnu
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+  pgac_cv_prog_CXX_cxxflags__mpopcnt=yes
+else
+  pgac_cv_prog_CXX_cxxflags__mpopcnt=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_ext=c
+ac_cpp='$CPP $CPPFLAGS'
+ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
+ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5'
+ac_compiler_gnu=$ac_cv_c_compiler_gnu
+
+ac_cxx_werror_flag=$ac_save_cxx_werror_flag
+CXXFLAGS="$pgac_save_CXXFLAGS"
+CXX="$pgac_save_CXX"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CXX_cxxflags__mpopcnt" >&5
+$as_echo "$pgac_cv_prog_CXX_cxxflags__mpopcnt" >&6; }
+if test x"$pgac_cv_prog_CXX_cxxflags__mpopcnt" = x"yes"; then
+  CXXFLAGS="${CXXFLAGS} -mpopcnt"
+fi
+
+
   # Optimization flags for specific files that benefit from vectorization
   { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR" >&5
 $as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR... " >&6; }
@@ -14078,6 +14170,150 @@ if test x"$pgac_cv__builtin_constant_p" = xyes ; then
 
 $as_echo "#define HAVE__BUILTIN_CONSTANT_P 1" >>confdefs.h
 
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_popcount" >&5
+$as_echo_n "checking for __builtin_popcount... " >&6; }
+if ${pgac_cv__builtin_popcount+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_popcount(255);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_popcount=yes
+else
+  pgac_cv__builtin_popcount=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_popcount" >&5
+$as_echo "$pgac_cv__builtin_popcount" >&6; }
+if test x"$pgac_cv__builtin_popcount" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_POPCOUNT 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_popcountl" >&5
+$as_echo_n "checking for __builtin_popcountl... " >&6; }
+if ${pgac_cv__builtin_popcountl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_popcountl(255);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_popcountl=yes
+else
+  pgac_cv__builtin_popcountl=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_popcountl" >&5
+$as_echo "$pgac_cv__builtin_popcountl" >&6; }
+if test x"$pgac_cv__builtin_popcountl" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_POPCOUNTL 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz" >&5
+$as_echo_n "checking for __builtin_ctz... " >&6; }
+if ${pgac_cv__builtin_ctz+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_ctz(256);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_ctz=yes
+else
+  pgac_cv__builtin_ctz=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_ctz" >&5
+$as_echo "$pgac_cv__builtin_ctz" >&6; }
+if test x"$pgac_cv__builtin_ctz" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_CTZ 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctzl" >&5
+$as_echo_n "checking for __builtin_ctzl... " >&6; }
+if ${pgac_cv__builtin_ctzl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_ctzl(256);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_ctzl=yes
+else
+  pgac_cv__builtin_ctzl=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_ctzl" >&5
+$as_echo "$pgac_cv__builtin_ctzl" >&6; }
+if test x"$pgac_cv__builtin_ctzl" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_CTZL 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clz" >&5
+$as_echo_n "checking for __builtin_clz... " >&6; }
+if ${pgac_cv__builtin_clz+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_clz(256);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_clz=yes
+else
+  pgac_cv__builtin_clz=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clz" >&5
+$as_echo "$pgac_cv__builtin_clz" >&6; }
+if test x"$pgac_cv__builtin_clz" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_CLZ 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clzl" >&5
+$as_echo_n "checking for __builtin_clzl... " >&6; }
+if ${pgac_cv__builtin_clzl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_clzl(256);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_clzl=yes
+else
+  pgac_cv__builtin_clzl=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clzl" >&5
+$as_echo "$pgac_cv__builtin_clzl" >&6; }
+if test x"$pgac_cv__builtin_clzl" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_CLZL 1" >>confdefs.h
+
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_unreachable" >&5
 $as_echo_n "checking for __builtin_unreachable... " >&6; }
diff --git a/configure.in b/configure.in
index 89a0fb2470..6a78fef36a 100644
--- a/configure.in
+++ b/configure.in
@@ -504,6 +504,9 @@ if test "$GCC" = yes -a "$ICC" = no; then
   # Disable FP optimizations that cause various errors on gcc 4.5+ or maybe 4.6+
   PGAC_PROG_CC_CFLAGS_OPT([-fexcess-precision=standard])
   PGAC_PROG_CXX_CFLAGS_OPT([-fexcess-precision=standard])
+  # Enable use of the POPCNT instruction
+  PGAC_PROG_CC_CFLAGS_OPT([-mpopcnt])
+  PGAC_PROG_CXX_CFLAGS_OPT([-mpopcnt])
   # Optimization flags for specific files that benefit from vectorization
   PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-funroll-loops])
   PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-ftree-vectorize])
@@ -1488,6 +1491,12 @@ PGAC_C_BUILTIN_BSWAP16
 PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_BSWAP64
 PGAC_C_BUILTIN_CONSTANT_P
+PGAC_C_BUILTIN_POPCOUNT
+PGAC_C_BUILTIN_POPCOUNTL
+PGAC_C_BUILTIN_CTZ
+PGAC_C_BUILTIN_CTZL
+PGAC_C_BUILTIN_CLZ
+PGAC_C_BUILTIN_CLZL
 PGAC_C_BUILTIN_UNREACHABLE
 PGAC_C_COMPUTED_GOTO
 PGAC_STRUCT_TIMEZONE
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0..01969d399f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -92,6 +92,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
+#include "utils/bitutils.h"
 #include "utils/inval.h"
 
 
@@ -115,43 +116,9 @@
 #define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
 #define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
 
-/* tables for fast counting of set bits for visible and frozen */
-static const uint8 number_of_ones_for_visible[256] = {
-	0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
-	0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
-};
-static const uint8 number_of_ones_for_frozen[256] = {
-	0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
-	0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
-	0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
-	2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
-	2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
-};
+/* Masks for bit counting. */
+#define VISIBLE_MASK64 0x5555555555555555 /* The lower bit of each bit pair */
+#define FROZEN_MASK64 0xaaaaaaaaaaaaaaaa /* The upper bit of each bit pair */
 
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -408,18 +375,16 @@ void
 visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
 {
 	BlockNumber mapBlock;
+	BlockNumber nvisible = 0;
+	BlockNumber nfrozen = 0;
 
 	/* all_visible must be specified */
 	Assert(all_visible);
 
-	*all_visible = 0;
-	if (all_frozen)
-		*all_frozen = 0;
-
 	for (mapBlock = 0;; mapBlock++)
 	{
 		Buffer		mapBuffer;
-		unsigned char *map;
+		uint64	   *map;
 		int			i;
 
 		/*
@@ -436,17 +401,30 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
 		 * immediately stale anyway if anyone is concurrently setting or
 		 * clearing bits, and we only really need an approximate value.
 		 */
-		map = (unsigned char *) PageGetContents(BufferGetPage(mapBuffer));
+		map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
 
-		for (i = 0; i < MAPSIZE; i++)
+		StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
+						 "unsupported MAPSIZE");
+		if (all_frozen == NULL)
 		{
-			*all_visible += number_of_ones_for_visible[map[i]];
-			if (all_frozen)
-				*all_frozen += number_of_ones_for_frozen[map[i]];
+			for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
+				nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
+		}
+		else
+		{
+			for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
+			{
+				nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
+				nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
+			}
 		}
 
 		ReleaseBuffer(mapBuffer);
 	}
+
+	*all_visible = nvisible;
+	if (all_frozen)
+		*all_frozen = nfrozen;
 }
 
 /*
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 8ce253c88d..6ad2427af6 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -22,6 +22,7 @@
 
 #include "access/hash.h"
 #include "nodes/pg_list.h"
+#include "utils/bitutils.h"
 
 
 #define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
@@ -51,81 +52,6 @@
 
 #define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
 
-
-/*
- * Lookup tables to avoid need for bit-by-bit groveling
- *
- * rightmost_one_pos[x] gives the bit number (0-7) of the rightmost one bit
- * in a nonzero byte value x.  The entry for x=0 is never used.
- *
- * leftmost_one_pos[x] gives the bit number (0-7) of the leftmost one bit in a
- * nonzero byte value x.  The entry for x=0 is never used.
- *
- * number_of_ones[x] gives the number of one-bits (0-8) in a byte value x.
- *
- * We could make these tables larger and reduce the number of iterations
- * in the functions that use them, but bytewise shifts and masks are
- * especially fast on many machines, so working a byte at a time seems best.
- */
-
-static const uint8 rightmost_one_pos[256] = {
-	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
-};
-
-static const uint8 leftmost_one_pos[256] = {
-	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
-	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
-};
-
-static const uint8 number_of_ones[256] = {
-	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
-	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
-};
-
-
 /*
  * bms_copy - make a palloc'd copy of a bitmapset
  */
@@ -607,12 +533,13 @@ bms_singleton_member(const Bitmapset *a)
 			if (result >= 0 || HAS_MULTIPLE_ONES(w))
 				elog(ERROR, "bitmapset has multiple members");
 			result = wordnum * BITS_PER_BITMAPWORD;
-			while ((w & 255) == 0)
-			{
-				w >>= 8;
-				result += 8;
-			}
-			result += rightmost_one_pos[w & 255];
+#if BITS_PER_BITMAPWORD == 32
+			result += pg_rightmost_one32(w);
+#elif BITS_PER_BITMAPWORD == 64
+			result += pg_rightmost_one64(w);
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
 		}
 	}
 	if (result < 0)
@@ -650,12 +577,13 @@ bms_get_singleton_member(const Bitmapset *a, int *member)
 			if (result >= 0 || HAS_MULTIPLE_ONES(w))
 				return false;
 			result = wordnum * BITS_PER_BITMAPWORD;
-			while ((w & 255) == 0)
-			{
-				w >>= 8;
-				result += 8;
-			}
-			result += rightmost_one_pos[w & 255];
+#if BITS_PER_BITMAPWORD == 32
+			result += pg_rightmost_one32(w);
+#elif BITS_PER_BITMAPWORD == 64
+			result += pg_rightmost_one64(w);
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
 		}
 	}
 	if (result < 0)
@@ -681,12 +609,16 @@ bms_num_members(const Bitmapset *a)
 	{
 		bitmapword	w = a->words[wordnum];
 
-		/* we assume here that bitmapword is an unsigned type */
-		while (w != 0)
-		{
-			result += number_of_ones[w & 255];
-			w >>= 8;
-		}
+		/* No need to count the bits in a zero word */
+		if (w == 0)
+			continue;
+#if BITS_PER_BITMAPWORD == 32
+		result += pg_popcount32(w);
+#elif BITS_PER_BITMAPWORD == 64
+		result += pg_popcount64(w);
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
 	}
 	return result;
 }
@@ -1041,12 +973,13 @@ bms_first_member(Bitmapset *a)
 			a->words[wordnum] &= ~w;
 
 			result = wordnum * BITS_PER_BITMAPWORD;
-			while ((w & 255) == 0)
-			{
-				w >>= 8;
-				result += 8;
-			}
-			result += rightmost_one_pos[w & 255];
+#if BITS_PER_BITMAPWORD == 32
+			result += pg_rightmost_one32(w);
+#elif BITS_PER_BITMAPWORD == 64
+			result += pg_rightmost_one64(w);
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
 			return result;
 		}
 	}
@@ -1096,12 +1029,13 @@ bms_next_member(const Bitmapset *a, int prevbit)
 			int			result;
 
 			result = wordnum * BITS_PER_BITMAPWORD;
-			while ((w & 255) == 0)
-			{
-				w >>= 8;
-				result += 8;
-			}
-			result += rightmost_one_pos[w & 255];
+#if BITS_PER_BITMAPWORD == 32
+			result += pg_rightmost_one32(w);
+#elif BITS_PER_BITMAPWORD == 64
+			result += pg_rightmost_one64(w);
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
 			return result;
 		}
 
@@ -1167,16 +1101,13 @@ bms_prev_member(const Bitmapset *a, int prevbit)
 
 		if (w != 0)
 		{
-			int			result;
-			int			shift = BITS_PER_BITMAPWORD - 8;
-
-			result = wordnum * BITS_PER_BITMAPWORD;
-
-			while ((w >> shift) == 0)
-				shift -= 8;
-
-			result += shift + leftmost_one_pos[(w >> shift) & 255];
-			return result;
+#if BITS_PER_BITMAPWORD == 32
+			return wordnum * BITS_PER_BITMAPWORD + pg_leftmost_one32(w);
+#elif BITS_PER_BITMAPWORD == 64
+			return wordnum * BITS_PER_BITMAPWORD + pg_leftmost_one64(w);
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
 		}
 
 		/* in subsequent words, consider all bits */
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 20eead1798..40e71c6b56 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -11,7 +11,7 @@ include $(top_builddir)/src/Makefile.global
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o amutils.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o cryptohashes.o \
+	bitutils.o bool.o cash.o char.o cryptohashes.o \
 	date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o expandedrecord.o \
 	float.o format_type.o formatting.o genfile.o \
diff --git a/src/backend/utils/adt/bitutils.c b/src/backend/utils/adt/bitutils.c
new file mode 100644
index 0000000000..05b686cf33
--- /dev/null
+++ b/src/backend/utils/adt/bitutils.c
@@ -0,0 +1,424 @@
+/*-------------------------------------------------------------------------
+ *
+ * bitutils.c
+ *	  miscellaneous functions for bit-wise operations.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/bitutils.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+#include "utils/bitutils.h"
+
+static const uint8 number_of_ones[256] = {
+	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+};
+
+#if defined(HAVE__BUILTIN_POPCOUNT) || defined(HAVE__BUILTIN_POPCOUNTL)
+
+static bool
+pg_popcount_available(void)
+{
+
+	unsigned int exx[4] = { 0, 0, 0, 0 };
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
+}
+#endif
+
+#ifdef HAVE__BUILTIN_POPCOUNT
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_popcount32_choose(uint32 word)
+{
+	if (pg_popcount_available())
+		pg_popcount32 = pg_popcount32_sse42;
+	else
+		pg_popcount32 = pg_popcount32_slow;
+
+	return pg_popcount32(word);
+}
+
+int
+pg_popcount32_sse42(uint32 word)
+{
+	return __builtin_popcount(word);
+}
+
+
+int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+
+#else
+
+int (*pg_popcount32) (uint32 word) = pg_popcount32_slow;
+
+#endif					/* HAVE__BUILTIN_POPCOUNT */
+
+/*
+ * pg_popcount32_slow
+ *		Return the number of 1 bits set in word
+ */
+int
+pg_popcount32_slow(uint32 word)
+{
+	int result = 0;
+
+	while (word != 0)
+	{
+		result += number_of_ones[word & 255];
+		word >>= 8;
+	}
+
+	return result;
+}
+
+#ifdef HAVE__BUILTIN_POPCOUNTL
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_popcount64_choose(uint64 word)
+{
+	if (pg_popcount_available())
+		pg_popcount64 = pg_popcount64_sse42;
+	else
+		pg_popcount64 = pg_popcount64_slow;
+
+	return pg_popcount64(word);
+}
+
+int
+pg_popcount64_sse42(uint64 word)
+{
+	return __builtin_popcountl(word);
+}
+
+int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+
+#else
+
+int (*pg_popcount64) (uint64 word) = pg_popcount64_slow;
+
+#endif					/* HAVE__BUILTIN_POPCOUNTL */
+
+/*
+ * pg_popcount64_slow
+ *		Return the number of 1 bits set in word
+ */
+int
+pg_popcount64_slow(uint64 word)
+{
+	int result = 0;
+
+	while (word != 0)
+	{
+		result += number_of_ones[word & 255];
+		word >>= 8;
+	}
+
+	return result;
+}
+
+static const uint8 rightmost_one_pos[256] = {
+	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
+};
+
+static const uint8 leftmost_one_pos[256] = {
+	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
+	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
+};
+
+#if defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CTZL) || defined(HAVE__BUILTIN_CLZ) || defined(HAVE__BUILTIN_CLZL)
+
+static bool
+pg_lzcnt_available(void)
+{
+
+	unsigned int exx[4] = { 0, 0, 0, 0 };
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(0x80000001, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, 0x80000001);
+#else
+#error cpuid instruction not available
+#endif
+
+	return (exx[2] & (1 << 5)) != 0;	/* LZCNT */
+}
+#endif
+
+#ifdef HAVE__BUILTIN_CTZ
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_rightmost_one32_choose(uint32 word)
+{
+	if (pg_lzcnt_available())
+		pg_rightmost_one32 = pg_rightmost_one32_abm;
+	else
+		pg_rightmost_one32 = pg_rightmost_one32_slow;
+
+	return pg_rightmost_one32(word);
+}
+
+int
+pg_rightmost_one32_abm(uint32 word)
+{
+	return __builtin_ctz(word);
+}
+
+int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
+
+#else
+
+int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
+
+#endif					/* HAVE__BUILTIN_CTZ */
+
+/*
+ * pg_rightmost_one32_slow
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+int
+pg_rightmost_one32_slow(uint32 word)
+{
+	int result = 0;
+
+	Assert(word != 0);
+
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+
+	return result;
+}
+
+#ifdef HAVE__BUILTIN_CTZL
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_rightmost_one64_choose(uint64 word)
+{
+	if (pg_lzcnt_available())
+		pg_rightmost_one64 = pg_rightmost_one64_abm;
+	else
+		pg_rightmost_one64 = pg_rightmost_one64_slow;
+
+	return pg_rightmost_one64(word);
+}
+
+int
+pg_rightmost_one64_abm(uint64 word)
+{
+	return __builtin_ctzl(word);
+}
+
+int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_choose;
+
+#else
+
+int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_slow;
+
+#endif					/* HAVE__BUILTIN_CTZL */
+
+/*
+ * pg_rightmost_one64_slow
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+int
+pg_rightmost_one64_slow(uint64 word)
+{
+	int result = 0;
+
+	Assert(word != 0);
+
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+
+	return result;
+}
+
+#ifdef HAVE__BUILTIN_CLZ
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_leftmost_one32_choose(uint32 word)
+{
+	if (pg_lzcnt_available())
+		pg_leftmost_one32 = pg_leftmost_one32_abm;
+	else
+		pg_leftmost_one32 = pg_leftmost_one32_slow;
+
+	return pg_leftmost_one32(word);
+}
+
+int
+pg_leftmost_one32_abm(uint32 word)
+{
+	return 31 - __builtin_clz(word);
+}
+
+int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_choose;
+
+#else
+
+int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_slow;
+
+#endif					/* HAVE__BUILTIN_CLZ */
+
+/*
+ * pg_leftmost_one32_slow
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+int
+pg_leftmost_one32_slow(uint32 word)
+{
+	int			shift = 32 - 8;
+
+	Assert(word != 0);
+
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+}
+
+#ifdef HAVE__BUILTIN_CLZL
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_leftmost_one64_choose(uint64 word)
+{
+	if (pg_lzcnt_available())
+		pg_leftmost_one64 = pg_leftmost_one64_abm;
+	else
+		pg_leftmost_one64 = pg_leftmost_one64_slow;
+
+	return pg_leftmost_one64(word);
+}
+
+int
+pg_leftmost_one64_abm(uint64 word)
+{
+	return 63 - __builtin_clzl(word);
+}
+
+int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_choose;
+
+#else
+
+int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_slow;
+
+#endif					/* HAVE__BUILTIN_CLZL */
+
+/*
+ * pg_leftmost_one64_slow
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+int
+pg_leftmost_one64_slow(uint64 word)
+{
+	int			shift = 64 - 8;
+
+	Assert(word != 0);
+
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+}
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 76bd81e9bf..862471266a 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -751,6 +751,24 @@
 /* Define to 1 if your compiler understands __builtin_$op_overflow. */
 #undef HAVE__BUILTIN_OP_OVERFLOW
 
+/* Define to 1 if your compiler understands __builtin_popcount. */
+#undef HAVE__BUILTIN_POPCOUNT
+
+/* Define to 1 if your compiler understands __builtin_popcountl. */
+#undef HAVE__BUILTIN_POPCOUNTL
+
+/* Define to 1 if your compiler understands __builtin_ctz. */
+#undef HAVE__BUILTIN_CTZ
+
+/* Define to 1 if your compiler understands __builtin_ctzl. */
+#undef HAVE__BUILTIN_CTZL
+
+/* Define to 1 if your compiler understands __builtin_clz. */
+#undef HAVE__BUILTIN_CLZ
+
+/* Define to 1 if your compiler understands __builtin_clzl. */
+#undef HAVE__BUILTIN_CLZL
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index de0c4d9997..98639d7f83 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -590,6 +590,24 @@
 /* Define to 1 if your compiler understands __builtin_$op_overflow. */
 /* #undef HAVE__BUILTIN_OP_OVERFLOW */
 
+/* Define to 1 if your compiler understands __builtin_popcount. */
+/* #undef HAVE__BUILTIN_POPCOUNT */
+
+/* Define to 1 if your compiler understands __builtin_popcountl. */
+/* #undef HAVE__BUILTIN_POPCOUNTL */
+
+/* Define to 1 if your compiler understands __builtin_ctz. */
+/* #undef HAVE__BUILTIN_CTZ */
+
+/* Define to 1 if your compiler understands __builtin_ctzl. */
+/* #undef HAVE__BUILTIN_CTZL */
+
+/* Define to 1 if your compiler understands __builtin_clz. */
+/* #undef HAVE__BUILTIN_CLZ */
+
+/* Define to 1 if your compiler understands __builtin_clzl. */
+/* #undef HAVE__BUILTIN_CLZL */
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
 
diff --git a/src/include/utils/bitutils.h b/src/include/utils/bitutils.h
new file mode 100644
index 0000000000..96ce102bc9
--- /dev/null
+++ b/src/include/utils/bitutils.h
@@ -0,0 +1,52 @@
+/*-------------------------------------------------------------------------
+ *
+ * bitutils.h
+ *	  miscellaneous functions for bit-wise operations.
+ *
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * src/include/utils/bitutils.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef BITUTILS_H
+#define BITUTILS_H
+
+extern int pg_popcount32_slow(uint32 word);
+#ifdef HAVE__BUILTIN_POPCOUNT
+extern int pg_popcount32_sse42(uint32 word);
+#endif
+extern int pg_popcount64_slow(uint64 word);
+#ifdef HAVE__BUILTIN_POPCOUNTL
+extern int pg_popcount64_sse42(uint64 word);
+#endif
+
+extern int pg_rightmost_one32_slow(uint32 word);
+#ifdef HAVE__BUILTIN_CTZ
+extern int pg_rightmost_one32_abm(uint32 word);
+#endif
+extern int pg_rightmost_one64_slow(uint64 word);
+#ifdef HAVE__BUILTIN_CTZL
+extern int pg_rightmost_one64_abm(uint64 word);
+#endif
+
+extern int pg_leftmost_one32_slow(uint32 word);
+#ifdef HAVE__BUILTIN_CLZ
+extern int pg_leftmost_one32_abm(uint32 word);
+#endif
+extern int pg_leftmost_one64_slow(uint64 word);
+#ifdef HAVE__BUILTIN_CLZL
+extern int pg_leftmost_one64_abm(uint64 word);
+#endif
+
+
+extern int (*pg_popcount32) (uint32 word);
+extern int (*pg_popcount64) (uint64 word);
+extern int (*pg_rightmost_one32) (uint32 word);
+extern int (*pg_rightmost_one64) (uint64 word);
+extern int (*pg_leftmost_one32) (uint32 word);
+extern int (*pg_leftmost_one64) (uint64 word);
+
+#endif							/* BITUTILS_H */
-- 
2.16.2.windows.1

#2Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: David Rowley (#1)
Re: Using POPCNT and other advanced bit manipulation instructions

On 20/12/2018 18:53, David Rowley wrote:
[...]

Patched:

postgres=# analyze t1;
Time: 680.833 ms
Time: 699.976 ms
Time: 695.608 ms
Time: 676.007 ms
Time: 693.487 ms
Time: 726.982 ms
Time: 677.835 ms
Time: 688.426 ms

Master:

postgres=# analyze t1;
Time: 721.837 ms
Time: 756.035 ms
Time: 734.545 ms
Time: 751.969 ms
Time: 730.140 ms
Time: 724.266 ms
Time: 713.625 ms

(+3.66% avg)

[...]

Looking at the normalized standard deviations, the patched results have
a higher than 5% chance of being better simply by chance.  I suspect
that you have made an improvement, but the statistics are not convincing.

I can supply detailed working if you want.

Cheers,
Gavin

#3David Rowley
david.rowley@2ndquadrant.com
In reply to: Gavin Flower (#2)
Re: Using POPCNT and other advanced bit manipulation instructions

On Thu, 20 Dec 2018 at 20:17, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:

Looking at the normalized standard deviations, the patched results have
a higher than 5% chance of being better simply by chance. I suspect
that you have made an improvement, but the statistics are not convincing.

Yeah, I'd hoped that I could have gotten a better signal to noise
ratio by running the test many times, but you're right. That was on
my laptop. I've run the test again on an AWS instance and the results
seem to be a bit more stable. Same table with 1 int column and 100m
rows. statistics set to 10.

Unpatched

postgres=# analyze a;

Time: 38.248 ms
Time: 35.185 ms
Time: 35.067 ms
Time: 34.879 ms
Time: 34.816 ms
Time: 34.558 ms
Time: 34.722 ms
Time: 34.427 ms
Time: 34.214 ms
Time: 34.301 ms
Time: 35.751 ms
Time: 33.993 ms
Time: 33.880 ms
Time: 33.617 ms
Time: 33.381 ms
Time: 33.326 ms

Patched:

postgres=# analyze a;

Time: 34.421 ms
Time: 33.523 ms
Time: 33.230 ms
Time: 33.678 ms
Time: 32.987 ms
Time: 32.914 ms
Time: 33.165 ms
Time: 32.707 ms
Time: 32.645 ms
Time: 32.814 ms
Time: 32.082 ms
Time: 32.143 ms
Time: 32.310 ms
Time: 31.966 ms
Time: 31.702 ms
Time: 32.089 ms

Avg +5.72%, Median +5.29%
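
The summary line above can be re-derived from the raw timings. The following is a minimal sketch in C, not part of the patch: the helper names (mean, median, speedup_pct) are made up for illustration, and it assumes the percentages are computed as unpatched time divided by patched time, minus one. Under that assumption the sixteen samples per run give roughly +5.7% for the mean and +5.3% for the median.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles */
static int
cmp_double(const void *a, const void *b)
{
    double      x = *(const double *) a;
    double      y = *(const double *) b;

    return (x > y) - (x < y);
}

/* arithmetic mean of n samples */
double
mean(const double *v, int n)
{
    double      sum = 0.0;

    assert(n > 0);
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum / n;
}

/* median of n samples; sorts a private copy so the input is left untouched */
double
median(const double *v, int n)
{
    double     *tmp;
    double      result;

    assert(n > 0);
    tmp = malloc(n * sizeof(double));
    assert(tmp != NULL);
    memcpy(tmp, v, n * sizeof(double));
    qsort(tmp, n, sizeof(double), cmp_double);
    result = (n % 2) ? tmp[n / 2] : (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;
    free(tmp);
    return result;
}

/* how much longer "before" takes than "after", in percent (assumed formula) */
double
speedup_pct(double before, double after)
{
    return (before / after - 1.0) * 100.0;
}

/* timings copied from the two runs above */
const double unpatched[16] = {
    38.248, 35.185, 35.067, 34.879, 34.816, 34.558, 34.722, 34.427,
    34.214, 34.301, 35.751, 33.993, 33.880, 33.617, 33.381, 33.326
};

const double patched[16] = {
    34.421, 33.523, 33.230, 33.678, 32.987, 32.914, 33.165, 32.707,
    32.645, 32.814, 32.082, 32.143, 32.310, 31.966, 31.702, 32.089
};
```

Calling speedup_pct(mean(unpatched, 16), mean(patched, 16)) gives the average figure, and the same call with median() gives the median figure.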

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4Dmitry Dolgov
9erthalion6@gmail.com
In reply to: David Rowley (#3)
Re: Using POPCNT and other advanced bit manipulation instructions

On Thu, Dec 20, 2018 at 6:53 AM David Rowley <david.rowley@2ndquadrant.com> wrote:

Thomas mentions in [1], to get the GCC to use the POPCNT instruction,
we must pass -mpopcnt in the build flags. After doing a bit of
research, I found [2] which mentions that some compilers have some
pattern matching code to allow the popcnt instruction to be used even
without a macro such as __builtin_popcount(). I believe I've
correctly written the run-time test to skip using the new popcnt
function, but if there's any code around that might cause the compiler
to use the popcnt instruction from pattern matching, then that might
cause problems.

I've checked for Clang 6, it turns out that indeed it generates popcnt without
any macro, but only in one place for bloom_prop_bits_set. After looking at this
function it seems that it would be beneficial to actually use popcnt there too.
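
For readers unfamiliar with the pattern matching being discussed, here is a small illustrative sketch in C of the kind of loop the linked blog post describes. The function name count_bits_loop is hypothetical and this is not the actual bloomfilter.c code. No builtin appears anywhere, yet an optimizer that recognizes the idiom may replace the whole loop with a single popcnt instruction when -mpopcnt is in effect, which is why the flag leaking into extension builds is a concern.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count set bits by repeatedly clearing the lowest one.  Written in plain
 * C with no builtin; whether the compiler turns it into popcnt depends on
 * the compiler version, optimization level and target flags.
 */
int
count_bits_loop(uint64_t x)
{
    int         count = 0;

    while (x != 0)
    {
        x &= x - 1;             /* clear the lowest set bit */
        count++;
    }

    /* a 64-bit word cannot have more than 64 bits set */
    assert(count <= 64);
    return count;
}
```

Compiling such a function with -S, with and without -mpopcnt, and searching the assembly for "popcnt" is one way to check what a given compiler does.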

I am able to measure performance gains from the patch. In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

I've tested it too a bit, and got similar results when the patched version is
slightly faster. But then I wonder if popcnt is the best solution here, since
after some short research I found a paper [1], where the authors claim that:

Maybe surprisingly, we show that a vectorized approach using SIMD
instructions can be twice as fast as using the dedicated instructions on
recent Intel processors.

[1]: https://arxiv.org/pdf/1611.07612.pdf

#5Jose Luis Tallon
jltallon@adv-solutions.net
In reply to: David Rowley (#1)
Re: Using POPCNT and other advanced bit manipulation instructions

On 20/12/18 6:53, David Rowley wrote:

Back in 2016 [1] there was some discussion about using the POPCNT
instruction to improve the performance of counting the number of bits
set in a word. Improving this helps various cases, such as
bms_num_members and also things like counting the allvisible and
frozen pages in the visibility map.

[snip]

I've put together a very rough patch to implement using POPCNT and the
leading and trailing 0-bit instructions to improve the performance of
bms_next_member() and bms_prev_member(). The correct function should
be determined on the first call to each function by way of setting a
function pointer variable to the most suitable supported
implementation. I've not yet gone through and removed all the
number_of_ones[] arrays to replace with a pg_popcount*() call.
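
The first-call selection described in the paragraph above can be sketched in isolation. This is a simplified illustration, not the patch itself: the names popcount32, popcount32_fast, popcount32_portable and chooser_calls are hypothetical, and cpu_has_popcnt() is a placeholder that always answers "no" so the sketch stays portable, where the patch queries CPUID instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static int  popcount32_choose(uint32_t word);

/* Callers go through this pointer; it starts out at the chooser. */
int         (*popcount32) (uint32_t word) = popcount32_choose;

/* Counts chooser invocations, only to make the one-time behaviour visible. */
int         chooser_calls = 0;

/* Hypothetical CPU probe; a real build would inspect CPUID here. */
static bool
cpu_has_popcnt(void)
{
    return false;
}

/* Portable fallback: clear the lowest set bit until none remain. */
int
popcount32_portable(uint32_t word)
{
    int         result = 0;

    while (word != 0)
    {
        word &= word - 1;
        result++;
    }
    return result;
}

/* Fast path: use the compiler builtin where one exists. */
int
popcount32_fast(uint32_t word)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(word);
#else
    return popcount32_portable(word);
#endif
}

/* Runs on the first call only: pick an implementation, then forward. */
static int
popcount32_choose(uint32_t word)
{
    chooser_calls++;
    popcount32 = cpu_has_popcnt() ? popcount32_fast : popcount32_portable;

    /* the pointer must no longer target the chooser, or we would recurse */
    assert(popcount32 != popcount32_choose);
    return popcount32(word);
}
```

After the first popcount32() call the pointer targets the selected implementation directly, so later calls cost one indirect call and never re-run the probe.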

IMVHO: Please do not disregard potential optimization by the compiler
around those calls.. o_0  That might explain the reduced performance
improvement observed.

Not that I can see any obvious alternative to your implementation right
away ...

That
seems to have mostly been done in Thomas' patch [3], part of which
I've used for the visibilitymap.c code changes. If this patch proves
to be possible, then I'll look at including the other changes Thomas
made in his patch too.

What I'm really looking for by posting now are reasons why we can't do
this. I'm also interested in getting some testing done on older
machines, particularly machines with processors that are from before
2007, both AMD and Intel.

I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test
on. Feel free to toss me some test code.

cpuinfo flags:    fpu de tsc msr pae mce cx8 apic mca cmov pat clflush
mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow
rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm
cmp_legacy 3dnowprefetch vmmcall

2007-2008 seems to be around the time both
AMD and Intel added support for POPCNT and LZCNT, going by [4].

Thanks

#6David Rowley
david.rowley@2ndquadrant.com
In reply to: Dmitry Dolgov (#4)
Re: Using POPCNT and other advanced bit manipulation instructions

Thanks for looking at this.

On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I've checked for Clang 6, it turns out that indeed it generates popcnt without
any macro, but only in one place for bloom_prop_bits_set. After looking at this
function it seems that it would be beneficial to actually use popcnt there too.

Yeah, that's the pattern that's mentioned in
https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
It would need to be changed to call the popcount function. Its
existence makes me a bit more worried that some extension could be
using a similar pattern and end up being compiled with -mpopcnt due to
pg_config having that CFLAG. That's all fine until the binary makes
its way over to a machine without that instruction.

I am able to measure performance gains from the patch. In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

I've tested it too a bit, and got similar results when the patched version is
slightly faster. But then I wonder if popcnt is the best solution here, since
after some short research I found a paper [1], where authors claim that:

Maybe surprisingly, we show that a vectorized approach using SIMD
instructions can be twice as fast as using the dedicated instructions on
recent Intel processors.

[1]: https://arxiv.org/pdf/1611.07612.pdf

I can't imagine that using the number_of_ones[] array processing
8-bits at a time would be slower than POPCNT though.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7David Rowley
david.rowley@2ndquadrant.com
In reply to: Jose Luis Tallon (#5)
Re: Using POPCNT and other advanced bit manipulation instructions

On Thu, 20 Dec 2018 at 23:59, Jose Luis Tallon
<jltallon@adv-solutions.net> wrote:

IMVHO: Please do not disregard potential optimization by the compiler
around those calls.. o_0 That might explain the reduced performance
improvement observed.

It was a speedup that I measured. Did you see something else?

What I'm really looking for by posting now are reasons why we can't do
this. I'm also interested in getting some testing done on older
machines, particularly machines with processors that are from before
2007, both AMD and Intel.

I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test
on. Feel free to toss me some test code.

cpuinfo flags: fpu de tsc msr pae mce cx8 apic mca cmov pat clflush
mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow
rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm
cmp_legacy 3dnowprefetch vmmcall

2007-2008 seems to be around the time both
AMD and Intel added support for POPCNT and LZCNT, going by [4].

It would be really good if you could git clone a copy of master and
patch it with the patch from earlier in the thread and see if you
encounter any issues running make check-world.

I'm a bit uncertain if passing -mpopcnt to a recent gcc would result
in the popcnt instruction being compiled in if the machine doing the
compiling had no support for that.

Likely it would be simple to test that with:

echo "int main(int argc, char **argv) { return
__builtin_popcount(argc); }" > popcnt.c && gcc popcnt.c -S -mpopcnt &&
cat popcnt.s | grep pop

I see a "popcntl" in there on my machine.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8Dmitry Dolgov
9erthalion6@gmail.com
In reply to: David Rowley (#6)
Re: Using POPCNT and other advanced bit manipulation instructions

On Fri, Jan 4, 2019 at 1:38 PM David Rowley <david.rowley@2ndquadrant.com> wrote:

On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I've checked for Clang 6, it turns out that indeed it generates popcnt without
any macro, but only in one place for bloom_prop_bits_set. After looking at this
function it seems that it would be beneficial to actually use popcnt there too.

Yeah, that's the pattern that's mentioned in
https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
It would need to be changed to call the popcount function. Its
existence makes me a bit more worried that some extension could be
using a similar pattern and end up being compiled with -mpopcnt due to
pg_config having that CFLAG. That's all fine until the binary makes
its way over to a machine without that instruction.

It surprises me that it's not obvious how to disable this feature for
clang. I guess one should be able to turn it off by invoking opt manually:

clang -S -mpopcnt -emit-llvm *.c
opt -S -mattr=+popcnt <all the options without -loop-idiom> *.ll
llc -mattr=+popcnt *.optimized.ll
clang -mpopcnt *optimized.s

But for some reason this doesn't work for me (popcnt is not appearing in
the first place).

I am able to measure performance gains from the patch. In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

I've tested it too a bit, and got similar results when the patched version is
slightly faster. But then I wonder if popcnt is the best solution here, since
after some short research I found a paper [1], where authors claim that:

Maybe surprisingly, we show that a vectorized approach using SIMD
instructions can be twice as fast as using the dedicated instructions on
recent Intel processors.

[1]: https://arxiv.org/pdf/1611.07612.pdf

I can't imagine that using the number_of_ones[] array processing
8-bits at a time would be slower than POPCNT though.

Yeah, probably you're right. If I understand correctly even with the lookup
table in the cache the access would be a bit slower than a POPCNT instruction.

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#1)
Re: Using POPCNT and other advanced bit manipulation instructions

I only have cosmetic suggestions for this patch. For one thing, I think
the .c file should be in src/port and its header should be in
src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h.
For another, I think the arrangement of all those "ifdef
HAVE_THIS_OR_THAT" in the bitutils.c file is a bit hard to read. I'd
lay them out like this:

#ifdef HAVE__BUILTIN_CTZ
int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
#else
int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
#endif

#ifdef HAVE__BUILTIN_CTZ
/*
* This gets called on the first call. It replaces the function pointer
* so that subsequent calls are routed directly to the chosen implementation.
*/
static int
pg_rightmost_one32_choose(uint32 word)
{
...

(You need declarations for the "choose" variants at the top of the file,
but that seems okay.)

Finally, the part in bitmapset.c is repetitive on the #ifdefs; I'd just
put at the top of the file something like

#if bms are 32 bits
#define pg_rightmost_one(x) pg_rightmost_one32(x)
#define pg_popcount(x) pg_popcount32(x)
#elif they are 64 bits
#define ...
#else
#error ...
#endif

This way, each place that uses the functions does not need the ifdefs.
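
To make the suggestion concrete, here is a self-contained sketch in C of how such width-selecting macros could look. It is only an illustration: the bmw_* macro names, the demo_* stand-in functions and count_members() are hypothetical, and BITS_PER_BITMAPWORD is given a default of 64 purely so the example compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to come from the build configuration; defaulted for the sketch. */
#ifndef BITS_PER_BITMAPWORD
#define BITS_PER_BITMAPWORD 64
#endif

/* Stand-ins for the 32-bit and 64-bit functions the patch provides. */
int
demo_popcount32(uint32_t w)
{
    int         n = 0;

    for (; w != 0; w &= w - 1)
        n++;
    return n;
}

int
demo_popcount64(uint64_t w)
{
    int         n = 0;

    for (; w != 0; w &= w - 1)
        n++;
    return n;
}

int
demo_rightmost_one32(uint32_t w)
{
    int         pos = 0;

    assert(w != 0);
    while ((w & 1) == 0)
    {
        w >>= 1;
        pos++;
    }
    return pos;
}

int
demo_rightmost_one64(uint64_t w)
{
    int         pos = 0;

    assert(w != 0);
    while ((w & 1) == 0)
    {
        w >>= 1;
        pos++;
    }
    return pos;
}

/* The width decision is made once, here, instead of at every call site. */
#if BITS_PER_BITMAPWORD == 32
typedef uint32_t bitmapword;
#define bmw_popcount(w)         demo_popcount32(w)
#define bmw_rightmost_one(w)    demo_rightmost_one32(w)
#elif BITS_PER_BITMAPWORD == 64
typedef uint64_t bitmapword;
#define bmw_popcount(w)         demo_popcount64(w)
#define bmw_rightmost_one(w)    demo_rightmost_one64(w)
#else
#error "invalid BITS_PER_BITMAPWORD"
#endif

/* A caller needs no #ifdef of its own. */
int
count_members(const bitmapword *words, int nwords)
{
    int         result = 0;

    assert(nwords >= 0);
    for (int i = 0; i < nwords; i++)
        result += bmw_popcount(words[i]);
    return result;
}
```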

Other than those minor changes, I think we should just get this pushed
and see what the buildfarm thinks. In the words of a famous PG hacker:
if a platform ain't in the buildfarm, we don't support it.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#9)
Re: Using POPCNT and other advanced bit manipulation instructions

On Thu, Jan 31, 2019 at 07:45:02PM -0300, Alvaro Herrera wrote:

Other than those minor changes, I think we should just get this pushed
and see what the buildfarm thinks. In the words of a famous PG hacker:
if a platform ain't in the buildfarm, we don't support it.

Moved to next CF, waiting on author. I think that this needs more
reviews.
--
Michael

#11David Rowley
david.rowley@2ndquadrant.com
In reply to: Alvaro Herrera (#9)
1 attachment(s)
Re: Using POPCNT and other advanced bit manipulation instructions

Thanks for looking at this.

On Fri, 1 Feb 2019 at 11:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I only have cosmetic suggestions for this patch. For one thing, I think
the .c file should be in src/port and its header should be in
src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h.

I've moved the code into src/port and renamed the file to pg_bitutils.c

For another, I think the arrangement of all those "ifdef
HAVE_THIS_OR_THAT" in the bitutils.c file is a bit hard to read. I'd
lay them out like this:

I've made this change too, although when doing it I realised that I
had forgotten to include the check for CPUID. It's possible that does
not exist but POPCNT does, I guess. This has made the #ifs a bit more
complex.

Finally, the part in bitmapset.c is repetitive on the #ifdefs; I'd just
put at the top of the file something like

Yeah, agreed. Much neater that way.

Other than those minor changes, I think we should just get this pushed
and see what the buildfarm thinks. In the words of a famous PG hacker:
if a platform ain't in the buildfarm, we don't support it.

I also made a number of other changes to the patch.

1. The patch now only uses the -mpopcnt CFLAG for pg_bitutils.c. I
thought this was important so we don't expose that flag in pg_config
and possibly end up building extensions with popcnt instructions, which
might not be portable to older hardware.
2. Wrote a new pg_popcnt function that accepts an array of bytes and a
size variable. This seems useful for the bloomfilter use.
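
As an illustration of item 2, a byte-array popcount could be shaped roughly like the sketch below. This is not the function from the attached patch: the names popcount_bytes and word_popcount are hypothetical, and a plain bit-clearing loop stands in for the per-word count. It walks the buffer eight bytes at a time and then handles the tail byte by byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a 64-bit popcount: clear the lowest set bit until zero. */
static int
word_popcount(uint64_t w)
{
    int         n = 0;

    for (; w != 0; w &= w - 1)
        n++;
    return n;
}

/*
 * Count the set bits in an arbitrary byte buffer (hypothetical helper).
 * Whole 8-byte chunks are copied into a uint64_t with memcpy so that the
 * buffer does not need to be aligned; leftover bytes are counted singly.
 */
uint64_t
popcount_bytes(const unsigned char *buf, size_t len)
{
    uint64_t    total = 0;

    assert(buf != NULL || len == 0);

    while (len >= sizeof(uint64_t))
    {
        uint64_t    word;

        memcpy(&word, buf, sizeof(word));
        total += word_popcount(word);
        buf += sizeof(word);
        len -= sizeof(word);
    }

    while (len > 0)
    {
        total += word_popcount(*buf);
        buf++;
        len--;
    }

    return total;
}
```

Because only the number of set bits matters, the byte order of each 8-byte chunk has no effect on the result.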

There are still various number_of_ones[] arrays around the codebase.
These exist in tsgistidx.c, _intbig_gist.c and _ltree_gist.c. It
would be nice to get rid of those too, but one of the usages in each
of those 3 files requires XORing with another bit array before
counting the bits. I thought about maybe writing a pop_count_xor()
function that accepts 2 byte arrays and a length parameter, but it
seems a bit special case, so I didn't.
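
For reference, the pop_count_xor() idea mentioned above would look something like the following sketch. It is purely hypothetical (no such function is in the patch), the names popcount_xor_bytes and byte_popcount are made up here, and the per-byte loop is the simplest possible form rather than a tuned one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a single byte. */
static int
byte_popcount(unsigned char b)
{
    int         n = 0;

    for (; b != 0; b &= (unsigned char) (b - 1))
        n++;
    return n;
}

/*
 * Hypothetical helper: count the bits that differ between two equally
 * sized byte arrays, i.e. the popcount of (a XOR b), without building
 * the XORed array first.
 */
uint64_t
popcount_xor_bytes(const unsigned char *a, const unsigned char *b, size_t len)
{
    uint64_t    total = 0;

    assert((a != NULL && b != NULL) || len == 0);

    for (size_t i = 0; i < len; i++)
        total += byte_popcount((unsigned char) (a[i] ^ b[i]));

    return total;
}
```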

Another thing I wasn't sure of was whether I should have
bms_num_members() just call pg_popcount(). It might be worth
benchmarking to see what's faster. My thinking is that pg_popcount
will inline the pg_popcount64() call so it would mean a single
function call rather than one for each bitmapword in the set.

I've compiled and run make check-world on Linux with GCC 7.3 and
clang 6.0. I've also tested on MSVC to ensure I didn't break Windows.
It would be good to get a few more people to compile it and run the
tests.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v2-0001-Add-basic-support-for-using-the-POPCNT-and-SSE4.2.patch (application/octet-stream)
From 64818390701a3787266fc95a39aaaf787cc906e2 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 20 Dec 2018 17:46:35 +1300
Subject: [PATCH v2] Add basic support for using the POPCNT and SSE4.2s LZCNT
 opcodes

These opcodes have been around in the AMD world since 2007, and 2008 in
the case of Intel. They're supported in GCC and Clang via some __builtin
macros.  The opcodes may be unavailable during runtime, in which case we
fall back on a C-based implementation of the code.  In order to get the
POPCNT instruction we must pass the -mpopcnt option to the compiler.  We
do this only for the pg_bitutils.c file.

David Rowley (with fragments taken from a patch by Thomas Munro)
---
 config/c-compiler.m4                    | 116 +++++++
 configure                               | 155 ++++++++++
 configure.in                            |   8 +
 src/Makefile.global.in                  |   1 +
 src/backend/access/heap/visibilitymap.c |  73 ++---
 src/backend/lib/bloomfilter.c           |  15 +-
 src/backend/nodes/bitmapset.c           | 131 ++------
 src/include/pg_config.h.in              |  18 ++
 src/include/pg_config.h.win32           |  18 ++
 src/include/port/pg_bitutils.h          |  26 ++
 src/port/Makefile                       |   5 +-
 src/port/pg_bitutils.c                  | 516 ++++++++++++++++++++++++++++++++
 src/tools/msvc/Mkvcbuild.pm             |   1 +
 13 files changed, 914 insertions(+), 169 deletions(-)
 create mode 100644 src/include/port/pg_bitutils.h
 create mode 100644 src/port/pg_bitutils.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index af2dea1c2a..7cdcaee0b2 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -378,6 +378,122 @@ fi])# PGAC_C_BUILTIN_OP_OVERFLOW
 
 
 
+# PGAC_C_BUILTIN_POPCOUNT
+# -------------------------
+AC_DEFUN([PGAC_C_BUILTIN_POPCOUNT],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_popcount])])dnl
+AC_CACHE_CHECK([for __builtin_popcount], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mpopcnt"
+AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_popcount(255);])],
+[Ac_cachevar=yes],
+[Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="-mpopcnt"
+AC_DEFINE(HAVE__BUILTIN_POPCOUNT, 1,
+          [Define to 1 if your compiler understands __builtin_popcount.])
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_C_BUILTIN_POPCOUNT
+
+
+
+# PGAC_C_BUILTIN_POPCOUNTL
+# -------------------------
+AC_DEFUN([PGAC_C_BUILTIN_POPCOUNTL],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_popcountl])])dnl
+AC_CACHE_CHECK([for __builtin_popcountl], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mpopcnt"
+AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_popcountl(255);])],
+[Ac_cachevar=yes],
+[Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="-mpopcnt"
+AC_DEFINE(HAVE__BUILTIN_POPCOUNTL, 1,
+          [Define to 1 if your compiler understands __builtin_popcountl.])
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_C_BUILTIN_POPCOUNTL
+
+
+
+# PGAC_C_BUILTIN_CTZ
+# -------------------------
+# Check if the C compiler understands __builtin_ctz(),
+# and define HAVE__BUILTIN_CTZ if so.
+AC_DEFUN([PGAC_C_BUILTIN_CTZ],
+[AC_CACHE_CHECK(for __builtin_ctz, pgac_cv__builtin_ctz,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_ctz(256);]
+)],
+[pgac_cv__builtin_ctz=yes],
+[pgac_cv__builtin_ctz=no])])
+if test x"$pgac_cv__builtin_ctz" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CTZ, 1,
+          [Define to 1 if your compiler understands __builtin_ctz.])
+fi])# PGAC_C_BUILTIN_CTZ
+
+
+
+# PGAC_C_BUILTIN_CTZL
+# -------------------------
+# Check if the C compiler understands __builtin_ctzl(),
+# and define HAVE__BUILTIN_CTZL if so.
+AC_DEFUN([PGAC_C_BUILTIN_CTZL],
+[AC_CACHE_CHECK(for __builtin_ctzl, pgac_cv__builtin_ctzl,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_ctzl(256);]
+)],
+[pgac_cv__builtin_ctzl=yes],
+[pgac_cv__builtin_ctzl=no])])
+if test x"$pgac_cv__builtin_ctzl" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CTZL, 1,
+          [Define to 1 if your compiler understands __builtin_ctzl.])
+fi])# PGAC_C_BUILTIN_CTZL
+
+
+
+# PGAC_C_BUILTIN_CLZ
+# -------------------------
+# Check if the C compiler understands __builtin_clz(),
+# and define HAVE__BUILTIN_CLZ if so.
+AC_DEFUN([PGAC_C_BUILTIN_CLZ],
+[AC_CACHE_CHECK(for __builtin_clz, pgac_cv__builtin_clz,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_clz(256);]
+)],
+[pgac_cv__builtin_clz=yes],
+[pgac_cv__builtin_clz=no])])
+if test x"$pgac_cv__builtin_clz" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CLZ, 1,
+          [Define to 1 if your compiler understands __builtin_clz.])
+fi])# PGAC_C_BUILTIN_CLZ
+
+
+
+# PGAC_C_BUILTIN_CLZL
+# -------------------------
+# Check if the C compiler understands __builtin_clzl(),
+# and define HAVE__BUILTIN_CLZL if so.
+AC_DEFUN([PGAC_C_BUILTIN_CLZL],
+[AC_CACHE_CHECK(for __builtin_clzl, pgac_cv__builtin_clzl,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_clzl(256);]
+)],
+[pgac_cv__builtin_clzl=yes],
+[pgac_cv__builtin_clzl=no])])
+if test x"$pgac_cv__builtin_clzl" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CLZL, 1,
+          [Define to 1 if your compiler understands __builtin_clzl.])
+fi])# PGAC_C_BUILTIN_CLZL
+
+
+
 # PGAC_C_BUILTIN_UNREACHABLE
 # --------------------------
 # Check if the C compiler understands __builtin_unreachable(),
diff --git a/configure b/configure
index ddb3c8b1ba..0e2f8da274 100755
--- a/configure
+++ b/configure
@@ -651,6 +651,7 @@ CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
 have_win32_dbghelp
 LIBOBJS
+CFLAGS_POPCNT
 UUID_LIBS
 LDAP_LIBS_BE
 LDAP_LIBS_FE
@@ -14057,6 +14058,158 @@ if test x"$pgac_cv__builtin_constant_p" = xyes ; then
 
 $as_echo "#define HAVE__BUILTIN_CONSTANT_P 1" >>confdefs.h
 
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_popcount" >&5
+$as_echo_n "checking for __builtin_popcount... " >&6; }
+if ${pgac_cv_popcount+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mpopcnt"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_popcount(255);
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_popcount=yes
+else
+  pgac_cv_popcount=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_popcount" >&5
+$as_echo "$pgac_cv_popcount" >&6; }
+if test x"$pgac_cv_popcount" = x"yes"; then
+  CFLAGS_POPCNT="-mpopcnt"
+
+$as_echo "#define HAVE__BUILTIN_POPCOUNT 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_popcountl" >&5
+$as_echo_n "checking for __builtin_popcountl... " >&6; }
+if ${pgac_cv_popcountl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mpopcnt"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_popcountl(255);
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_popcountl=yes
+else
+  pgac_cv_popcountl=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_popcountl" >&5
+$as_echo "$pgac_cv_popcountl" >&6; }
+if test x"$pgac_cv_popcountl" = x"yes"; then
+  CFLAGS_POPCNT="-mpopcnt"
+
+$as_echo "#define HAVE__BUILTIN_POPCOUNTL 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz" >&5
+$as_echo_n "checking for __builtin_ctz... " >&6; }
+if ${pgac_cv__builtin_ctz+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_ctz(256);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_ctz=yes
+else
+  pgac_cv__builtin_ctz=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_ctz" >&5
+$as_echo "$pgac_cv__builtin_ctz" >&6; }
+if test x"$pgac_cv__builtin_ctz" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_CTZ 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctzl" >&5
+$as_echo_n "checking for __builtin_ctzl... " >&6; }
+if ${pgac_cv__builtin_ctzl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_ctzl(256);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_ctzl=yes
+else
+  pgac_cv__builtin_ctzl=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_ctzl" >&5
+$as_echo "$pgac_cv__builtin_ctzl" >&6; }
+if test x"$pgac_cv__builtin_ctzl" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_CTZL 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clz" >&5
+$as_echo_n "checking for __builtin_clz... " >&6; }
+if ${pgac_cv__builtin_clz+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_clz(256);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_clz=yes
+else
+  pgac_cv__builtin_clz=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clz" >&5
+$as_echo "$pgac_cv__builtin_clz" >&6; }
+if test x"$pgac_cv__builtin_clz" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_CLZ 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clzl" >&5
+$as_echo_n "checking for __builtin_clzl... " >&6; }
+if ${pgac_cv__builtin_clzl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static int x = __builtin_clzl(256);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_clzl=yes
+else
+  pgac_cv__builtin_clzl=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clzl" >&5
+$as_echo "$pgac_cv__builtin_clzl" >&6; }
+if test x"$pgac_cv__builtin_clzl" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_CLZL 1" >>confdefs.h
+
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_unreachable" >&5
 $as_echo_n "checking for __builtin_unreachable... " >&6; }
@@ -14575,6 +14728,8 @@ $as_echo "#define LOCALE_T_IN_XLOCALE 1" >>confdefs.h
 
 fi
 
+
+
 # MSVC doesn't cope well with defining restrict to __restrict, the
 # spelling it understands, because it conflicts with
 # __declspec(restrict). Therefore we define pg_restrict to the
diff --git a/configure.in b/configure.in
index 3d8888805c..e4af9aa1a8 100644
--- a/configure.in
+++ b/configure.in
@@ -1482,6 +1482,12 @@ PGAC_C_BUILTIN_BSWAP16
 PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_BSWAP64
 PGAC_C_BUILTIN_CONSTANT_P
+PGAC_C_BUILTIN_POPCOUNT
+PGAC_C_BUILTIN_POPCOUNTL
+PGAC_C_BUILTIN_CTZ
+PGAC_C_BUILTIN_CTZL
+PGAC_C_BUILTIN_CLZ
+PGAC_C_BUILTIN_CLZL
 PGAC_C_BUILTIN_UNREACHABLE
 PGAC_C_COMPUTED_GOTO
 PGAC_STRUCT_TIMEZONE
@@ -1496,6 +1502,8 @@ AC_TYPE_LONG_LONG_INT
 
 PGAC_TYPE_LOCALE_T
 
+AC_SUBST(CFLAGS_POPCNT)
+
 # MSVC doesn't cope well with defining restrict to __restrict, the
 # spelling it understands, because it conflicts with
 # __declspec(restrict). Therefore we define pg_restrict to the
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 6852853041..bd515376b3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -260,6 +260,7 @@ CXX = @CXX@
 CFLAGS = @CFLAGS@
 CFLAGS_VECTOR = @CFLAGS_VECTOR@
 CFLAGS_SSE42 = @CFLAGS_SSE42@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
 CFLAGS_ARMV8_CRC32C = @CFLAGS_ARMV8_CRC32C@
 CXXFLAGS = @CXXFLAGS@
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 931ae81fd6..9657cd0a63 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -89,12 +89,12 @@
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
 
-
 /*#define TRACE_VISIBILITYMAP */
 
 /*
@@ -115,43 +115,9 @@
 #define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
 #define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
 
-/* tables for fast counting of set bits for visible and frozen */
-static const uint8 number_of_ones_for_visible[256] = {
-	0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
-	0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
-	1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
-	2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
-};
-static const uint8 number_of_ones_for_frozen[256] = {
-	0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
-	0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
-	0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
-	2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
-	2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
-	2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
-};
+/* Masks for counting bits in the visibility map. */
+#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each bit pair */
+#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each bit pair */
 
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -408,18 +374,16 @@ void
 visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
 {
 	BlockNumber mapBlock;
+	BlockNumber nvisible = 0;
+	BlockNumber nfrozen = 0;
 
 	/* all_visible must be specified */
 	Assert(all_visible);
 
-	*all_visible = 0;
-	if (all_frozen)
-		*all_frozen = 0;
-
 	for (mapBlock = 0;; mapBlock++)
 	{
 		Buffer		mapBuffer;
-		unsigned char *map;
+		uint64	   *map;
 		int			i;
 
 		/*
@@ -436,17 +400,30 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
 		 * immediately stale anyway if anyone is concurrently setting or
 		 * clearing bits, and we only really need an approximate value.
 		 */
-		map = (unsigned char *) PageGetContents(BufferGetPage(mapBuffer));
+		map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
 
-		for (i = 0; i < MAPSIZE; i++)
+		StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
+						 "unsupported MAPSIZE");
+		if (all_frozen == NULL)
+		{
+			for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
+				nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
+		}
+		else
 		{
-			*all_visible += number_of_ones_for_visible[map[i]];
-			if (all_frozen)
-				*all_frozen += number_of_ones_for_frozen[map[i]];
+			for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
+			{
+				nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
+				nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
+			}
 		}
 
 		ReleaseBuffer(mapBuffer);
 	}
+
+	*all_visible = nvisible;
+	if (all_frozen)
+		*all_frozen = nfrozen;
 }
 
 /*
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
index 1e907cabc6..e2c1276f21 100644
--- a/src/backend/lib/bloomfilter.c
+++ b/src/backend/lib/bloomfilter.c
@@ -37,6 +37,7 @@
 
 #include "access/hash.h"
 #include "lib/bloomfilter.h"
+#include "port/pg_bitutils.h"
 
 #define MAX_HASH_FUNCS		10
 
@@ -187,19 +188,7 @@ double
 bloom_prop_bits_set(bloom_filter *filter)
 {
 	int			bitset_bytes = filter->m / BITS_PER_BYTE;
-	uint64		bits_set = 0;
-	int			i;
-
-	for (i = 0; i < bitset_bytes; i++)
-	{
-		unsigned char byte = filter->bitset[i];
-
-		while (byte)
-		{
-			bits_set++;
-			byte &= (byte - 1);
-		}
-	}
+	uint64		bits_set = pg_popcount((char *) filter->bitset, bitset_bytes);
 
 	return bits_set / (double) filter->m;
 }
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 62cd00903c..d0380abf3e 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -22,6 +22,7 @@
 
 #include "access/hash.h"
 #include "nodes/pg_list.h"
+#include "port/pg_bitutils.h"
 
 
 #define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
@@ -51,79 +52,23 @@
 
 #define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
 
+/* Set the bitwise macro version we must use based on the bitmapword size */
+#if BITS_PER_BITMAPWORD == 32
 
-/*
- * Lookup tables to avoid need for bit-by-bit groveling
- *
- * rightmost_one_pos[x] gives the bit number (0-7) of the rightmost one bit
- * in a nonzero byte value x.  The entry for x=0 is never used.
- *
- * leftmost_one_pos[x] gives the bit number (0-7) of the leftmost one bit in a
- * nonzero byte value x.  The entry for x=0 is never used.
- *
- * number_of_ones[x] gives the number of one-bits (0-8) in a byte value x.
- *
- * We could make these tables larger and reduce the number of iterations
- * in the functions that use them, but bytewise shifts and masks are
- * especially fast on many machines, so working a byte at a time seems best.
- */
+#define bmw_popcount(w)			pg_popcount32(w)
+#define bmw_rightmost_one(w)	pg_rightmost_one32(w)
+#define bmw_leftmost_one(w)		pg_leftmost_one32(w)
+
+#elif BITS_PER_BITMAPWORD == 64
+
+#define bmw_popcount(w)			pg_popcount64(w)
+#define bmw_rightmost_one(w)	pg_rightmost_one64(w)
+#define bmw_leftmost_one(w)		pg_leftmost_one64(w)
+
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
 
-static const uint8 rightmost_one_pos[256] = {
-	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
-};
-
-static const uint8 leftmost_one_pos[256] = {
-	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
-	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
-};
-
-static const uint8 number_of_ones[256] = {
-	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
-	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
-};
 
 
 /*
@@ -607,12 +552,7 @@ bms_singleton_member(const Bitmapset *a)
 			if (result >= 0 || HAS_MULTIPLE_ONES(w))
 				elog(ERROR, "bitmapset has multiple members");
 			result = wordnum * BITS_PER_BITMAPWORD;
-			while ((w & 255) == 0)
-			{
-				w >>= 8;
-				result += 8;
-			}
-			result += rightmost_one_pos[w & 255];
+			result += bmw_rightmost_one(w);
 		}
 	}
 	if (result < 0)
@@ -650,12 +590,7 @@ bms_get_singleton_member(const Bitmapset *a, int *member)
 			if (result >= 0 || HAS_MULTIPLE_ONES(w))
 				return false;
 			result = wordnum * BITS_PER_BITMAPWORD;
-			while ((w & 255) == 0)
-			{
-				w >>= 8;
-				result += 8;
-			}
-			result += rightmost_one_pos[w & 255];
+			result += bmw_rightmost_one(w);
 		}
 	}
 	if (result < 0)
@@ -681,12 +616,9 @@ bms_num_members(const Bitmapset *a)
 	{
 		bitmapword	w = a->words[wordnum];
 
-		/* we assume here that bitmapword is an unsigned type */
-		while (w != 0)
-		{
-			result += number_of_ones[w & 255];
-			w >>= 8;
-		}
+		/* No need to count the bits in a zero word */
+		if (w != 0)
+			result += bmw_popcount(w);
 	}
 	return result;
 }
@@ -1041,12 +973,7 @@ bms_first_member(Bitmapset *a)
 			a->words[wordnum] &= ~w;
 
 			result = wordnum * BITS_PER_BITMAPWORD;
-			while ((w & 255) == 0)
-			{
-				w >>= 8;
-				result += 8;
-			}
-			result += rightmost_one_pos[w & 255];
+			result += bmw_rightmost_one(w);
 			return result;
 		}
 	}
@@ -1096,12 +1023,7 @@ bms_next_member(const Bitmapset *a, int prevbit)
 			int			result;
 
 			result = wordnum * BITS_PER_BITMAPWORD;
-			while ((w & 255) == 0)
-			{
-				w >>= 8;
-				result += 8;
-			}
-			result += rightmost_one_pos[w & 255];
+			result += bmw_rightmost_one(w);
 			return result;
 		}
 
@@ -1168,14 +1090,9 @@ bms_prev_member(const Bitmapset *a, int prevbit)
 		if (w != 0)
 		{
 			int			result;
-			int			shift = BITS_PER_BITMAPWORD - 8;
 
 			result = wordnum * BITS_PER_BITMAPWORD;
-
-			while ((w >> shift) == 0)
-				shift -= 8;
-
-			result += shift + leftmost_one_pos[(w >> shift) & 255];
+			result += bmw_leftmost_one(w);
 			return result;
 		}
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 82547f321f..e3b461a68c 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -751,6 +751,24 @@
 /* Define to 1 if your compiler understands __builtin_$op_overflow. */
 #undef HAVE__BUILTIN_OP_OVERFLOW
 
+/* Define to 1 if your compiler understands __builtin_popcount. */
+#undef HAVE__BUILTIN_POPCOUNT
+
+/* Define to 1 if your compiler understands __builtin_popcountl. */
+#undef HAVE__BUILTIN_POPCOUNTL
+
+/* Define to 1 if your compiler understands __builtin_ctz. */
+#undef HAVE__BUILTIN_CTZ
+
+/* Define to 1 if your compiler understands __builtin_ctzl. */
+#undef HAVE__BUILTIN_CTZL
+
+/* Define to 1 if your compiler understands __builtin_clz. */
+#undef HAVE__BUILTIN_CLZ
+
+/* Define to 1 if your compiler understands __builtin_clzl. */
+#undef HAVE__BUILTIN_CLZL
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index a3c44f0fd8..e85b42b57d 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -590,6 +590,24 @@
 /* Define to 1 if your compiler understands __builtin_$op_overflow. */
 /* #undef HAVE__BUILTIN_OP_OVERFLOW */
 
+/* Define to 1 if your compiler understands __builtin_popcount. */
+/* #undef HAVE__BUILTIN_POPCOUNT */
+
+/* Define to 1 if your compiler understands __builtin_popcountl. */
+/* #undef HAVE__BUILTIN_POPCOUNTL */
+
+/* Define to 1 if your compiler understands __builtin_ctz. */
+/* #undef HAVE__BUILTIN_CTZ */
+
+/* Define to 1 if your compiler understands __builtin_ctzl. */
+/* #undef HAVE__BUILTIN_CTZL */
+
+/* Define to 1 if your compiler understands __builtin_clz. */
+/* #undef HAVE__BUILTIN_CLZ */
+
+/* Define to 1 if your compiler understands __builtin_clzl. */
+/* #undef HAVE__BUILTIN_CLZL */
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
new file mode 100644
index 0000000000..148c555057
--- /dev/null
+++ b/src/include/port/pg_bitutils.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_bitutils.h
+ *	  miscellaneous functions for bit-wise operations.
+ *
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_bitutils.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PG_BITUTILS_H
+#define PG_BITUTILS_H
+
+extern int (*pg_popcount32) (uint32 word);
+extern int (*pg_popcount64) (uint64 word);
+extern int (*pg_rightmost_one32) (uint32 word);
+extern int (*pg_rightmost_one64) (uint64 word);
+extern int (*pg_leftmost_one32) (uint32 word);
+extern int (*pg_leftmost_one64) (uint64 word);
+
+extern uint64 pg_popcount(const char *buf, int bytes);
+
+#endif							/* PG_BITUTILS_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 9cfc0f9279..df2e26b0a1 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -36,7 +36,7 @@ override CPPFLAGS := -I$(top_builddir)/src/port -DFRONTEND $(CPPFLAGS)
 LIBS += $(PTHREAD_LIBS)
 
 OBJS = $(LIBOBJS) $(PG_CRC32C_OBJS) chklocale.o erand48.o inet_net_ntop.o \
-	noblock.o path.o pgcheckdir.o pgmkdirp.o pgsleep.o \
+	noblock.o path.o pg_bitutils.o pgcheckdir.o pgmkdirp.o pgsleep.o \
 	pg_strong_random.o pgstrcasecmp.o pgstrsignal.o pqsignal.o \
 	qsort.o qsort_arg.o quotes.o snprintf.o sprompt.o strerror.o \
 	tar.o thread.o
@@ -73,6 +73,9 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42)
 pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_SSE42)
 pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_SSE42)
 
+# pg_bitutils.c needs CFLAGS_POPCNT
+pg_bitutils.o: CFLAGS+=$(CFLAGS_POPCNT)
+
 # all versions of pg_crc32c_armv8.o need CFLAGS_ARMV8_CRC32C
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
new file mode 100644
index 0000000000..97422e0504
--- /dev/null
+++ b/src/port/pg_bitutils.c
@@ -0,0 +1,516 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_bitutils.c
+ *	  miscellaneous functions for bit-wise operations.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_bitutils.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_POPCOUNT) || defined(HAVE__BUILTIN_POPCOUNTL))
+static bool pg_popcount_available(void);
+#endif
+
+#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+static int pg_popcount32_choose(uint32 word);
+static int pg_popcount32_sse42(uint32 word);
+#endif
+static int pg_popcount32_slow(uint32 word);
+
+#if defined(HAVE__BUILTIN_POPCOUNTL) && defined(HAVE__GET_CPUID)
+static int pg_popcount64_choose(uint64 word);
+static int pg_popcount64_sse42(uint64 word);
+#endif
+static int pg_popcount64_slow(uint64 word);
+
+#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CTZL) || defined(HAVE__BUILTIN_CLZ) || defined(HAVE__BUILTIN_CLZL))
+static bool pg_lzcnt_available(void);
+#endif
+
+#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
+static int pg_rightmost_one32_choose(uint32 word);
+static int pg_rightmost_one32_abm(uint32 word);
+#endif
+static int pg_rightmost_one32_slow(uint32 word);
+
+#if defined(HAVE__BUILTIN_CTZL) && defined(HAVE__GET_CPUID)
+static int pg_rightmost_one64_choose(uint64 word);
+static int pg_rightmost_one64_abm(uint64 word);
+#endif
+static int pg_rightmost_one64_slow(uint64 word);
+
+#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
+static int pg_leftmost_one32_choose(uint32 word);
+static int pg_leftmost_one32_abm(uint32 word);
+#endif
+static int pg_leftmost_one32_slow(uint32 word);
+
+#if defined(HAVE__BUILTIN_CLZL) && defined(HAVE__GET_CPUID)
+static int pg_leftmost_one64_choose(uint64 word);
+static int pg_leftmost_one64_abm(uint64 word);
+#endif
+static int pg_leftmost_one64_slow(uint64 word);
+
+#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+#else
+int (*pg_popcount32) (uint32 word) = pg_popcount32_slow;
+#endif
+
+#if defined(HAVE__BUILTIN_POPCOUNTL) && defined(HAVE__GET_CPUID)
+int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+#else
+int (*pg_popcount64) (uint64 word) = pg_popcount64_slow;
+#endif
+
+#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
+int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
+#else
+int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
+#endif
+
+#if defined(HAVE__BUILTIN_CTZL) && defined(HAVE__GET_CPUID)
+int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_choose;
+#else
+int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_slow;
+#endif
+
+#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
+int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_choose;
+#else
+int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_slow;
+#endif
+
+#if defined(HAVE__BUILTIN_CLZL) && defined(HAVE__GET_CPUID)
+int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_choose;
+#else
+int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_slow;
+#endif
+
+
+/* Array marking the number of 1-bits for each value of 0-255. */
+static const uint8 number_of_ones[256] = {
+	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+};
+
+/*
+ * Array marking the position of the right-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 rightmost_one_pos[256] = {
+	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
+};
+
+/*
+ * Array marking the position of the left-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 leftmost_one_pos[256] = {
+	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
+	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
+};
+
+#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_POPCOUNT) || defined(HAVE__BUILTIN_POPCOUNTL))
+
+static bool
+pg_popcount_available(void)
+{
+	unsigned int exx[4] = { 0, 0, 0, 0 };
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
+}
+#endif
+
+#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_popcount32_choose(uint32 word)
+{
+	if (pg_popcount_available())
+		pg_popcount32 = pg_popcount32_sse42;
+	else
+		pg_popcount32 = pg_popcount32_slow;
+
+	return pg_popcount32(word);
+}
+
+static int
+pg_popcount32_sse42(uint32 word)
+{
+	return __builtin_popcount(word);
+}
+#endif
+
+/*
+ * pg_popcount32_slow
+ *		Return the number of 1 bits set in word
+ */
+static int
+pg_popcount32_slow(uint32 word)
+{
+	int result = 0;
+
+	while (word != 0)
+	{
+		result += number_of_ones[word & 255];
+		word >>= 8;
+	}
+
+	return result;
+}
+
+/*
+ * pg_popcount
+ *		Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount(const char *buf, int bytes)
+{
+	uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+	/* Process in 64-bit chunks if the buffer is aligned. */
+	if (buf == (char *) TYPEALIGN(8, buf))
+	{
+		uint64 *words = (uint64 *) buf;
+
+		while (bytes >= 8)
+		{
+			popcnt += pg_popcount64(*words++);
+			bytes -= 8;
+		}
+
+		buf = (char *) words;
+	}
+#else
+	/* Process in 32-bit chunks if the buffer is aligned. */
+	if (buf == (char *) TYPEALIGN(4, buf))
+	{
+		uint32 *words = (uint32 *) buf;
+
+		while (bytes >= 4)
+		{
+			popcnt += pg_popcount32(*words++);
+			bytes -= 4;
+		}
+
+		buf = (char *) words;
+	}
+#endif
+
+	/* Process any remaining bytes */
+	while (bytes--)
+		popcnt += number_of_ones[(unsigned char) *buf++];
+
+	return popcnt;
+}
+
+#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNTL)
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_popcount64_choose(uint64 word)
+{
+	if (pg_popcount_available())
+		pg_popcount64 = pg_popcount64_sse42;
+	else
+		pg_popcount64 = pg_popcount64_slow;
+
+	return pg_popcount64(word);
+}
+
+static int
+pg_popcount64_sse42(uint64 word)
+{
+	return __builtin_popcountl(word);
+}
+
+#endif
+
+/*
+ * pg_popcount64_slow
+ *		Return the number of 1 bits set in word
+ */
+static int
+pg_popcount64_slow(uint64 word)
+{
+	int result = 0;
+
+	while (word != 0)
+	{
+		result += number_of_ones[word & 255];
+		word >>= 8;
+	}
+
+	return result;
+}
+
+#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CTZL) || defined(HAVE__BUILTIN_CLZ) || defined(HAVE__BUILTIN_CLZL))
+
+static bool
+pg_lzcnt_available(void)
+{
+
+	unsigned int exx[4] = { 0, 0, 0, 0 };
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(0x80000001, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, 0x80000001);
+#else
+#error cpuid instruction not available
+#endif
+
+	return (exx[2] & (1 << 5)) != 0;	/* LZCNT */
+}
+#endif
+
+#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_rightmost_one32_choose(uint32 word)
+{
+	if (pg_lzcnt_available())
+		pg_rightmost_one32 = pg_rightmost_one32_abm;
+	else
+		pg_rightmost_one32 = pg_rightmost_one32_slow;
+
+	return pg_rightmost_one32(word);
+}
+
+static int
+pg_rightmost_one32_abm(uint32 word)
+{
+	return __builtin_ctz(word);
+}
+
+#endif
+
+/*
+ * pg_rightmost_one32_slow
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static int
+pg_rightmost_one32_slow(uint32 word)
+{
+	int result = 0;
+
+	Assert(word != 0);
+
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+
+	return result;
+}
+
+#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZL)
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_rightmost_one64_choose(uint64 word)
+{
+	if (pg_lzcnt_available())
+		pg_rightmost_one64 = pg_rightmost_one64_abm;
+	else
+		pg_rightmost_one64 = pg_rightmost_one64_slow;
+
+	return pg_rightmost_one64(word);
+}
+
+static int
+pg_rightmost_one64_abm(uint64 word)
+{
+	return __builtin_ctzl(word);
+}
+#endif
+
+/*
+ * pg_rightmost_one64_slow
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static int
+pg_rightmost_one64_slow(uint64 word)
+{
+	int result = 0;
+
+	Assert(word != 0);
+
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+
+	return result;
+}
+
+#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_leftmost_one32_choose(uint32 word)
+{
+	if (pg_lzcnt_available())
+		pg_leftmost_one32 = pg_leftmost_one32_abm;
+	else
+		pg_leftmost_one32 = pg_leftmost_one32_slow;
+
+	return pg_leftmost_one32(word);
+}
+
+static int
+pg_leftmost_one32_abm(uint32 word)
+{
+	return 31 - __builtin_clz(word);
+}
+#endif
+
+/*
+ * pg_leftmost_one32_slow
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+static int
+pg_leftmost_one32_slow(uint32 word)
+{
+	int			shift = 32 - 8;
+
+	Assert(word != 0);
+
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+}
+
+#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZL)
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static int
+pg_leftmost_one64_choose(uint64 word)
+{
+	if (pg_lzcnt_available())
+		pg_leftmost_one64 = pg_leftmost_one64_abm;
+	else
+		pg_leftmost_one64 = pg_leftmost_one64_slow;
+
+	return pg_leftmost_one64(word);
+}
+
+static int
+pg_leftmost_one64_abm(uint64 word)
+{
+	return 63 - __builtin_clzl(word);
+}
+#endif
+
+/*
+ * pg_leftmost_one64_slow
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+static int
+pg_leftmost_one64_slow(uint64 word)
+{
+	int			shift = 64 - 8;
+
+	Assert(word != 0);
+
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+}
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 56192f1b20..b688111801 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -110,6 +110,7 @@ sub mkvcbuild
 		push(@pgportfiles, 'pg_crc32c_sse42_choose.c');
 		push(@pgportfiles, 'pg_crc32c_sse42.c');
 		push(@pgportfiles, 'pg_crc32c_sb8.c');
+		push(@pgportfiles, 'pg_bitutils.c');
 	}
 	else
 	{
-- 
2.16.2.windows.1

#12Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#11)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-04, David Rowley wrote:

On Fri, 1 Feb 2019 at 11:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I only have cosmetic suggestions for this patch. For one thing, I think
the .c file should be in src/port and its header should be in
src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h.

I've moved the code into src/port and renamed the file to pg_bitutils.c

I've pushed this now. Let's see what the buildfarm has to say about it.

I've compiled and run make check-world on Linux with GCC 7.3 and
clang 6.0. I've also tested on MSVC to ensure I didn't break Windows.
It would be good to get a few more people to compile it and run the
tests.

That's what the buildfarm is for ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#12)
Re: Using POPCNT and other advanced bit manipulation instructions

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I've pushed this now. Let's see what the buildfarm has to say about it.

It's likely to be hard to tell, given the amount of pink from the Ryu
patch. If Andrew is not planning to clean that up PDQ, I'd suggest
reverting that patch pending having some repairs for it.

regards, tom lane

#14Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: Using POPCNT and other advanced bit manipulation instructions

Hi,

On February 13, 2019 8:40:14 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I've pushed this now. Let's see what the buildfarm has to say about it.

It's likely to be hard to tell, given the amount of pink from the Ryu
patch. If Andrew is not planning to clean that up PDQ, I'd suggest
reverting that patch pending having some repairs for it.

I'd assume that breaking bit counting would cause distinct enough damage (compile time or crashes). That's not to say that reverting ryu shouldn't be considered (although I'm not that bothered by cross version, ia64 and cygwin failures, especially because the latter two might be hard to come by outside the bf).

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: Using POPCNT and other advanced bit manipulation instructions

Andres Freund <andres@anarazel.de> writes:

I'd assume that breaking bit counting would cause distinct enough damage (compile time or crashes). That's not to say that reverting ryu shouldn't be considered (although I'm not that bothered by cross version, ia64 and cygwin failures, especially because the latter two might be hard to come by outside the bf).

The pink doesn't appear to be limited to non-mainstream platforms,
see eg lapwing, fulmar. However, I see Andrew just pushed some fixes,
so this argument is moot pending how much that helps.

regards, tom lane

#16Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Andres Freund (#14)
Re: Using POPCNT and other advanced bit manipulation instructions

"Andres" == Andres Freund <andres@anarazel.de> writes:

It's likely to be hard to tell, given the amount of pink from the
Ryu patch. If Andrew is not planning to clean that up PDQ,

Besides crake (x-version), fulmar (icc) and lorikeet (cygwin), I hope
the rest of the known failures should pass on the next cycle; the
mac/ppc failures are because we redefine "bool" in a way that broke the
upstream code's c99 assumptions, and the rest are numerical instability
in ts_rank.

I'd suggest reverting that patch pending having some repairs for it.

Andres> I'd assume that breaking bit counting would cause distinct
Andres> enough damage (compile time or crashes). That's not to say that
Andres> reverting ryu shouldn't be considered (although I'm not that
Andres> bothered by cross version, ia64 and cygwin failures, especially
Andres> because the latter two might be hard to come by outside the
Andres> bf).

IA64 is working fine as far as I can see (specifically, anole is
passing); it's ICC on x86_64 that broke (fulmar).

--
Andrew (irc:RhodiumToad)

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#14)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-13, Andres Freund wrote:

Hi,

On February 13, 2019 8:40:14 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I've pushed this now. Let's see what the buildfarm has to say about it.

It's likely to be hard to tell, given the amount of pink from the Ryu
patch. If Andrew is not planning to clean that up PDQ, I'd suggest
reverting that patch pending having some repairs for it.

I'd assume that breaking bit counting would cause distinct enough
damage (compile time or crashes).

I was a bit surprised to find out that the assembly generated by
compiling the code in test for __builtin_foo() does not actually include
the calls being tested ... (they're only used to generate the value for
a static variable, and that gets optimized away); but then the comment
for the test does say that we're only testing that the compiler
understands the construct, so I suppose that's fine. Also, we already
do that for bswap.

This "compiler explorer" tool is nice:

https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(j:1,source:'int+main(int+argc+,+char**+argv)%0A%7B%0A%0A++return++__builtin_popcountll((unsigned)argc)%3B%0A%0A%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:45.01119572686435,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:g447,filters:(b:'0',commentOnly:'0',directives:'0',intel:'0'),libs:!(),options:'-Wall+-O3+-msse4.2',source:1),l:'5',n:'0',o:'x86-64+gcc+4.4.7+(Editor+%231,+Compiler+%231)',t:'0')),k:54.98880427313565,l:'4',m:100,n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Andrew Gierth (#16)
Re: Using POPCNT and other advanced bit manipulation instructions

"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

Andrew> IA64 is working fine as far as I can see (specifically, anole
Andrew> is passing); it's ICC on x86_64 that broke (fulmar).

And I know what's wrong on fulmar now, so that'll be fixed shortly.

--
Andrew (irc:RhodiumToad)

#19Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Alvaro Herrera (#12)
Re: Using POPCNT and other advanced bit manipulation instructions

"Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Alvaro> I've pushed this now. Let's see what the buildfarm has to say
Alvaro> about it.

Lapwing's latest failure looks like yours rather than mine now? (the
previous two were mine)

--
Andrew (irc:RhodiumToad)

#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Gierth (#19)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-13, Andrew Gierth wrote:

"Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Alvaro> I've pushed this now. Let's see what the buildfarm has to say
Alvaro> about it.

Lapwing's latest failure looks like yours rather than mine now? (the
previous two were mine)

It definitely is ... plans have changed from using IndexOnly scans to
Seqscans, which is likely fallout from the visibilitymap_count() change.
Looking.

(I patched the Makefile to add -mpopcnt to all the compile lines rather
than just the frontend one, but I can't see that explaining the
failure.)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Alvaro Herrera (#20)
Re: Using POPCNT and other advanced bit manipulation instructions

"Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Lapwing's latest failure looks like yours rather than mine now? (the
previous two were mine)

Alvaro> It definitely is ... plans have changed from using IndexOnly
Alvaro> scans to Seqscans, which is likely fallout from the
Alvaro> visibilitymap_count() change. Looking.

As for the rest, crake's "configure" failure was due to Andrew aborting
a run presumably to tweak the config, and fulmar finished a run just
before I committed the fix that should turn it green again. I'm
obviously going to keep watching, but to my knowledge only crake
(x-version test) and lorikeet (cygwin) should still be broken from my
stuff.

--
Andrew (irc:RhodiumToad)

#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#20)
1 attachment(s)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-13, Alvaro Herrera wrote:

It definitely is ... plans have changed from using IndexOnly scans to
Seqscans, which is likely fallout from the visibilitymap_count() change.

I think the problem here is that "unsigned long" is 32 bits on this
machine:
checking whether long int is 64 bits... no

and we have defined pg_popcount64() like this:

static int
pg_popcount64_sse42(uint64 word)
{
return __builtin_popcountl(word);
}

so it's counting bits in the lower half of the uint64.

If that's correct, then I think we need something like this patch. But
it makes me wonder whether we need a configure test for
__builtin_popcountll() and friends. I wonder if there's any compiler
that implements __builtin_popcountl() but not __builtin_popcountll() ...
and if not, then the test for __builtin_popcountl() should be removed,
and have everything rely on the one for __builtin_popcount().
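
The truncation described above can be reproduced without a 32-bit machine. What follows is a minimal sketch, not code from the patch: popcount64_truncated() and popcount64_full() are hypothetical names, and GCC or clang is assumed. The first function casts to uint32_t to mimic the implicit conversion that happens when a uint64 is passed to a builtin taking a 32-bit "unsigned long"; the second uses __builtin_popcountll(), whose argument type is at least 64 bits wide.

```c
#include <stdint.h>

/*
 * Mimics __builtin_popcountl() on a platform where "unsigned long" is
 * 32 bits: the conversion silently discards the upper half of the word.
 */
static int
popcount64_truncated(uint64_t word)
{
	return __builtin_popcount((uint32_t) word);
}

/* "unsigned long long" is at least 64 bits, so nothing is lost here. */
static int
popcount64_full(uint64_t word)
{
	return __builtin_popcountll(word);
}
```

For a word whose set bits all sit in the upper half, such as 0xFFFFFFFF00000000, the first function returns 0 and the second returns 32.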

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

uint64-is-longlong.patch (text/x-diff; charset=us-ascii)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 97422e05040..fe6afb9fba5 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -299,7 +299,14 @@ pg_popcount64_choose(uint64 word)
 static int
 pg_popcount64_sse42(uint64 word)
 {
+#if defined(HAVE_LONG_INT_64)
 	return __builtin_popcountl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_popcountll(word);
+#else
+	/* shouldn't happen */
+#error must have a working 64-bit integer datatype
+#endif
 }
 
 #endif
@@ -407,7 +414,14 @@ pg_rightmost_one64_choose(uint64 word)
 static int
 pg_rightmost_one64_abm(uint64 word)
 {
+#if defined(HAVE_LONG_INT_64)
 	return __builtin_ctzl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_ctzll(word);
+#else
+	/* shouldn't happen */
+#error must have a working 64-bit integer datatype
+#endif
 }
 #endif
 
@@ -493,7 +507,15 @@ pg_leftmost_one64_choose(uint64 word)
 static int
 pg_leftmost_one64_abm(uint64 word)
 {
+#if defined(HAVE_LONG_INT_64)
 	return 63 - __builtin_clzl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return 63 - __builtin_clzll(word);
+#else
+	/* shouldn't happen */
+#error must have a working 64-bit integer datatype
+#endif
+
 }
 #endif
 
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#22)
Re: Using POPCNT and other advanced bit manipulation instructions

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

and we have defined pg_popcount64() like this:

static int
pg_popcount64_sse42(uint64 word)
{
return __builtin_popcountl(word);
}

That is clearly completely broken.

If that's correct, then I think we need something like this patch. But
it makes me wonder whether we need a configure test for
__builtin_popcountll() and friends. I wonder if there's any compiler
that implements __builtin_popcountl() but not __builtin_popcountll() ...
and if not, then the test for __builtin_popcountl() should be removed,
and have everything rely on the one for __builtin_popcount().

AFAICS, this is a gcc-ism, and it looks like they've probably had
all width variants for the same amount of time. I'd take out the
test for __builtin_popcountl(), and assume that testing for
__builtin_popcount() is sufficient until proven differently.

regards, tom lane

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#23)
Re: Using POPCNT and other advanced bit manipulation instructions

... btw, why is pg_popcount casting away the const from its pointer
argument?

regards, tom lane

#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#23)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-13, Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

and we have defined pg_popcount64() like this:

static int
pg_popcount64_sse42(uint64 word)
{
return __builtin_popcountl(word);
}

That is clearly completely broken.

Pushed my proposed fix, which includes removing the configure tests for
builtins of varying widths. I couldn't resist sorting entries
alphabetically in configure.in. (I also used autoheader to produce the
new pg_config.h, which showed me that David had not used it to generate
his diffs there.)

For pg_config.h.win32 I used the compiler explorer tool I just learned
about, and came to the conclusion that MSVC's compiler does not
implement these builtins.

I didn't do anything about the const-cast-away in pg_popcount() yet.
I think that should use PointerIsAligned() instead of what it's doing
now.
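
One way to keep the const qualifier while still testing alignment is sketched below. This is an illustration under assumptions, not the committed fix: popcount_buf() and popcount_byte() are hypothetical names, the alignment test goes through uintptr_t rather than PointerIsAligned() or TYPEALIGN(), and GCC or clang is assumed for __builtin_popcountll().

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits of one byte; each iteration clears the lowest set bit. */
static int
popcount_byte(unsigned char b)
{
	int			n = 0;

	while (b != 0)
	{
		b &= (unsigned char) (b - 1);
		n++;
	}
	return n;
}

/* Count the set bits in buf without ever dropping "const". */
uint64_t
popcount_buf(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	/* Process in 64-bit chunks only if the buffer is 8-byte aligned. */
	if (((uintptr_t) buf % sizeof(uint64_t)) == 0)
	{
		const uint64_t *words = (const uint64_t *) buf;

		while (bytes >= 8)
		{
			popcnt += (uint64_t) __builtin_popcountll(*words++);
			bytes -= 8;
		}
		buf = (const char *) words;
	}

	/* Process any remaining bytes (or the whole buffer if unaligned). */
	while (bytes-- > 0)
		popcnt += (uint64_t) popcount_byte((unsigned char) *buf++);

	return popcnt;
}
```

The pointer is only ever converted between const-qualified types, so no cast removes the qualifier.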

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#24)
Re: Using POPCNT and other advanced bit manipulation instructions

And, while I'm complaining: why the devil is use of the compiler builtins
gated by HAVE__GET_CPUID? This is unbelievably Intel-centric, because
it prevents use of the builtins on other architectures. If the builtin
exists, we should use it, full stop. There's no reason to expect that it
would be slower than hand-rolled code, regardless of the architecture.

I'd be inclined to rip out all of the run-time-detection logic here;
I doubt any of it is buying anything that's worth the price of an
indirect call.

regards, tom lane

#27Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#26)
Re: Using POPCNT and other advanced bit manipulation instructions

On Thu, Feb 14, 2019 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

And, while I'm complaining: why the devil is use of the compiler builtins
gated by HAVE__GET_CPUID? This is unbelievably Intel-centric, because
it prevents use of the builtins on other architectures. If the builtin
exists, we should use it, full stop. There's no reason to expect that it
would be slower than hand-rolled code, regardless of the architecture.

FWIW a quick test of __builtin_popcount(n) compiles as CNT on a Debian
ARM system, without any special compiler flags.

I'd be inclined to rip out all of the run-time-detection logic here;
I doubt any of it is buying anything that's worth the price of an
indirect call.

No view on that but apparently there were Intel Atom and AMD C chips
sold in the early part of this decade that lack POPCNT so I suspect
the distros can't ship software that requires it with no fallback.
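
The fallback arrangement under discussion, and the indirect call it costs, can be sketched as follows. This is a simplified illustration rather than the patch itself: all names are hypothetical, GCC or clang is assumed, and the CPU probe uses __builtin_cpu_supports() instead of the raw CPUID call in the patch. Note that without -mpopcnt the "fast" function here does not necessarily compile to the POPCNT instruction; in the patch that function is built with the flag.

```c
#include <stdint.h>

static int	popcount32_choose(uint32_t word);

/* Starts out pointing at the chooser; the first call repoints it. */
int			(*popcount32) (uint32_t word) = popcount32_choose;

/* Portable fallback: each iteration clears the lowest set bit. */
static int
popcount32_slow(uint32_t word)
{
	int			result = 0;

	while (word != 0)
	{
		word &= word - 1;
		result++;
	}
	return result;
}

static int
popcount32_fast(uint32_t word)
{
	return __builtin_popcount(word);
}

/* Run-time probe; reports "no" on anything that is not x86. */
static int
cpu_has_popcnt(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	return __builtin_cpu_supports("popcnt") != 0;
#else
	return 0;
#endif
}

/* Called once, via the pointer; later calls go straight to the choice. */
static int
popcount32_choose(uint32_t word)
{
	popcount32 = cpu_has_popcnt() ? popcount32_fast : popcount32_slow;
	return popcount32(word);
}
```

Every caller goes through the popcount32 pointer, which is the indirect call Tom is questioning.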

--
Thomas Munro
http://www.enterprisedb.com

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#27)
Re: Using POPCNT and other advanced bit manipulation instructions

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Thu, Feb 14, 2019 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'd be inclined to rip out all of the run-time-detection logic here;
I doubt any of it is buying anything that's worth the price of an
indirect call.

No view on that but apparently there were Intel Atom and AMD C chips
sold in the early part of this decade that lack POPCNT so I suspect
the distros can't ship software that requires it with no fallback.

Ah, I was not looking at the business with the optional -mpopcnt
compiler flag. I agree that we probably should not assume that
code compiled with that will run anywhere. But it's silly to build
all this infrastructure and then throw away the opportunity to
optimize for anything but late-model Intel.

A survey of the buildfarm results so far says that __builtin_clz
and __builtin_ctz exist just about everywhere, and even
__builtin_popcount is available on some non-Intel architectures.
It is reasonable to assume that those builtins are faster than
the C equivalents if they exist. It's reasonable to assume that
even on old-school Intel hardware.

The way this should have been done is to have a separate file
that's compiled with -mpopcnt if the compiler has that (and
has the builtins), and for the mainline file to have "slow"
versions that use the less-optimized builtins if available,
and only fall back to raw C code if not HAVE__BUILTIN_WHATEVER.
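
A rough sketch of the "mainline" half of that layout, using the trailing-zero case: use the builtin wherever it exists, with no CPUID gate, and fall back to plain C only otherwise. The function name is hypothetical, and HAVE__BUILTIN_CTZ is defined here from compiler macros purely as a stand-in for the configure result.

```c
#include <stdint.h>

/* Stand-in for the configure probe; an assumption of this sketch. */
#if defined(__GNUC__) || defined(__clang__)
#define HAVE__BUILTIN_CTZ 1
#endif

/*
 * Position of the right-most set bit, counting from 0.
 * word must not be 0.
 */
static int
rightmost_one32(uint32_t word)
{
#ifdef HAVE__BUILTIN_CTZ
	/* Available on any architecture the compiler supports. */
	return __builtin_ctz(word);
#else
	int			pos = 0;

	while ((word & 1) == 0)
	{
		word >>= 1;
		pos++;
	}
	return pos;
#endif
}
```

Only the variant that needs a special flag such as -mpopcnt would then have to live in its own file and be selected at run time.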

Also, in

#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)

static bool
pg_popcount_available(void)
{
unsigned int exx[4] = { 0, 0, 0, 0 };

#if defined(HAVE__GET_CPUID)
__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
__cpuid(exx, 1);
#else
#error cpuid instruction not available
#endif

return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
#endif

it's obvious to the naked eye that the __cpuid() and #error
branches are unreachable because of the outer #if. I don't
think that was the design intention.
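
For illustration, here is one way the probe could be written so that every branch is reachable. It is a sketch under assumptions, not the fix that was committed: popcnt_available() is a hypothetical name, the HAVE_ macros are derived from compiler macros as stand-ins for configure results, and the final branch returns false instead of raising #error.

```c
#include <stdbool.h>

/* Stand-ins for the configure results; assumptions of this sketch. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE__GET_CPUID 1
#include <cpuid.h>
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#define HAVE__CPUID 1
#include <intrin.h>
#endif

static bool
popcnt_available(void)
{
#if defined(HAVE__GET_CPUID)
	unsigned int exx[4] = {0, 0, 0, 0};

	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
	return (exx[2] & (1u << 23)) != 0;	/* ECX bit 23: POPCNT */
#elif defined(HAVE__CPUID)
	int			exx[4] = {0, 0, 0, 0};

	__cpuid(exx, 1);
	return (exx[2] & (1 << 23)) != 0;	/* ECX bit 23: POPCNT */
#else
	return false;				/* no CPUID probe: use the fallback */
#endif
}
```

Because the function is no longer wrapped in an outer test for HAVE__GET_CPUID alone, the __cpuid() branch can actually be selected.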

regards, tom lane

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#28)
Re: Using POPCNT and other advanced bit manipulation instructions

Some further thoughts here ...

Does the "lzcnt" runtime probe actually do anything useful?
On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz
and __builtin_ctz compile to sequences involving bsrq and bsfq
regardless of -mpopcnt. It's fairly hard to see how lzcnt would
buy anything over those sequences even if there were zero overhead
involved in using it.

Alvaro noted that the test programs used by c-compiler.m4 fail
to produce any actual code involving the builtin, because of
compile-time constant folding. This seems pretty unwise.
I see that on my x86_64 compilers, without -mpopcnt,
__builtin_popcount compiles to a call of some libgcc function
or other. It's conceivable that on an (arguably misconfigured)
platform, these c-compiler.m4 tests would pass yet the build fails
at link because libgcc lacks the needed infrastructure. These tests
should be coded in a way that doesn't allow the call to be optimized
away -- cf comments for PGAC_C_BUILTIN_OP_OVERFLOW.
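
As a hedged sketch of the idea (not the actual c-compiler.m4 test, and not necessarily how PGAC_C_BUILTIN_OP_OVERFLOW does it), a probe can feed the builtin a value the compiler cannot treat as a constant, so the call has to survive into the object code. The names probe_input and call_popcount are hypothetical; a real probe would call the function from main() and would need to be linked, not just compiled, to catch a missing libgcc helper.

```c
/* Volatile, so the compiler cannot fold the call at compile time. */
static volatile unsigned int probe_input = 255;

int
call_popcount(void)
{
	/* The load from probe_input happens at run time on every call. */
	return __builtin_popcount(probe_input);
}
```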

Also, it's starting to seem like we have enough probes for compiler
builtins that we should fold them to use one set of infrastructure.
There are some like __builtin_constant_p that probably do need their
own custom tests, but these ones that just verify that a call
compiles seem pretty duplicative ...

regards, tom lane

#30Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Alvaro Herrera (#22)
Re: Using POPCNT and other advanced bit manipulation instructions

On 14/02/2019 11:17, Alvaro Herrera wrote:

On 2019-Feb-13, Alvaro Herrera wrote:

It definitely is ... plans have changed from using IndexOnly scans to
Seqscans, which is likely fallout from the visibilitymap_count() change.

I think the problem here is that "unsigned long" is 32 bits on this
machine:

[...]

From my memory of reading K&R many moons ago, it said that C only
guarantees that the lengths are such that

byte <= half word <= word <= long

But I don't recall ever seeing a long less than 32 bits!

Cheers,
Gavin

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Flower (#30)
Re: Using POPCNT and other advanced bit manipulation instructions

Gavin Flower <GavinFlower@archidevsys.co.nz> writes:

From my memory of reading K&R many moons ago, it said that C only
guarantees that the lengths are such that
byte <= half word <= word <= long

Indeed.

But I don't recall ever seeing a long less than 32 bits!

I'm not sure offhand what C89 said, but C99 requires "short" to be
at least 16 bits, "long" to be at least 32 bits, and "long long"
to be at least 64; see the minimum allowed values for SHRT_MAX etc.

C99 does permit "int" to be only 16 bits, but Postgres doesn't
pretend to work on such an architecture, and nobody's made one
since the (early?) 90s.
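
The C99 minimums mentioned above can be checked at compile time against <limits.h>. The sketch below is only an illustration; long_is_64_bits() is a hypothetical helper showing the one property that does vary between platforms, which is the property that bit lapwing.

```c
#include <limits.h>

/* C99 minimum magnitudes; these hold on any conforming implementation. */
#if SHRT_MAX < 32767
#error "short" is narrower than 16 bits
#endif
#if LONG_MAX < 2147483647L
#error "long" is narrower than 32 bits
#endif
#if LLONG_MAX < 9223372036854775807LL
#error "long long" is narrower than 64 bits
#endif

/*
 * "long" is only guaranteed 32 bits.  This is true on typical 64-bit
 * Unix systems and false on 32-bit systems and 64-bit Windows.
 */
static int
long_is_64_bits(void)
{
	return sizeof(long) * CHAR_BIT >= 64;
}
```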

regards, tom lane

#32Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#29)
1 attachment(s)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-14, Tom Lane wrote:

Some further thoughts here ...

Does the "lzcnt" runtime probe actually do anything useful?
On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz
and __builtin_ctz compile to sequences involving bsrq and bsfq
regardless of -mpopcnt. It's fairly hard to see how lzcnt would
buy anything over those sequences even if there were zero overhead
involved in using it.

Hah, I just realized you have to add -mlzcnt in order for these builtins
to use the lzcnt instructions. It goes from something like

bsrq %rax, %rax
xorq $63, %rax

to
lzcntq %rax, %rax

Significant?

I have this patch now, written before I realized the above; I'll augment
it to cater to this (adding -mlzcnt and a new set of functions --
perhaps a new file "lzcnt.c" or maybe put the lot in pg_popcount.c and
rename it?) and resubmit after an errand I have to run now.
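
The two-instruction sequence works because bsr yields the index p of the highest set bit, and for 0 <= p <= 63 the expression p ^ 63 equals 63 - p, which is the leading-zero count that lzcnt produces directly. A small sketch of that equivalence in plain C follows; the function names are hypothetical and the loop merely stands in for what bsr computes.

```c
#include <stdint.h>

/* Index of the highest set bit (what bsr computes); word must not be 0. */
static int
highest_bit_index(uint64_t word)
{
	int			pos = 0;

	while (word >>= 1)
		pos++;
	return pos;
}

/* Leading-zero count derived the way the bsrq/xorq sequence does it. */
static int
clz64_via_bsr(uint64_t word)
{
	return highest_bit_index(word) ^ 63;
}
```

For any nonzero word this should agree with __builtin_clzll(word), so the difference between the two code sequences is one instruction, not the result.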

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

popcount.patch (text/x-diff; charset=us-ascii)
commit 477cc89802effed5d80d4a82891c7ef3b5e61e63
Author:     Alvaro Herrera <alvherre@alvh.no-ip.org>
AuthorDate: Thu Feb 14 15:18:03 2019 -0300
CommitDate: Thu Feb 14 15:31:32 2019 -0300

    fix popcount etc

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 148c5550573..53787a7ef32 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -23,4 +23,8 @@ extern int (*pg_leftmost_one64) (uint64 word);
 
 extern uint64 pg_popcount(const char *buf, int bytes);
 
+/* in pg_popcount.c */
+extern int pg_popcount32_sse42(uint32 word);
+extern int pg_popcount64_sse42(uint64 word);
+
 #endif							/* PG_BITUTILS_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 2da73260a13..d7290573c65 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -41,6 +41,13 @@ OBJS = $(LIBOBJS) $(PG_CRC32C_OBJS) chklocale.o erand48.o inet_net_ntop.o \
 	qsort.o qsort_arg.o quotes.o snprintf.o sprompt.o strerror.o \
 	tar.o thread.o
 
+# If the compiler supports a flag for the POPCOUNT instruction, we compile
+# pg_popcount.o with it.  (Whether to actually use the functions therein is
+# determined at runtime by testing CPUID flags.)
+ifneq ($(CFLAGS_POPCNT),)
+OBJS += pg_popcount.o
+endif
+
 # libpgport.a, libpgport_shlib.a, and libpgport_srv.a contain the same files
 # foo.o, foo_shlib.o, and foo_srv.o are all built from foo.c
 OBJS_SHLIB = $(OBJS:%.o=%_shlib.o)
@@ -78,10 +85,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 
-# pg_bitutils.c needs CFLAGS_POPCNT
-pg_bitutils.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+# pg_popcount.c needs CFLAGS_POPCNT
+pg_popcount.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
 
 #
 # Shared library versions of object files
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index aac394fe927..fc8de518791 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -10,7 +10,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #ifdef HAVE__GET_CPUID
@@ -23,61 +22,52 @@
 
 #include "port/pg_bitutils.h"
 
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+#ifdef HAVE__BUILTIN_POPCOUNT
 static bool pg_popcount_available(void);
 static int pg_popcount32_choose(uint32 word);
-static int pg_popcount32_sse42(uint32 word);
+static int pg_popcount32_builtin(uint32 word);
 static int pg_popcount64_choose(uint64 word);
-static int pg_popcount64_sse42(uint64 word);
-#endif
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-static bool pg_lzcnt_available(void);
-#endif
-
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-static int pg_rightmost_one32_choose(uint32 word);
-static int pg_rightmost_one32_abm(uint32 word);
-static int pg_rightmost_one64_choose(uint64 word);
-static int pg_rightmost_one64_abm(uint64 word);
-#endif
-static int pg_rightmost_one32_slow(uint32 word);
-static int pg_rightmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-static int pg_leftmost_one32_choose(uint32 word);
-static int pg_leftmost_one32_abm(uint32 word);
-static int pg_leftmost_one64_choose(uint64 word);
-static int pg_leftmost_one64_abm(uint64 word);
-#endif
-static int pg_leftmost_one32_slow(uint32 word);
-static int pg_leftmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+static int pg_popcount64_builtin(uint64 word);
 int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 #else
+static int pg_popcount32_slow(uint32 word);
+static int pg_popcount64_slow(uint64 word);
 int (*pg_popcount32) (uint32 word) = pg_popcount32_slow;
 int (*pg_popcount64) (uint64 word) = pg_popcount64_slow;
 #endif
 
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
+#if defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ)
+static bool pg_lzcnt_available(void);
+#endif
+
+#ifdef HAVE__BUILTIN_CTZ
+static int pg_rightmost_one32_choose(uint32 word);
+static int pg_rightmost_one32_abm(uint32 word);
+static int pg_rightmost_one64_choose(uint64 word);
+static int pg_rightmost_one64_abm(uint64 word);
 int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
 int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_choose;
 #else
 int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
 int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_slow;
 #endif
+static int pg_rightmost_one32_slow(uint32 word);
+static int pg_rightmost_one64_slow(uint64 word);
 
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
+#ifdef HAVE__BUILTIN_CLZ
+static int pg_leftmost_one32_choose(uint32 word);
+static int pg_leftmost_one32_abm(uint32 word);
+static int pg_leftmost_one64_choose(uint64 word);
+static int pg_leftmost_one64_abm(uint64 word);
 int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_choose;
 int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_choose;
 #else
 int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_slow;
 int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_slow;
 #endif
+static int pg_leftmost_one32_slow(uint32 word);
+static int pg_leftmost_one64_slow(uint64 word);
 
 /* Array marking the number of 1-bits for each value of 0-255. */
 static const uint8 number_of_ones[256] = {
@@ -147,27 +137,27 @@ static const uint8 leftmost_one_pos[256] = {
 	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
 };
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
 
 static bool
 pg_popcount_available(void)
 {
+#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
 	unsigned int exx[4] = { 0, 0, 0, 0 };
 
 #if defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
 #endif
 
 	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
+#else		/* HAVE__GET_CPUID || HAVE__CPUID */
+
+	return false;
 #endif
+}
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
  * This gets called on the first call. It replaces the function pointer
  * so that subsequent calls are routed directly to the chosen implementation.
@@ -178,18 +168,19 @@ pg_popcount32_choose(uint32 word)
 	if (pg_popcount_available())
 		pg_popcount32 = pg_popcount32_sse42;
 	else
-		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount32 = pg_popcount32_builtin;
 
 	return pg_popcount32(word);
 }
 
 static int
-pg_popcount32_sse42(uint32 word)
+pg_popcount32_builtin(uint32 word)
 {
 	return __builtin_popcount(word);
 }
 #endif
 
+#ifndef HAVE__BUILTIN_POPCOUNT
 /*
  * pg_popcount32_slow
  *		Return the number of 1 bits set in word
@@ -207,6 +198,7 @@ pg_popcount32_slow(uint32 word)
 
 	return result;
 }
+#endif
 
 /*
  * pg_popcount
@@ -254,8 +246,7 @@ pg_popcount(const char *buf, int bytes)
 	return popcnt;
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
  * This gets called on the first call. It replaces the function pointer
  * so that subsequent calls are routed directly to the chosen implementation.
@@ -266,26 +257,26 @@ pg_popcount64_choose(uint64 word)
 	if (pg_popcount_available())
 		pg_popcount64 = pg_popcount64_sse42;
 	else
-		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount64 = pg_popcount64_builtin;
 
 	return pg_popcount64(word);
 }
 
 static int
-pg_popcount64_sse42(uint64 word)
+pg_popcount64_builtin(uint64 word)
 {
 #if defined(HAVE_LONG_INT_64)
 	return __builtin_popcountl(word);
 #elif defined(HAVE_LONG_LONG_INT_64)
 	return __builtin_popcountll(word);
 #else
-	/* shouldn't happen */
 #error must have a working 64-bit integer datatype
 #endif
 }
 
-#endif
+#endif		/* HAVE__BUILTIN_POPCOUNT */
 
+#ifndef HAVE__BUILTIN_POPCOUNT
 /*
  * pg_popcount64_slow
  *		Return the number of 1 bits set in word
@@ -303,28 +294,28 @@ pg_popcount64_slow(uint64 word)
 
 	return result;
 }
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
+#endif
 
 static bool
 pg_lzcnt_available(void)
 {
-
+#if (defined(HAVE__GET_CPUID) || defined(HAVE__CPUID))
 	unsigned int exx[4] = { 0, 0, 0, 0 };
 
 #if defined(HAVE__GET_CPUID)
 	__get_cpuid(0x80000001, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 0x80000001);
-#else
-#error cpuid instruction not available
 #endif
 
 	return (exx[2] & (1 << 5)) != 0;	/* LZCNT */
-}
-#endif
+#else		/* HAVE__GET_CPUID || HAVE__CPUID */
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
+	return false;
+#endif
+}
+
+#ifdef HAVE__BUILTIN_CTZ
 /*
  * This gets called on the first call. It replaces the function pointer
  * so that subsequent calls are routed directly to the chosen implementation.
@@ -346,7 +337,7 @@ pg_rightmost_one32_abm(uint32 word)
 	return __builtin_ctz(word);
 }
 
-#endif
+#endif		/* HAVE__BUILTIN_CTZ */
 
 /*
  * pg_rightmost_one32_slow
@@ -370,7 +361,7 @@ pg_rightmost_one32_slow(uint32 word)
 	return result;
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
+#ifdef HAVE__BUILTIN_CTZ
 /*
  * This gets called on the first call. It replaces the function pointer
  * so that subsequent calls are routed directly to the chosen implementation.
@@ -394,11 +385,10 @@ pg_rightmost_one64_abm(uint64 word)
 #elif defined(HAVE_LONG_LONG_INT_64)
 	return __builtin_ctzll(word);
 #else
-	/* shouldn't happen */
 #error must have a working 64-bit integer datatype
 #endif
 }
-#endif
+#endif		/* HAVE__BUILTIN_CTZ */
 
 /*
  * pg_rightmost_one64_slow
@@ -422,7 +412,7 @@ pg_rightmost_one64_slow(uint64 word)
 	return result;
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
+#ifdef HAVE__BUILTIN_CLZ
 /*
  * This gets called on the first call. It replaces the function pointer
  * so that subsequent calls are routed directly to the chosen implementation.
@@ -443,7 +433,7 @@ pg_leftmost_one32_abm(uint32 word)
 {
 	return 31 - __builtin_clz(word);
 }
-#endif
+#endif		/* HAVE__BUILTIN_CLZ */
 
 /*
  * pg_leftmost_one32_slow
@@ -463,7 +453,7 @@ pg_leftmost_one32_slow(uint32 word)
 	return shift + leftmost_one_pos[(word >> shift) & 255];
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
+#ifdef HAVE__BUILTIN_CLZ
 /*
  * This gets called on the first call. It replaces the function pointer
  * so that subsequent calls are routed directly to the chosen implementation.
@@ -487,12 +477,10 @@ pg_leftmost_one64_abm(uint64 word)
 #elif defined(HAVE_LONG_LONG_INT_64)
 	return 63 - __builtin_clzll(word);
 #else
-	/* shouldn't happen */
 #error must have a working 64-bit integer datatype
 #endif
-
 }
-#endif
+#endif		/* HAVE__BUILTIN_CLZ */
 
 /*
  * pg_leftmost_one64_slow
diff --git a/src/port/pg_popcount.c b/src/port/pg_popcount.c
new file mode 100644
index 00000000000..5254c41273f
--- /dev/null
+++ b/src/port/pg_popcount.c
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount.c
+ *	  CPU-optimized implementation of pg_popcount
+ *
+ * This file must be compiled with a compiler-specific flag to enable the
+ * POPCOUNT instruction.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "port/pg_bitutils.h"
+
+int
+pg_popcount32_sse42(uint32 word)
+{
+	return __builtin_popcount(word);
+}
+
+int
+pg_popcount64_sse42(uint64 word)
+{
+#if defined(HAVE_LONG_INT_64)
+	return __builtin_popcountl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_popcountll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+}
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#32)
Re: Using POPCNT and other advanced bit manipulation instructions

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Hah, I just realized you have to add -mlzcnt in order for these builtins
to use the lzcnt instructions. It goes from something like

bsrq %rax, %rax
xorq $63, %rax

to
lzcntq %rax, %rax

Significant?

I'd bet a fair amount of money that we'd be better off *not* using
lzcnt, even if available, because then we could just expose things
along this line:

static inline int
pg_clz(...)
{
#ifdef HAVE__BUILTIN_CLZ
return __builtin_clz(x);
#else
handwritten implementation;
#endif
}

Avoiding a function call (that has to indirect through a pointer) probably
saves much more than the difference between lzcnt and the other way.
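
A minimal sketch of the inline pattern described above, for illustration only. The name sketch_leftmost_one32 is hypothetical, and the compiler-identification macros stand in for the configure-time HAVE__BUILTIN_CLZ test, which is not available outside the PostgreSQL build. The fallback is a plain shift loop rather than any table-driven version.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the 0-based position of the most
 * significant set bit.  word must not be 0, since __builtin_clz(0)
 * is undefined.
 */
static inline int
sketch_leftmost_one32(uint32_t word)
{
	assert(word != 0);
#if defined(__GNUC__) || defined(__clang__)
	/* Assumption: stands in for "#ifdef HAVE__BUILTIN_CLZ". */
	return 31 - __builtin_clz(word);
#else
	/* Portable fallback: shift until only the top bit has been consumed. */
	int			pos = 0;

	while (word >>= 1)
		pos++;
	return pos;
#endif
}
```

Because the function is static inline in a header, callers get the builtin expanded in place with no call through a function pointer.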

The tradeoff might be different for popcount, though, especially since
it looks like __builtin_popcount() is not nearly as widely available
as the clz/ctz builtins.

regards, tom lane

#34Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#32)
Re: Using POPCNT and other advanced bit manipulation instructions

Hi,

On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:

On 2019-Feb-14, Tom Lane wrote:

Some further thoughts here ...

Does the "lzcnt" runtime probe actually do anything useful?
On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz
and __builtin_ctz compile to sequences involving bsrq and bsfq
regardless of -mpopcnt. It's fairly hard to see how lzcnt would
buy anything over those sequences even if there were zero overhead
involved in using it.

Hah, I just realized you have to add -mlzcnt in order for these builtins
to use the lzcnt instructions. It goes from something like

bsrq %rax, %rax
xorq $63, %rax

I'm confused how this is a general count-leading-zeros operation. Did you
use constants or something that allowed the compiler to infer a range in
the test? If so, the compiler probably did some optimizations allowing it
to do the above.

to
lzcntq %rax, %rax

Significant?

If I understand Agner's tables correctly, then no, this isn't faster
than the two instructions above.

Greetings,

Andres Freund

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#34)
Re: Using POPCNT and other advanced bit manipulation instructions

Andres Freund <andres@anarazel.de> writes:

On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:

Hah, I just realized you have to add -mlzcnt in order for these builtins
to use the lzcnt instructions. It goes from something like

bsrq %rax, %rax
xorq $63, %rax

I'm confused how this is a general count-leading-zeros operation. Did you
use constants or something that allowed the compiler to infer a range in
the test? If so, the compiler probably did some optimizations allowing it
to do the above.

No. If you compile

int myclz(unsigned long long x)
{
return __builtin_clzll(x);
}

at -O2, on just about any x86_64 gcc, you will get

myclz:
.LFB1:
.cfi_startproc
bsrq %rdi, %rax
xorq $63, %rax
ret
.cfi_endproc
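
As a rough illustration of why that two-instruction sequence is a general clz: bsr yields the index of the highest set bit (0..63), and for any index in that range, XOR with 63 equals 63 minus the index. The sketch below models this portably; sketch_bsr64 and sketch_clz64_via_bsr are hypothetical names, and the loop only stands in for the bsrq instruction. As with the builtin, the result is undefined for an input of 0.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for bsrq: index of the highest set bit, x != 0. */
static int
sketch_bsr64(uint64_t x)
{
	int			idx = 0;

	assert(x != 0);
	while (x >>= 1)
		idx++;
	return idx;
}

/* Leading-zero count via the bsr-then-xor sequence. */
static int
sketch_clz64_via_bsr(uint64_t x)
{
	/* For idx in 0..63, (idx ^ 63) == (63 - idx). */
	return sketch_bsr64(x) ^ 63;
}
```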

regards, tom lane

#36Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#33)
1 attachment(s)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-14, Tom Lane wrote:

I'd bet a fair amount of money that we'd be better off *not* using
lzcnt, even if available, because then we could just expose things
along this line:

static inline int
pg_clz(...)
{
#ifdef HAVE__BUILTIN_CLZ
return __builtin_clz(x);
#else
handwritten implementation;
#endif
}

Avoiding a function call (that has to indirect through a pointer) probably
saves much more than the difference between lzcnt and the other way.

I see ... makes sense.

That leads me to the attached patch. It creates a new file
pg_popcount.c which is the only one compiled with -mpopcnt (if
available); if there's no compiler switch to enable POPCNT, we just
don't compile the file. I'm not sure that's kosher -- in particular I'm
not sure if it can fail when POPCNT is enabled by other flags and
-mpopcnt is not needed at all. I think our c-compiler.m4 stuff is a bit
too simplistic there: it just assumes that -mpopcnt is always required.
But what if the user passes it in CFLAGS?

I left CPUID alone for the CLZ/CTZ builtins. So we either use the
table, or the builtins. We never try the instructions.
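
For readers following along, here is a condensed, single-file sketch of the run-time dispatch that remains for popcount. All sketch_* names are hypothetical. In the attached patch the fast variant lives in its own file compiled with -mpopcnt; here it is only a stand-in using the builtin, and the CPUID probe is compiled only on x86 with a GCC-compatible compiler (an assumption made for this sketch, in place of the configure tests).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

static int	sketch_popcount32_choose(uint32_t word);

/* Starts out pointing at the chooser; rewritten on the first call. */
static int	(*sketch_popcount32) (uint32_t word) = sketch_popcount32_choose;

/* True only if CPUID is usable and reports POPCNT (leaf 1, ECX bit 23). */
static bool
sketch_popcnt_available(void)
{
#ifdef SKETCH_HAVE_CPUID
	unsigned int exx[4] = {0, 0, 0, 0};

	if (!__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]))
		return false;
	return (exx[2] & (1u << 23)) != 0;
#else
	return false;
#endif
}

/* Bit-at-a-time fallback. */
static int
sketch_popcount32_slow(uint32_t word)
{
	int			result = 0;

	while (word != 0)
	{
		result += (int) (word & 1);
		word >>= 1;
	}
	return result;
}

/* Stand-in for the variant that would be built with -mpopcnt. */
static int
sketch_popcount32_fast(uint32_t word)
{
#if defined(__GNUC__) || defined(__clang__)
	return __builtin_popcount(word);
#else
	return sketch_popcount32_slow(word);
#endif
}

/* First call lands here, picks an implementation, then forwards. */
static int
sketch_popcount32_choose(uint32_t word)
{
	if (sketch_popcnt_available())
		sketch_popcount32 = sketch_popcount32_fast;
	else
		sketch_popcount32 = sketch_popcount32_slow;

	return sketch_popcount32(word);
}
```

After the first call, every later call through sketch_popcount32 goes straight to the chosen implementation, at the cost of one indirect call each time.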

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

popcount-2.patch (text/x-diff; charset=us-ascii)
commit 6c771a4f43da0409ae5fa9ff1b1579f381c451c0
Author:     Alvaro Herrera <alvherre@alvh.no-ip.org>
AuthorDate: Thu Feb 14 15:18:03 2019 -0300
CommitDate: Thu Feb 14 19:24:03 2019 -0300

    fix popcount etc

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 148c5550573..635eb9331af 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -10,17 +10,20 @@
  *
  *------------------------------------------------------------------------ -
  */
-
 #ifndef PG_BITUTILS_H
 #define PG_BITUTILS_H
 
-extern int (*pg_popcount32) (uint32 word);
-extern int (*pg_popcount64) (uint64 word);
-extern int (*pg_rightmost_one32) (uint32 word);
-extern int (*pg_rightmost_one64) (uint64 word);
-extern int (*pg_leftmost_one32) (uint32 word);
-extern int (*pg_leftmost_one64) (uint64 word);
+extern int	pg_rightmost_one32(uint32 word);
+extern int	pg_rightmost_one64(uint64 word);
+extern int	pg_leftmost_one32(uint32 word);
+extern int	pg_leftmost_one64(uint64 word);
 
+extern int	(*pg_popcount32) (uint32 word);
+extern int	(*pg_popcount64) (uint64 word);
 extern uint64 pg_popcount(const char *buf, int bytes);
 
+/* in pg_popcount.c */
+extern int	pg_popcount32_sse42(uint32 word);
+extern int	pg_popcount64_sse42(uint64 word);
+
 #endif							/* PG_BITUTILS_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 2da73260a13..d7290573c65 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -41,6 +41,13 @@ OBJS = $(LIBOBJS) $(PG_CRC32C_OBJS) chklocale.o erand48.o inet_net_ntop.o \
 	qsort.o qsort_arg.o quotes.o snprintf.o sprompt.o strerror.o \
 	tar.o thread.o
 
+# If the compiler supports a flag for the POPCOUNT instruction, we compile
+# pg_popcount.o with it.  (Whether to actually use the functions therein is
+# determined at runtime by testing CPUID flags.)
+ifneq ($(CFLAGS_POPCNT),)
+OBJS += pg_popcount.o
+endif
+
 # libpgport.a, libpgport_shlib.a, and libpgport_srv.a contain the same files
 # foo.o, foo_shlib.o, and foo_srv.o are all built from foo.c
 OBJS_SHLIB = $(OBJS:%.o=%_shlib.o)
@@ -78,10 +85,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 
-# pg_bitutils.c needs CFLAGS_POPCNT
-pg_bitutils.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+# pg_popcount.c needs CFLAGS_POPCNT
+pg_popcount.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
 
 #
 # Shared library versions of object files
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index aac394fe927..23d317d111e 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -10,7 +10,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #ifdef HAVE__GET_CPUID
@@ -23,61 +22,21 @@
 
 #include "port/pg_bitutils.h"
 
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+#ifdef HAVE__BUILTIN_POPCOUNT
 static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount32_sse42(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount64_sse42(uint64 word);
-#endif
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-static bool pg_lzcnt_available(void);
-#endif
-
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-static int pg_rightmost_one32_choose(uint32 word);
-static int pg_rightmost_one32_abm(uint32 word);
-static int pg_rightmost_one64_choose(uint64 word);
-static int pg_rightmost_one64_abm(uint64 word);
-#endif
-static int pg_rightmost_one32_slow(uint32 word);
-static int pg_rightmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-static int pg_leftmost_one32_choose(uint32 word);
-static int pg_leftmost_one32_abm(uint32 word);
-static int pg_leftmost_one64_choose(uint64 word);
-static int pg_leftmost_one64_abm(uint64 word);
-#endif
-static int pg_leftmost_one32_slow(uint32 word);
-static int pg_leftmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+static int	pg_popcount32_choose(uint32 word);
+static int	pg_popcount32_builtin(uint32 word);
+static int	pg_popcount64_choose(uint64 word);
+static int	pg_popcount64_builtin(uint64 word);
+int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 #else
-int (*pg_popcount32) (uint32 word) = pg_popcount32_slow;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_slow;
-#endif
+static int	pg_popcount32_slow(uint32 word);
+static int	pg_popcount64_slow(uint64 word);
+int			(*pg_popcount32) (uint32 word) = pg_popcount32_slow;
+int			(*pg_popcount64) (uint64 word) = pg_popcount64_slow;
+#endif							/* !HAVE__BUILTIN_POPCOUNT */
 
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
-int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_choose;
-#else
-int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
-int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_slow;
-#endif
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_choose;
-int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_choose;
-#else
-int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_slow;
-int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_slow;
-#endif
 
 /* Array marking the number of 1-bits for each value of 0-255. */
 static const uint8 number_of_ones[256] = {
@@ -99,6 +58,7 @@ static const uint8 number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
+#ifndef HAVE__BUILTIN_CTZ
 /*
  * Array marking the position of the right-most set bit for each value of
  * 1-255.  We count the right-most position as the 0th bit, and the
@@ -122,7 +82,9 @@ static const uint8 rightmost_one_pos[256] = {
 	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
 	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
 };
+#endif							/* !HAVE__BUILTIN_CTZ */
 
+#ifndef HAVE__BUILTIN_CLZ
 /*
  * Array marking the position of the left-most set bit for each value of
  * 1-255.  We count the right-most position as the 0th bit, and the
@@ -146,31 +108,36 @@ static const uint8 leftmost_one_pos[256] = {
 	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
 	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
 };
+#endif							/* !HAVE__BUILTIN_CLZ */
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+/*
+ * Return true iff we have CPUID support and it indicates that the POPCNT
+ * instruction is available.
+ */
 static bool
 pg_popcount_available(void)
 {
-	unsigned int exx[4] = { 0, 0, 0, 0 };
+#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
+	unsigned int exx[4] = {0, 0, 0, 0};
 
 #if defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
 #endif
 
 	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
+#else							/* HAVE__GET_CPUID || HAVE__CPUID */
+
+	return false;
 #endif
+}
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * This gets called on the first call to pg_popcount32. It replaces the
+ * function pointer so that subsequent calls are routed directly to the chosen
+ * implementation.
  */
 static int
 pg_popcount32_choose(uint32 word)
@@ -178,18 +145,17 @@ pg_popcount32_choose(uint32 word)
 	if (pg_popcount_available())
 		pg_popcount32 = pg_popcount32_sse42;
 	else
-		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount32 = pg_popcount32_builtin;
 
 	return pg_popcount32(word);
 }
 
 static int
-pg_popcount32_sse42(uint32 word)
+pg_popcount32_builtin(uint32 word)
 {
 	return __builtin_popcount(word);
 }
-#endif
-
+#else							/* HAVE__BUILTIN_POPCOUNT */
 /*
  * pg_popcount32_slow
  *		Return the number of 1 bits set in word
@@ -197,7 +163,7 @@ pg_popcount32_sse42(uint32 word)
 static int
 pg_popcount32_slow(uint32 word)
 {
-	int result = 0;
+	int			result = 0;
 
 	while (word != 0)
 	{
@@ -207,6 +173,7 @@ pg_popcount32_slow(uint32 word)
 
 	return result;
 }
+#endif
 
 /*
  * pg_popcount
@@ -215,13 +182,13 @@ pg_popcount32_slow(uint32 word)
 uint64
 pg_popcount(const char *buf, int bytes)
 {
-	uint64 popcnt = 0;
+	uint64		popcnt = 0;
 
 #if SIZEOF_VOID_P >= 8
 	/* Process in 64-bit chunks if the buffer is aligned. */
 	if (buf == (char *) TYPEALIGN(8, buf))
 	{
-		uint64 *words = (uint64 *) buf;
+		uint64	   *words = (uint64 *) buf;
 
 		while (bytes >= 8)
 		{
@@ -235,7 +202,7 @@ pg_popcount(const char *buf, int bytes)
 	/* Process in 32-bit chunks if the buffer is aligned. */
 	if (buf == (char *) TYPEALIGN(4, buf))
 	{
-		uint32 *words = (uint32 *) buf;
+		uint32	   *words = (uint32 *) buf;
 
 		while (bytes >= 4)
 		{
@@ -254,11 +221,11 @@ pg_popcount(const char *buf, int bytes)
 	return popcnt;
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * This gets called on the first call to pg_popcount64. It replaces the
+ * function pointer so that subsequent calls are routed directly to the chosen
+ * implementation.
  */
 static int
 pg_popcount64_choose(uint64 word)
@@ -266,26 +233,24 @@ pg_popcount64_choose(uint64 word)
 	if (pg_popcount_available())
 		pg_popcount64 = pg_popcount64_sse42;
 	else
-		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount64 = pg_popcount64_builtin;
 
 	return pg_popcount64(word);
 }
 
 static int
-pg_popcount64_sse42(uint64 word)
+pg_popcount64_builtin(uint64 word)
 {
 #if defined(HAVE_LONG_INT_64)
 	return __builtin_popcountl(word);
 #elif defined(HAVE_LONG_LONG_INT_64)
 	return __builtin_popcountll(word);
 #else
-	/* shouldn't happen */
 #error must have a working 64-bit integer datatype
 #endif
 }
 
-#endif
-
+#else							/* HAVE__BUILTIN_POPCOUNT */
 /*
  * pg_popcount64_slow
  *		Return the number of 1 bits set in word
@@ -293,7 +258,7 @@ pg_popcount64_sse42(uint64 word)
 static int
 pg_popcount64_slow(uint64 word)
 {
-	int result = 0;
+	int			result = 0;
 
 	while (word != 0)
 	{
@@ -303,156 +268,77 @@ pg_popcount64_slow(uint64 word)
 
 	return result;
 }
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-
-static bool
-pg_lzcnt_available(void)
-{
-
-	unsigned int exx[4] = { 0, 0, 0, 0 };
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(0x80000001, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 0x80000001);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 5)) != 0;	/* LZCNT */
-}
-#endif
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_rightmost_one32_choose(uint32 word)
-{
-	if (pg_lzcnt_available())
-		pg_rightmost_one32 = pg_rightmost_one32_abm;
-	else
-		pg_rightmost_one32 = pg_rightmost_one32_slow;
-
-	return pg_rightmost_one32(word);
-}
-
-static int
-pg_rightmost_one32_abm(uint32 word)
-{
-	return __builtin_ctz(word);
-}
-
 #endif
 
 /*
- * pg_rightmost_one32_slow
+ * pg_rightmost_one32
  *		Returns the number of trailing 0-bits in word, starting at the least
  *		significant bit position. word must not be 0.
  */
-static int
-pg_rightmost_one32_slow(uint32 word)
+int
+pg_rightmost_one32(uint32 word)
 {
-	int result = 0;
+	int			result = 0;
 
 	Assert(word != 0);
 
+#ifdef HAVE__BUILTIN_CTZ
+	result = __builtin_ctz(word);
+#else
 	while ((word & 255) == 0)
 	{
 		word >>= 8;
 		result += 8;
 	}
 	result += rightmost_one_pos[word & 255];
+#endif							/* HAVE__BUILTIN_CTZ */
 
 	return result;
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * pg_rightmost_one64
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
  */
-static int
-pg_rightmost_one64_choose(uint64 word)
+int
+pg_rightmost_one64(uint64 word)
 {
-	if (pg_lzcnt_available())
-		pg_rightmost_one64 = pg_rightmost_one64_abm;
-	else
-		pg_rightmost_one64 = pg_rightmost_one64_slow;
+	int			result = 0;
 
-	return pg_rightmost_one64(word);
-}
+	Assert(word != 0);
 
-static int
-pg_rightmost_one64_abm(uint64 word)
-{
+#ifdef HAVE__BUILTIN_CTZ
 #if defined(HAVE_LONG_INT_64)
 	return __builtin_ctzl(word);
 #elif defined(HAVE_LONG_LONG_INT_64)
 	return __builtin_ctzll(word);
 #else
-	/* shouldn't happen */
 #error must have a working 64-bit integer datatype
 #endif
-}
-#endif
-
-/*
- * pg_rightmost_one64_slow
- *		Returns the number of trailing 0-bits in word, starting at the least
- *		significant bit position. word must not be 0.
- */
-static int
-pg_rightmost_one64_slow(uint64 word)
-{
-	int result = 0;
-
-	Assert(word != 0);
-
+#else							/* HAVE__BUILTIN_CTZ */
 	while ((word & 255) == 0)
 	{
 		word >>= 8;
 		result += 8;
 	}
 	result += rightmost_one_pos[word & 255];
+#endif
 
 	return result;
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_leftmost_one32_choose(uint32 word)
-{
-	if (pg_lzcnt_available())
-		pg_leftmost_one32 = pg_leftmost_one32_abm;
-	else
-		pg_leftmost_one32 = pg_leftmost_one32_slow;
-
-	return pg_leftmost_one32(word);
-}
-
-static int
-pg_leftmost_one32_abm(uint32 word)
-{
-	return 31 - __builtin_clz(word);
-}
-#endif
-
-/*
- * pg_leftmost_one32_slow
+ * pg_leftmost_one32
  *		Returns the 0-based position of the most significant set bit in word
  *		measured from the least significant bit.  word must not be 0.
  */
-static int
-pg_leftmost_one32_slow(uint32 word)
+int
+pg_leftmost_one32(uint32 word)
 {
+#ifdef HAVE__BUILTIN_CLZ
+	return 31 - __builtin_clz(word);
+#else
 	int			shift = 32 - 8;
 
 	Assert(word != 0);
@@ -461,53 +347,32 @@ pg_leftmost_one32_slow(uint32 word)
 		shift -= 8;
 
 	return shift + leftmost_one_pos[(word >> shift) & 255];
+#endif							/* HAVE__BUILTIN_CLZ */
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * pg_leftmost_one64
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
  */
-static int
-pg_leftmost_one64_choose(uint64 word)
-{
-	if (pg_lzcnt_available())
-		pg_leftmost_one64 = pg_leftmost_one64_abm;
-	else
-		pg_leftmost_one64 = pg_leftmost_one64_slow;
-
-	return pg_leftmost_one64(word);
-}
-
-static int
-pg_leftmost_one64_abm(uint64 word)
+int
+pg_leftmost_one64(uint64 word)
 {
+#ifdef HAVE__BUILTIN_CLZ
 #if defined(HAVE_LONG_INT_64)
 	return 63 - __builtin_clzl(word);
 #elif defined(HAVE_LONG_LONG_INT_64)
 	return 63 - __builtin_clzll(word);
 #else
-	/* shouldn't happen */
 #error must have a working 64-bit integer datatype
 #endif
-
-}
-#endif
-
-/*
- * pg_leftmost_one64_slow
- *		Returns the 0-based position of the most significant set bit in word
- *		measured from the least significant bit.  word must not be 0.
- */
-static int
-pg_leftmost_one64_slow(uint64 word)
-{
+#else							/* HAVE__BUILTIN_CLZ */
 	int			shift = 64 - 8;
 
 	Assert(word != 0);
-
 	while ((word >> shift) == 0)
 		shift -= 8;
 
 	return shift + leftmost_one_pos[(word >> shift) & 255];
+#endif							/* !HAVE__BUILTIN_CLZ */
 }
diff --git a/src/port/pg_popcount.c b/src/port/pg_popcount.c
new file mode 100644
index 00000000000..5254c41273f
--- /dev/null
+++ b/src/port/pg_popcount.c
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount.c
+ *	  CPU-optimized implementation of pg_popcount
+ *
+ * This file must be compiled with a compiler-specific flag to enable the
+ * POPCOUNT instruction.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "port/pg_bitutils.h"
+
+int
+pg_popcount32_sse42(uint32 word)
+{
+	return __builtin_popcount(word);
+}
+
+int
+pg_popcount64_sse42(uint64 word)
+{
+#if defined(HAVE_LONG_INT_64)
+	return __builtin_popcountl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_popcountll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+}
#37Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#33)
1 attachment(s)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-14, Tom Lane wrote:

static inline int
pg_clz(...)

Hmm, I missed this bit. So we put all these functions in the header, as
in the attached.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

popcount-3.patch (text/x-diff; charset=us-ascii)
commit d4265b36754720f85bf2ac6fd9ae2c58b8e1abc2
Author:     Alvaro Herrera <alvherre@alvh.no-ip.org>
AuthorDate: Thu Feb 14 15:18:03 2019 -0300
CommitDate: Thu Feb 14 19:41:41 2019 -0300

    fix popcount etc

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 148c5550573..72dfd1d2695 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -10,17 +10,175 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #ifndef PG_BITUTILS_H
 #define PG_BITUTILS_H
 
-extern int (*pg_popcount32) (uint32 word);
-extern int (*pg_popcount64) (uint64 word);
-extern int (*pg_rightmost_one32) (uint32 word);
-extern int (*pg_rightmost_one64) (uint64 word);
-extern int (*pg_leftmost_one32) (uint32 word);
-extern int (*pg_leftmost_one64) (uint64 word);
-
+extern int	(*pg_popcount32) (uint32 word);
+extern int	(*pg_popcount64) (uint64 word);
 extern uint64 pg_popcount(const char *buf, int bytes);
 
+/* in pg_popcount.c */
+extern int	pg_popcount32_sse42(uint32 word);
+extern int	pg_popcount64_sse42(uint64 word);
+
+
+#ifndef HAVE__BUILTIN_CTZ
+/*
+ * Array marking the position of the right-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 rightmost_one_pos[256] = {
+	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
+};
+#endif							/* !HAVE__BUILTIN_CTZ */
+
+/*
+ * pg_rightmost_one32
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static inline int
+pg_rightmost_one32(uint32 word)
+{
+	int			result = 0;
+
+	Assert(word != 0);
+
+#ifdef HAVE__BUILTIN_CTZ
+	result = __builtin_ctz(word);
+#else
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+#endif							/* HAVE__BUILTIN_CTZ */
+
+	return result;
+}
+
+/*
+ * pg_rightmost_one64
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static inline int
+pg_rightmost_one64(uint64 word)
+{
+	int			result = 0;
+
+	Assert(word != 0);
+
+#ifdef HAVE__BUILTIN_CTZ
+#if defined(HAVE_LONG_INT_64)
+	return __builtin_ctzl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_ctzll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+#else							/* HAVE__BUILTIN_CTZ */
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+#endif
+
+	return result;
+}
+
+#ifndef HAVE__BUILTIN_CLZ
+/*
+ * Array marking the position of the left-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 leftmost_one_pos[256] = {
+	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
+	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
+};
+#endif							/* !HAVE__BUILTIN_CLZ */
+
+/*
+ * pg_leftmost_one32
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+static inline int
+pg_leftmost_one32(uint32 word)
+{
+#ifdef HAVE__BUILTIN_CLZ
+	return 31 - __builtin_clz(word);
+#else
+	int			shift = 32 - 8;
+
+	Assert(word != 0);
+
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+#endif							/* HAVE__BUILTIN_CLZ */
+}
+
+/*
+ * pg_leftmost_one64
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+static inline int
+pg_leftmost_one64(uint64 word)
+{
+#ifdef HAVE__BUILTIN_CLZ
+#if defined(HAVE_LONG_INT_64)
+	return 63 - __builtin_clzl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return 63 - __builtin_clzll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+#else							/* HAVE__BUILTIN_CLZ */
+	int			shift = 64 - 8;
+
+	Assert(word != 0);
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+#endif							/* !HAVE__BUILTIN_CLZ */
+}
+
 #endif							/* PG_BITUTILS_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 2da73260a13..d7290573c65 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -41,6 +41,13 @@ OBJS = $(LIBOBJS) $(PG_CRC32C_OBJS) chklocale.o erand48.o inet_net_ntop.o \
 	qsort.o qsort_arg.o quotes.o snprintf.o sprompt.o strerror.o \
 	tar.o thread.o
 
+# If the compiler supports a flag for the POPCNT instruction, we compile
+# pg_popcount.o with it.  (Whether to actually use the functions therein is
+# determined at runtime by testing CPUID flags.)
+ifneq ($(CFLAGS_POPCNT),)
+OBJS += pg_popcount.o
+endif
+
 # libpgport.a, libpgport_shlib.a, and libpgport_srv.a contain the same files
 # foo.o, foo_shlib.o, and foo_srv.o are all built from foo.c
 OBJS_SHLIB = $(OBJS:%.o=%_shlib.o)
@@ -78,10 +85,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 
-# pg_bitutils.c needs CFLAGS_POPCNT
-pg_bitutils.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+# pg_popcount.c needs CFLAGS_POPCNT
+pg_popcount.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
 
 #
 # Shared library versions of object files
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index aac394fe927..3d26883111f 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -10,7 +10,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #ifdef HAVE__GET_CPUID
@@ -23,61 +22,21 @@
 
 #include "port/pg_bitutils.h"
 
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+#ifdef HAVE__BUILTIN_POPCOUNT
 static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount32_sse42(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount64_sse42(uint64 word);
-#endif
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-static bool pg_lzcnt_available(void);
-#endif
-
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-static int pg_rightmost_one32_choose(uint32 word);
-static int pg_rightmost_one32_abm(uint32 word);
-static int pg_rightmost_one64_choose(uint64 word);
-static int pg_rightmost_one64_abm(uint64 word);
-#endif
-static int pg_rightmost_one32_slow(uint32 word);
-static int pg_rightmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-static int pg_leftmost_one32_choose(uint32 word);
-static int pg_leftmost_one32_abm(uint32 word);
-static int pg_leftmost_one64_choose(uint64 word);
-static int pg_leftmost_one64_abm(uint64 word);
-#endif
-static int pg_leftmost_one32_slow(uint32 word);
-static int pg_leftmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+static int	pg_popcount32_choose(uint32 word);
+static int	pg_popcount32_builtin(uint32 word);
+static int	pg_popcount64_choose(uint64 word);
+static int	pg_popcount64_builtin(uint64 word);
+int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 #else
-int (*pg_popcount32) (uint32 word) = pg_popcount32_slow;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_slow;
-#endif
+static int	pg_popcount32_slow(uint32 word);
+static int	pg_popcount64_slow(uint64 word);
+int			(*pg_popcount32) (uint32 word) = pg_popcount32_slow;
+int			(*pg_popcount64) (uint64 word) = pg_popcount64_slow;
+#endif							/* !HAVE__BUILTIN_POPCOUNT */
 
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
-int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_choose;
-#else
-int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
-int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_slow;
-#endif
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_choose;
-int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_choose;
-#else
-int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_slow;
-int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_slow;
-#endif
 
 /* Array marking the number of 1-bits for each value of 0-255. */
 static const uint8 number_of_ones[256] = {
@@ -100,77 +59,33 @@ static const uint8 number_of_ones[256] = {
 };
 
 /*
- * Array marking the position of the right-most set bit for each value of
- * 1-255.  We count the right-most position as the 0th bit, and the
- * left-most the 7th bit.  The 0th index of the array must not be used.
+ * Return true iff we have CPUID support and it indicates that the POPCNT
+ * instruction is available.
  */
-static const uint8 rightmost_one_pos[256] = {
-	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
-};
-
-/*
- * Array marking the position of the left-most set bit for each value of
- * 1-255.  We count the right-most position as the 0th bit, and the
- * left-most the 7th bit.  The 0th index of the array must not be used.
- */
-static const uint8 leftmost_one_pos[256] = {
-	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
-	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
-};
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
 static bool
 pg_popcount_available(void)
 {
-	unsigned int exx[4] = { 0, 0, 0, 0 };
+#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
+	unsigned int exx[4] = {0, 0, 0, 0};
 
 #if defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
 #endif
 
 	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
+#else							/* HAVE__GET_CPUID || HAVE__CPUID */
+
+	return false;
 #endif
+}
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * This gets called on the first call to pg_popcount32. It replaces the
+ * function pointer so that subsequent calls are routed directly to the chosen
+ * implementation.
  */
 static int
 pg_popcount32_choose(uint32 word)
@@ -178,18 +93,17 @@ pg_popcount32_choose(uint32 word)
 	if (pg_popcount_available())
 		pg_popcount32 = pg_popcount32_sse42;
 	else
-		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount32 = pg_popcount32_builtin;
 
 	return pg_popcount32(word);
 }
 
 static int
-pg_popcount32_sse42(uint32 word)
+pg_popcount32_builtin(uint32 word)
 {
 	return __builtin_popcount(word);
 }
-#endif
-
+#else							/* HAVE__BUILTIN_POPCOUNT */
 /*
  * pg_popcount32_slow
  *		Return the number of 1 bits set in word
@@ -197,7 +111,7 @@ pg_popcount32_sse42(uint32 word)
 static int
 pg_popcount32_slow(uint32 word)
 {
-	int result = 0;
+	int			result = 0;
 
 	while (word != 0)
 	{
@@ -207,6 +121,7 @@ pg_popcount32_slow(uint32 word)
 
 	return result;
 }
+#endif
 
 /*
  * pg_popcount
@@ -215,13 +130,13 @@ pg_popcount32_slow(uint32 word)
 uint64
 pg_popcount(const char *buf, int bytes)
 {
-	uint64 popcnt = 0;
+	uint64		popcnt = 0;
 
 #if SIZEOF_VOID_P >= 8
 	/* Process in 64-bit chunks if the buffer is aligned. */
 	if (buf == (char *) TYPEALIGN(8, buf))
 	{
-		uint64 *words = (uint64 *) buf;
+		uint64	   *words = (uint64 *) buf;
 
 		while (bytes >= 8)
 		{
@@ -235,7 +150,7 @@ pg_popcount(const char *buf, int bytes)
 	/* Process in 32-bit chunks if the buffer is aligned. */
 	if (buf == (char *) TYPEALIGN(4, buf))
 	{
-		uint32 *words = (uint32 *) buf;
+		uint32	   *words = (uint32 *) buf;
 
 		while (bytes >= 4)
 		{
@@ -254,11 +169,11 @@ pg_popcount(const char *buf, int bytes)
 	return popcnt;
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * This gets called on the first call to pg_popcount64. It replaces the
+ * function pointer so that subsequent calls are routed directly to the chosen
+ * implementation.
  */
 static int
 pg_popcount64_choose(uint64 word)
@@ -266,26 +181,24 @@ pg_popcount64_choose(uint64 word)
 	if (pg_popcount_available())
 		pg_popcount64 = pg_popcount64_sse42;
 	else
-		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount64 = pg_popcount64_builtin;
 
 	return pg_popcount64(word);
 }
 
 static int
-pg_popcount64_sse42(uint64 word)
+pg_popcount64_builtin(uint64 word)
 {
 #if defined(HAVE_LONG_INT_64)
 	return __builtin_popcountl(word);
 #elif defined(HAVE_LONG_LONG_INT_64)
 	return __builtin_popcountll(word);
 #else
-	/* shouldn't happen */
 #error must have a working 64-bit integer datatype
 #endif
 }
 
-#endif
-
+#else							/* HAVE__BUILTIN_POPCOUNT */
 /*
  * pg_popcount64_slow
  *		Return the number of 1 bits set in word
@@ -293,7 +206,7 @@ pg_popcount64_sse42(uint64 word)
 static int
 pg_popcount64_slow(uint64 word)
 {
-	int result = 0;
+	int			result = 0;
 
 	while (word != 0)
 	{
@@ -303,211 +216,4 @@ pg_popcount64_slow(uint64 word)
 
 	return result;
 }
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-
-static bool
-pg_lzcnt_available(void)
-{
-
-	unsigned int exx[4] = { 0, 0, 0, 0 };
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(0x80000001, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 0x80000001);
-#else
-#error cpuid instruction not available
 #endif
-
-	return (exx[2] & (1 << 5)) != 0;	/* LZCNT */
-}
-#endif
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_rightmost_one32_choose(uint32 word)
-{
-	if (pg_lzcnt_available())
-		pg_rightmost_one32 = pg_rightmost_one32_abm;
-	else
-		pg_rightmost_one32 = pg_rightmost_one32_slow;
-
-	return pg_rightmost_one32(word);
-}
-
-static int
-pg_rightmost_one32_abm(uint32 word)
-{
-	return __builtin_ctz(word);
-}
-
-#endif
-
-/*
- * pg_rightmost_one32_slow
- *		Returns the number of trailing 0-bits in word, starting at the least
- *		significant bit position. word must not be 0.
- */
-static int
-pg_rightmost_one32_slow(uint32 word)
-{
-	int result = 0;
-
-	Assert(word != 0);
-
-	while ((word & 255) == 0)
-	{
-		word >>= 8;
-		result += 8;
-	}
-	result += rightmost_one_pos[word & 255];
-
-	return result;
-}
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_rightmost_one64_choose(uint64 word)
-{
-	if (pg_lzcnt_available())
-		pg_rightmost_one64 = pg_rightmost_one64_abm;
-	else
-		pg_rightmost_one64 = pg_rightmost_one64_slow;
-
-	return pg_rightmost_one64(word);
-}
-
-static int
-pg_rightmost_one64_abm(uint64 word)
-{
-#if defined(HAVE_LONG_INT_64)
-	return __builtin_ctzl(word);
-#elif defined(HAVE_LONG_LONG_INT_64)
-	return __builtin_ctzll(word);
-#else
-	/* shouldn't happen */
-#error must have a working 64-bit integer datatype
-#endif
-}
-#endif
-
-/*
- * pg_rightmost_one64_slow
- *		Returns the number of trailing 0-bits in word, starting at the least
- *		significant bit position. word must not be 0.
- */
-static int
-pg_rightmost_one64_slow(uint64 word)
-{
-	int result = 0;
-
-	Assert(word != 0);
-
-	while ((word & 255) == 0)
-	{
-		word >>= 8;
-		result += 8;
-	}
-	result += rightmost_one_pos[word & 255];
-
-	return result;
-}
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_leftmost_one32_choose(uint32 word)
-{
-	if (pg_lzcnt_available())
-		pg_leftmost_one32 = pg_leftmost_one32_abm;
-	else
-		pg_leftmost_one32 = pg_leftmost_one32_slow;
-
-	return pg_leftmost_one32(word);
-}
-
-static int
-pg_leftmost_one32_abm(uint32 word)
-{
-	return 31 - __builtin_clz(word);
-}
-#endif
-
-/*
- * pg_leftmost_one32_slow
- *		Returns the 0-based position of the most significant set bit in word
- *		measured from the least significant bit.  word must not be 0.
- */
-static int
-pg_leftmost_one32_slow(uint32 word)
-{
-	int			shift = 32 - 8;
-
-	Assert(word != 0);
-
-	while ((word >> shift) == 0)
-		shift -= 8;
-
-	return shift + leftmost_one_pos[(word >> shift) & 255];
-}
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_leftmost_one64_choose(uint64 word)
-{
-	if (pg_lzcnt_available())
-		pg_leftmost_one64 = pg_leftmost_one64_abm;
-	else
-		pg_leftmost_one64 = pg_leftmost_one64_slow;
-
-	return pg_leftmost_one64(word);
-}
-
-static int
-pg_leftmost_one64_abm(uint64 word)
-{
-#if defined(HAVE_LONG_INT_64)
-	return 63 - __builtin_clzl(word);
-#elif defined(HAVE_LONG_LONG_INT_64)
-	return 63 - __builtin_clzll(word);
-#else
-	/* shouldn't happen */
-#error must have a working 64-bit integer datatype
-#endif
-
-}
-#endif
-
-/*
- * pg_leftmost_one64_slow
- *		Returns the 0-based position of the most significant set bit in word
- *		measured from the least significant bit.  word must not be 0.
- */
-static int
-pg_leftmost_one64_slow(uint64 word)
-{
-	int			shift = 64 - 8;
-
-	Assert(word != 0);
-
-	while ((word >> shift) == 0)
-		shift -= 8;
-
-	return shift + leftmost_one_pos[(word >> shift) & 255];
-}
diff --git a/src/port/pg_popcount.c b/src/port/pg_popcount.c
new file mode 100644
index 00000000000..5254c41273f
--- /dev/null
+++ b/src/port/pg_popcount.c
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount.c
+ *	  CPU-optimized implementation of pg_popcount
+ *
+ * This file must be compiled with a compiler-specific flag to enable the
+ * POPCNT instruction.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "port/pg_bitutils.h"
+
+int
+pg_popcount32_sse42(uint32 word)
+{
+	return __builtin_popcount(word);
+}
+
+int
+pg_popcount64_sse42(uint64 word)
+{
+#if defined(HAVE_LONG_INT_64)
+	return __builtin_popcountl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_popcountll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+}
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#36)
Re: Using POPCNT and other advanced bit manipulation instructions

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

That leads me to the attached patch. It creates a new file
pg_popcount.c which is the only one compiled with -mpopcnt (if
available); if there's no compiler switch to enable POPCNT, we just
don't compile the file. I'm not sure that's kosher -- in particular I'm
not sure if it can fail when POPCNT is enabled by other flags and
-mpopcnt is not needed at all. I think our c-compiler.m4 stuff is a bit
too simplistic there: it just assumes that -mpopcnt is always required.

Yes, the configure test for this stuff is really pretty broken.
It's conflating two nearly independent questions: (1) does the compiler
have __builtin_popcount(), and (2) does the compiler accept -mpopcnt.
It is certainly the case that (1) may hold without (2); in fact, every
recent non-x86_64 gcc is a counterexample to how that's done in HEAD.

I think we need a clean test for __builtin_popcount(), and to be willing
to use it if available, independently of -mpopcnt. Then separately we
should test to see if -mpopcnt works, probably with the same
infrastructure we use for checking for other compiler flags, viz

   # Optimization flags for specific files that benefit from vectorization
   PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-funroll-loops])
   PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-ftree-vectorize])
+  # Optimization flags for bit-twiddling
+  PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt])
   # We want to suppress clang's unhelpful unused-command-line-argument warnings

Then the correct test to see if we want to build pg_popcount.c (BTW,
please pick a less generic name for that) and the choose function
is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty
CFLAGS_POPCNT.

I don't think this'd be fooled by user-specified CFLAGS. The worst
possible outcome is that it builds a function that we intended would
use POPCNT but it's falling back to some other implementation, in
case the compiler has a switch named -mpopcnt but it doesn't do what
we think it does, or the user overrode things by adding -mno-popcnt.
That would really be nearly cost-free, other than the overhead of
the choose function the first time through: both of the execution
functions would be reducing to __builtin_popcount(), for whatever
version of that the compiler is giving us, so the choice wouldn't
matter.
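
The choose-function arrangement described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the names my_popcount32, popcount32_impl, popcount32_fast, popcount32_portable and cpu_has_popcnt are hypothetical, and in a real build the "fast" function would sit in a separate file compiled with -mpopcnt. The CPUID bit tested (leaf 1, ECX bit 23) is the one used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* Hypothetical names; not the identifiers used in the patch. */
static int	popcount32_choose(uint32_t word);

/* Starts out pointing at the chooser; replaced on the first call. */
static int	(*popcount32_impl) (uint32_t word) = popcount32_choose;

/* True if CPUID leaf 1 reports POPCNT (ECX bit 23); false elsewhere. */
static bool
cpu_has_popcnt(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return (ecx & (1u << 23)) != 0;
#endif
	return false;
}

/* Portable bit-at-a-time fallback. */
static int
popcount32_portable(uint32_t word)
{
	int			result = 0;

	while (word != 0)
	{
		result += (int) (word & 1);
		word >>= 1;
	}
	return result;
}

/* Assumption: in a real build this lives in a file compiled with -mpopcnt. */
static int
popcount32_fast(uint32_t word)
{
#if defined(__GNUC__)
	return __builtin_popcount(word);
#else
	return popcount32_portable(word);
#endif
}

/* First call lands here, picks an implementation, then forwards the call. */
static int
popcount32_choose(uint32_t word)
{
	popcount32_impl = cpu_has_popcnt() ? popcount32_fast : popcount32_portable;
	assert(popcount32_impl != popcount32_choose);
	return popcount32_impl(word);
}

/* Public entry point: one indirect call after the first invocation. */
int
my_popcount32(uint32_t word)
{
	return popcount32_impl(word);
}
```

Whichever branch the chooser takes, callers see the same result, for example my_popcount32(0xFF) yields 8; only the instruction sequence behind it differs.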

regards, tom lane

#39Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#38)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-14, Tom Lane wrote:

I think we need a clean test for __builtin_popcount(), and to be willing
to use it if available, independently of -mpopcnt. Then separately we
should test to see if -mpopcnt works, probably with the same
infrastructure we use for checking for other compiler flags, viz

# Optimization flags for specific files that benefit from vectorization
PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-funroll-loops])
PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-ftree-vectorize])
+  # Optimization flags for bit-twiddling
+  PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt])
# We want to suppress clang's unhelpful unused-command-line-argument warnings

Then the correct test to see if we want to build pg_popcount.c (BTW,
please pick a less generic name for that) and the choose function
is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty
CFLAGS_POPCNT.

Yeah, this works. I'll post the patch tomorrow.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#40Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Tom Lane (#35)
Re: Using POPCNT and other advanced bit manipulation instructions

At Thu, 14 Feb 2019 16:45:38 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <822.1550180738@sss.pgh.pa.us>

Andres Freund <andres@anarazel.de> writes:

On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:

Hah, I just realized you have to add -mlzcnt in order for these builtins
to use the lzcnt instructions. It goes from something like

bsrq %rax, %rax
xorq $63, %rax

I'm confused how this is a general count leading zero operation? Did you
use constants or something that allowed it to infer a range in the test? If
so the compiler probably did some optimizations allowing it to do the
above.

No. If you compile

int myclz(unsigned long long x)
{
return __builtin_clzll(x);
}

at -O2, on just about any x86_64 gcc, you will get

myclz:
.LFB1:
.cfi_startproc
bsrq %rdi, %rax
xorq $63, %rax
ret
.cfi_endproc

As I understand it, the behavior of __builtin_c[tl]z(0) is
undefined because these builtins compile to bs[rf]. So if we use
these builtins, an additional check for zero is required.

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most
significant bit position. If x is 0, the result is undefined.

Built-in Function: int __builtin_ctz (unsigned int x)
Returns the number of trailing 0-bits in x, starting at the
least significant bit position. If x is 0, the result is
undefined.

Even worse, lzcntx accidentally "falls back" to bsrx on
unsupported CPUs, which leads to bogus results:
__builtin_clzll(8) = 3, when it should be 60.
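
The two numbers above differ by a fixed transformation, which a small portable sketch can make concrete. The helper names here are hypothetical and no special instructions are used. BSR reports the index of the highest set bit, while LZCNT reports the number of leading zero bits; for a nonzero 64-bit input the second is 63 minus the first. LZCNT shares its opcode with BSR apart from a prefix that CPUs without LZCNT ignore, so such a CPU silently returns the BSR value.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit, which is what BSR reports; x must not be 0. */
int
highest_set_bit64(uint64_t x)
{
	int			pos = 0;

	assert(x != 0);
	while (x >>= 1)
		pos++;
	return pos;
}

/* Number of leading zero bits, which is what LZCNT reports; x must not be 0. */
int
leading_zeros64(uint64_t x)
{
	return 63 - highest_set_bit64(x);
}

/*
 * The same value computed the way the quoted assembly does it: for any
 * p in 0..63, 63 - p equals p ^ 63, hence the xorq $63 after bsrq.
 */
int
leading_zeros64_xor(uint64_t x)
{
	return highest_set_bit64(x) ^ 63;
}
```

For x = 8 the highest set bit is at index 3 and there are 60 leading zeros, matching the 3 versus 60 discrepancy described above.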

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#41Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro HORIGUCHI (#40)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-15, Kyotaro HORIGUCHI wrote:

As I understand it, the behavior of __builtin_c[tl]z(0) is
undefined because these builtins compile to bs[rf]. So if we use
these builtins, an additional check for zero is required.

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Okay -- the functions check for a 0 argument:

+static inline int
+pg_rightmost_one32(uint32 word)
+{
+   int         result = 0;
+
+   Assert(word != 0);
+
+#ifdef HAVE__BUILTIN_CTZ
+   result = __builtin_ctz(word);
+#else
+   while ((word & 255) == 0)
+   {
+       word >>= 8;
+       result += 8;
+   }
+   result += rightmost_one_pos[word & 255];
+#endif                         /* HAVE__BUILTIN_CTZ */
+
+   return result;
+}

so we're fine.
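
As a rough illustration of the fallback branch in the function quoted above, here is a self-contained sketch that can be compared against the builtin for nonzero input. It is not the patch code: rightmost_one_in_byte() is a hypothetical stand-in for the rightmost_one_pos[] lookup table, and the other names are likewise made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Position of the lowest set bit in a nonzero byte.  Hypothetical helper
 * standing in for the patch's rightmost_one_pos[] lookup table.
 */
static int
rightmost_one_in_byte(uint8_t b)
{
	int			pos = 0;

	assert(b != 0);
	while ((b & 1) == 0)
	{
		b >>= 1;
		pos++;
	}
	return pos;
}

/* Byte-at-a-time fallback, following the loop in the quoted function. */
int
rightmost_one32_fallback(uint32_t word)
{
	int			result = 0;

	assert(word != 0);
	while ((word & 255) == 0)
	{
		word >>= 8;
		result += 8;
	}
	return result + rightmost_one_in_byte((uint8_t) (word & 255));
}

/* Returns 1 if the fallback agrees with __builtin_ctz for a nonzero word. */
int
fallback_matches_builtin(uint32_t word)
{
#if defined(__GNUC__)
	return rightmost_one32_fallback(word) == __builtin_ctz(word);
#else
	return 1;					/* no builtin to compare against */
#endif
}
```

Both paths assume a nonzero argument, as in the quoted function; passing 0 is outside what either one is expected to handle.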

Even worse, lzcntx accidentally "falls back" to bsrx on
unsupported CPUs, which leads to bogus results:
__builtin_clzll(8) = 3, when it should be 60.

I'm not sure I understand this point. Are you saying that if we use
-mlzcnt to compile, then the compiler builtin is broken in platforms
that don't support the lzcnt instruction? That's horrible. Let's stay
away from -mlzcnt then.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#42Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#41)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-15, Alvaro Herrera wrote:

On 2019-Feb-15, Kyotaro HORIGUCHI wrote:

Even worse, lzcntx accidentally "falls back" to bsrx on
unsupported CPUs, which leads to bogus results:
__builtin_clzll(8) = 3, when it should be 60.

I'm not sure I understand this point. Are you saying that if we use
-mlzcnt to compile, then the compiler builtin is broken in platforms
that don't support the lzcnt instruction? That's horrible. Let's stay
away from -mlzcnt then.

Ah, I understand it now:
https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701
if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise
SIGILL or anything ... it'll just silently compute the wrong result.
That's certainly not what I call a fallback!

I think David's code was correct because it was testing CPUID for
instruction support before using the specialized code (well, except that
he forgot to add the right compiler option to *enable* the LZCNT/TZCNT
instructions in the first place); but given subsequent discussion that
the instruction is not worth it anyway, we might as well ignore it.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#43Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#38)
1 attachment(s)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-14, Tom Lane wrote:

Then the correct test to see if we want to build pg_popcount.c (BTW,
please pick a less generic name for that) and the choose function
is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty
CFLAGS_POPCNT.

I used pg_bitutils_sse42.c to host the specially-compiled functions.
The name is not entirely correct, but seems clear enough.

I noticed in Compiler Explorer that some (ancient?) Power CPUs
implement a "popcntb" instruction, and GCC's support for them uses the
-mpopcntb switch, which enables __builtin_popcount() to use it. I added
the switch to configure.in, but I'm not sure how well that will work ... I
don't know if this is represented in the buildfarm.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Fix-compiler-builtin-usage-in-new-pg_bitutils.c.patch (text/x-diff; charset=us-ascii)
From a3c654f9446ed0f8ead57d4f7202554311135dbf Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 15 Feb 2019 11:14:04 -0300
Subject: [PATCH] Fix compiler builtin usage in new pg_bitutils.c

Split out these new functions in three parts: one in a new file that
uses the compiler builtin and gets compiled with the -mpopcnt compiler
option if it exists; another one that uses the compiler builtin but not
the compiler option; and finally the fallback with open-coded
algorithms.

Split out the configure logic: in the original commit, it was selecting
to use the -mpopcnt compiler switch together with deciding to use the
compiler builtin.  However, some compilers implement the builtin even
though they don't have the compiler switch, so split both things.  Also,
expose whether the builtin exists to Makefile.global, so that src/port's
Makefile can decide whether to compile the special file.

Remove CPUID test for CTZ/CLZ.
---
 config/c-compiler.m4           |  22 +-
 configure                      | 105 +++++++--
 configure.in                   |   7 +-
 src/Makefile.global.in         |   3 +
 src/include/port/pg_bitutils.h | 176 +++++++++++++++-
 src/port/Makefile              |  16 +-
 src/port/pg_bitutils.c         | 374 ++++-----------------------------
 src/port/pg_bitutils_sse42.c   |  36 ++++
 8 files changed, 365 insertions(+), 374 deletions(-)
 create mode 100644 src/port/pg_bitutils_sse42.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 05fa82518f8..7c0d52b515f 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -381,22 +381,16 @@ fi])# PGAC_C_BUILTIN_OP_OVERFLOW
 # PGAC_C_BUILTIN_POPCOUNT
 # -------------------------
 AC_DEFUN([PGAC_C_BUILTIN_POPCOUNT],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_popcount])])dnl
-AC_CACHE_CHECK([for __builtin_popcount], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mpopcnt"
-AC_COMPILE_IFELSE([AC_LANG_SOURCE(
-[static int x = __builtin_popcount(255);])],
-[Ac_cachevar=yes],
-[Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
-if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_POPCNT="-mpopcnt"
+[AC_CACHE_CHECK([for __builtin_popcount], pgac_cv__builtin_popcount,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_popcount(255);]
+)],
+[pgac_cv__builtin_popcount=yes],
+[pgac_cv__builtin_popcount=no])])
+if test x"$pgac_cv__builtin_popcount" = x"yes"; then
 AC_DEFINE(HAVE__BUILTIN_POPCOUNT, 1,
           [Define to 1 if your compiler understands __builtin_popcount.])
-fi
-undefine([Ac_cachevar])dnl
-])# PGAC_C_BUILTIN_POPCOUNT
+fi])# PGAC_C_BUILTIN_POPCOUNT
 
 
 
diff --git a/configure b/configure
index 73e9c235b69..fa0f1216a0a 100755
--- a/configure
+++ b/configure
@@ -651,7 +651,7 @@ CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
 have_win32_dbghelp
 LIBOBJS
-CFLAGS_POPCNT
+have__builtin_popcount
 UUID_LIBS
 LDAP_LIBS_BE
 LDAP_LIBS_FE
@@ -733,6 +733,7 @@ CPP
 BITCODE_CXXFLAGS
 BITCODE_CFLAGS
 CFLAGS_VECTOR
+CFLAGS_POPCNT
 PERMIT_DECLARATION_AFTER_STATEMENT
 LLVM_BINPATH
 LLVM_CXXFLAGS
@@ -6581,6 +6582,87 @@ fi
 
 fi
 
+# Optimization flags and options for bit-twiddling
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -mpopcnt, for CFLAGS_POPCNT" >&5
+$as_echo_n "checking whether ${CC} supports -mpopcnt, for CFLAGS_POPCNT... " >&6; }
+if ${pgac_cv_prog_CC_cflags__mpopcnt+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+pgac_save_CC=$CC
+CC=${CC}
+CFLAGS="${CFLAGS_POPCNT} -mpopcnt"
+ac_save_c_werror_flag=$ac_c_werror_flag
+ac_c_werror_flag=yes
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_prog_CC_cflags__mpopcnt=yes
+else
+  pgac_cv_prog_CC_cflags__mpopcnt=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_c_werror_flag=$ac_save_c_werror_flag
+CFLAGS="$pgac_save_CFLAGS"
+CC="$pgac_save_CC"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__mpopcnt" >&5
+$as_echo "$pgac_cv_prog_CC_cflags__mpopcnt" >&6; }
+if test x"$pgac_cv_prog_CC_cflags__mpopcnt" = x"yes"; then
+  CFLAGS_POPCNT="${CFLAGS_POPCNT} -mpopcnt"
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -mpopcntb, for CFLAGS_POPCNT" >&5
+$as_echo_n "checking whether ${CC} supports -mpopcntb, for CFLAGS_POPCNT... " >&6; }
+if ${pgac_cv_prog_CC_cflags__mpopcntb+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+pgac_save_CC=$CC
+CC=${CC}
+CFLAGS="${CFLAGS_POPCNT} -mpopcntb"
+ac_save_c_werror_flag=$ac_c_werror_flag
+ac_c_werror_flag=yes
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_prog_CC_cflags__mpopcntb=yes
+else
+  pgac_cv_prog_CC_cflags__mpopcntb=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_c_werror_flag=$ac_save_c_werror_flag
+CFLAGS="$pgac_save_CFLAGS"
+CC="$pgac_save_CC"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__mpopcntb" >&5
+$as_echo "$pgac_cv_prog_CC_cflags__mpopcntb" >&6; }
+if test x"$pgac_cv_prog_CC_cflags__mpopcntb" = x"yes"; then
+  CFLAGS_POPCNT="${CFLAGS_POPCNT} -mpopcntb"
+fi
+
+
+
+
 CFLAGS_VECTOR=$CFLAGS_VECTOR
 
 
@@ -14111,32 +14193,28 @@ $as_echo "#define HAVE__BUILTIN_CTZ 1" >>confdefs.h
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_popcount" >&5
 $as_echo_n "checking for __builtin_popcount... " >&6; }
-if ${pgac_cv_popcount+:} false; then :
+if ${pgac_cv__builtin_popcount+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mpopcnt"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 static int x = __builtin_popcount(255);
+
 _ACEOF
 if ac_fn_c_try_compile "$LINENO"; then :
-  pgac_cv_popcount=yes
+  pgac_cv__builtin_popcount=yes
 else
-  pgac_cv_popcount=no
+  pgac_cv__builtin_popcount=no
 fi
 rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_popcount" >&5
-$as_echo "$pgac_cv_popcount" >&6; }
-if test x"$pgac_cv_popcount" = x"yes"; then
-  CFLAGS_POPCNT="-mpopcnt"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_popcount" >&5
+$as_echo "$pgac_cv__builtin_popcount" >&6; }
+if test x"$pgac_cv__builtin_popcount" = x"yes"; then
 
 $as_echo "#define HAVE__BUILTIN_POPCOUNT 1" >>confdefs.h
 
 fi
-
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_unreachable" >&5
 $as_echo_n "checking for __builtin_unreachable... " >&6; }
 if ${pgac_cv__builtin_unreachable+:} false; then :
@@ -14654,6 +14732,7 @@ $as_echo "#define LOCALE_T_IN_XLOCALE 1" >>confdefs.h
 
 fi
 
+have__builtin_popcount=$pgac_cv__builtin_popcount
 
 
 # MSVC doesn't cope well with defining restrict to __restrict, the
diff --git a/configure.in b/configure.in
index 9c4d5f0691e..4a5f1fd62c1 100644
--- a/configure.in
+++ b/configure.in
@@ -547,6 +547,11 @@ elif test "$PORTNAME" = "hpux"; then
   PGAC_PROG_CXX_CFLAGS_OPT([+Olibmerrno])
 fi
 
+# Optimization flags and options for bit-twiddling
+PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt])
+PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcntb])
+AC_SUBST(CFLAGS_POPCNT)
+
 AC_SUBST(CFLAGS_VECTOR, $CFLAGS_VECTOR)
 
 # Determine flags used to emit bitcode for JIT inlining. Need to test
@@ -1506,7 +1511,7 @@ AC_TYPE_LONG_LONG_INT
 
 PGAC_TYPE_LOCALE_T
 
-AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(have__builtin_popcount, $pgac_cv__builtin_popcount)
 
 # MSVC doesn't cope well with defining restrict to __restrict, the
 # spelling it understands, because it conflicts with
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index aa16da3e0f2..0f4dd195845 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -517,6 +517,9 @@ WIN32_STACK_RLIMIT=4194304
 # Set if we have a working win32 crashdump header
 have_win32_dbghelp = @have_win32_dbghelp@
 
+# Set if __builtin_popcount() is supported by $(CC)
+have__builtin_popcount = @have__builtin_popcount@
+
 # Pull in platform-specific magic
 include $(top_builddir)/src/Makefile.port
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 148c5550573..7b76a138f89 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -10,17 +10,177 @@
  *
  *------------------------------------------------------------------------ -
  */
-
 #ifndef PG_BITUTILS_H
 #define PG_BITUTILS_H
 
-extern int (*pg_popcount32) (uint32 word);
-extern int (*pg_popcount64) (uint64 word);
-extern int (*pg_rightmost_one32) (uint32 word);
-extern int (*pg_rightmost_one64) (uint64 word);
-extern int (*pg_leftmost_one32) (uint32 word);
-extern int (*pg_leftmost_one64) (uint64 word);
-
+extern int	(*pg_popcount32) (uint32 word);
+extern int	(*pg_popcount64) (uint64 word);
 extern uint64 pg_popcount(const char *buf, int bytes);
 
+/* in pg_bitutils_sse42.c */
+extern int	pg_popcount32_sse42(uint32 word);
+extern int	pg_popcount64_sse42(uint64 word);
+
+
+#ifndef HAVE__BUILTIN_CTZ
+/*
+ * Array marking the position of the right-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 rightmost_one_pos[256] = {
+	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
+};
+#endif							/* !HAVE__BUILTIN_CTZ */
+
+/*
+ * pg_rightmost_one32
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static inline int
+pg_rightmost_one32(uint32 word)
+{
+	int			result = 0;
+
+	Assert(word != 0);
+
+#ifdef HAVE__BUILTIN_CTZ
+	result = __builtin_ctz(word);
+#else
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+#endif							/* HAVE__BUILTIN_CTZ */
+
+	return result;
+}
+
+/*
+ * pg_rightmost_one64
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static inline int
+pg_rightmost_one64(uint64 word)
+{
+	int			result = 0;
+
+	Assert(word != 0);
+#ifdef HAVE__BUILTIN_CTZ
+#if defined(HAVE_LONG_INT_64)
+	return __builtin_ctzl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_ctzll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+#else							/* HAVE__BUILTIN_CTZ */
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+#endif
+
+	return result;
+}
+
+#ifndef HAVE__BUILTIN_CLZ
+/*
+ * Array marking the position of the left-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 leftmost_one_pos[256] = {
+	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
+	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
+};
+#endif							/* !HAVE__BUILTIN_CLZ */
+
+/*
+ * pg_leftmost_one32
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+static inline int
+pg_leftmost_one32(uint32 word)
+{
+#ifdef HAVE__BUILTIN_CLZ
+	Assert(word != 0);
+
+	return 31 - __builtin_clz(word);
+#else
+	int			shift = 32 - 8;
+
+	Assert(word != 0);
+
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+#endif							/* HAVE__BUILTIN_CLZ */
+}
+
+/*
+ * pg_leftmost_one64
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+static inline int
+pg_leftmost_one64(uint64 word)
+{
+#ifdef HAVE__BUILTIN_CLZ
+	Assert(word != 0);
+#if defined(HAVE_LONG_INT_64)
+	return 63 - __builtin_clzl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return 63 - __builtin_clzll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+#else							/* HAVE__BUILTIN_CLZ */
+	int			shift = 64 - 8;
+
+	Assert(word != 0);
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+#endif							/* HAVE__BUILTIN_CLZ */
+}
+
 #endif							/* PG_BITUTILS_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 2da73260a13..237cc625e19 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -41,6 +41,14 @@ OBJS = $(LIBOBJS) $(PG_CRC32C_OBJS) chklocale.o erand48.o inet_net_ntop.o \
 	qsort.o qsort_arg.o quotes.o snprintf.o sprompt.o strerror.o \
 	tar.o thread.o
 
+# If the compiler supports a special flag for the POPCNT instruction and it
+# has __builtin_popcount, add pg_bitutils_sse42.o.
+ifneq ($(CFLAGS_POPCNT),)
+ifeq ($(have__builtin_popcount),yes)
+OBJS += pg_bitutils_sse42.o
+endif
+endif
+
 # libpgport.a, libpgport_shlib.a, and libpgport_srv.a contain the same files
 # foo.o, foo_shlib.o, and foo_srv.o are all built from foo.c
 OBJS_SHLIB = $(OBJS:%.o=%_shlib.o)
@@ -78,10 +86,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 
-# pg_bitutils.c needs CFLAGS_POPCNT
-pg_bitutils.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+# pg_bitutils_sse42.c needs CFLAGS_POPCNT
+pg_bitutils_sse42.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_bitutils_sse42_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_bitutils_sse42_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
 
 #
 # Shared library versions of object files
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index aac394fe927..3d26883111f 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -10,7 +10,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #ifdef HAVE__GET_CPUID
@@ -23,61 +22,21 @@
 
 #include "port/pg_bitutils.h"
 
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+#ifdef HAVE__BUILTIN_POPCOUNT
 static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount32_sse42(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount64_sse42(uint64 word);
-#endif
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-static bool pg_lzcnt_available(void);
-#endif
-
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-static int pg_rightmost_one32_choose(uint32 word);
-static int pg_rightmost_one32_abm(uint32 word);
-static int pg_rightmost_one64_choose(uint64 word);
-static int pg_rightmost_one64_abm(uint64 word);
-#endif
-static int pg_rightmost_one32_slow(uint32 word);
-static int pg_rightmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-static int pg_leftmost_one32_choose(uint32 word);
-static int pg_leftmost_one32_abm(uint32 word);
-static int pg_leftmost_one64_choose(uint64 word);
-static int pg_leftmost_one64_abm(uint64 word);
-#endif
-static int pg_leftmost_one32_slow(uint32 word);
-static int pg_leftmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+static int	pg_popcount32_choose(uint32 word);
+static int	pg_popcount32_builtin(uint32 word);
+static int	pg_popcount64_choose(uint64 word);
+static int	pg_popcount64_builtin(uint64 word);
+int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 #else
-int (*pg_popcount32) (uint32 word) = pg_popcount32_slow;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_slow;
-#endif
+static int	pg_popcount32_slow(uint32 word);
+static int	pg_popcount64_slow(uint64 word);
+int			(*pg_popcount32) (uint32 word) = pg_popcount32_slow;
+int			(*pg_popcount64) (uint64 word) = pg_popcount64_slow;
+#endif							/* HAVE__BUILTIN_POPCOUNT */
 
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
-int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_choose;
-#else
-int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
-int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_slow;
-#endif
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_choose;
-int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_choose;
-#else
-int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_slow;
-int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_slow;
-#endif
 
 /* Array marking the number of 1-bits for each value of 0-255. */
 static const uint8 number_of_ones[256] = {
@@ -100,77 +59,33 @@ static const uint8 number_of_ones[256] = {
 };
 
 /*
- * Array marking the position of the right-most set bit for each value of
- * 1-255.  We count the right-most position as the 0th bit, and the
- * left-most the 7th bit.  The 0th index of the array must not be used.
+ * Return true iff we have CPUID support and it indicates that the POPCNT
+ * instruction is available.
  */
-static const uint8 rightmost_one_pos[256] = {
-	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
-};
-
-/*
- * Array marking the position of the left-most set bit for each value of
- * 1-255.  We count the right-most position as the 0th bit, and the
- * left-most the 7th bit.  The 0th index of the array must not be used.
- */
-static const uint8 leftmost_one_pos[256] = {
-	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
-	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
-};
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
 static bool
 pg_popcount_available(void)
 {
-	unsigned int exx[4] = { 0, 0, 0, 0 };
+#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
+	unsigned int exx[4] = {0, 0, 0, 0};
 
 #if defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
 #endif
 
 	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
+#else							/* HAVE__GET_CPUID || HAVE__CPUID */
+
+	return false;
 #endif
+}
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * This gets called on the first call to pg_popcount32. It replaces the
+ * function pointer so that subsequent calls are routed directly to the chosen
+ * implementation.
  */
 static int
 pg_popcount32_choose(uint32 word)
@@ -178,18 +93,17 @@ pg_popcount32_choose(uint32 word)
 	if (pg_popcount_available())
 		pg_popcount32 = pg_popcount32_sse42;
 	else
-		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount32 = pg_popcount32_builtin;
 
 	return pg_popcount32(word);
 }
 
 static int
-pg_popcount32_sse42(uint32 word)
+pg_popcount32_builtin(uint32 word)
 {
 	return __builtin_popcount(word);
 }
-#endif
-
+#else							/* HAVE__BUILTIN_POPCOUNT */
 /*
  * pg_popcount32_slow
  *		Return the number of 1 bits set in word
@@ -197,7 +111,7 @@ pg_popcount32_sse42(uint32 word)
 static int
 pg_popcount32_slow(uint32 word)
 {
-	int result = 0;
+	int			result = 0;
 
 	while (word != 0)
 	{
@@ -207,6 +121,7 @@ pg_popcount32_slow(uint32 word)
 
 	return result;
 }
+#endif
 
 /*
  * pg_popcount
@@ -215,13 +130,13 @@ pg_popcount32_slow(uint32 word)
 uint64
 pg_popcount(const char *buf, int bytes)
 {
-	uint64 popcnt = 0;
+	uint64		popcnt = 0;
 
 #if SIZEOF_VOID_P >= 8
 	/* Process in 64-bit chunks if the buffer is aligned. */
 	if (buf == (char *) TYPEALIGN(8, buf))
 	{
-		uint64 *words = (uint64 *) buf;
+		uint64	   *words = (uint64 *) buf;
 
 		while (bytes >= 8)
 		{
@@ -235,7 +150,7 @@ pg_popcount(const char *buf, int bytes)
 	/* Process in 32-bit chunks if the buffer is aligned. */
 	if (buf == (char *) TYPEALIGN(4, buf))
 	{
-		uint32 *words = (uint32 *) buf;
+		uint32	   *words = (uint32 *) buf;
 
 		while (bytes >= 4)
 		{
@@ -254,11 +169,11 @@ pg_popcount(const char *buf, int bytes)
 	return popcnt;
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * This gets called on the first call to pg_popcount64. It replaces the
+ * function pointer so that subsequent calls are routed directly to the chosen
+ * implementation.
  */
 static int
 pg_popcount64_choose(uint64 word)
@@ -266,26 +181,24 @@ pg_popcount64_choose(uint64 word)
 	if (pg_popcount_available())
 		pg_popcount64 = pg_popcount64_sse42;
 	else
-		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount64 = pg_popcount64_builtin;
 
 	return pg_popcount64(word);
 }
 
 static int
-pg_popcount64_sse42(uint64 word)
+pg_popcount64_builtin(uint64 word)
 {
 #if defined(HAVE_LONG_INT_64)
 	return __builtin_popcountl(word);
 #elif defined(HAVE_LONG_LONG_INT_64)
 	return __builtin_popcountll(word);
 #else
-	/* shouldn't happen */
 #error must have a working 64-bit integer datatype
 #endif
 }
 
-#endif
-
+#else							/* HAVE__BUILTIN_POPCOUNT */
 /*
  * pg_popcount64_slow
  *		Return the number of 1 bits set in word
@@ -293,7 +206,7 @@ pg_popcount64_sse42(uint64 word)
 static int
 pg_popcount64_slow(uint64 word)
 {
-	int result = 0;
+	int			result = 0;
 
 	while (word != 0)
 	{
@@ -303,211 +216,4 @@ pg_popcount64_slow(uint64 word)
 
 	return result;
 }
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-
-static bool
-pg_lzcnt_available(void)
-{
-
-	unsigned int exx[4] = { 0, 0, 0, 0 };
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(0x80000001, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 0x80000001);
-#else
-#error cpuid instruction not available
 #endif
-
-	return (exx[2] & (1 << 5)) != 0;	/* LZCNT */
-}
-#endif
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_rightmost_one32_choose(uint32 word)
-{
-	if (pg_lzcnt_available())
-		pg_rightmost_one32 = pg_rightmost_one32_abm;
-	else
-		pg_rightmost_one32 = pg_rightmost_one32_slow;
-
-	return pg_rightmost_one32(word);
-}
-
-static int
-pg_rightmost_one32_abm(uint32 word)
-{
-	return __builtin_ctz(word);
-}
-
-#endif
-
-/*
- * pg_rightmost_one32_slow
- *		Returns the number of trailing 0-bits in word, starting at the least
- *		significant bit position. word must not be 0.
- */
-static int
-pg_rightmost_one32_slow(uint32 word)
-{
-	int result = 0;
-
-	Assert(word != 0);
-
-	while ((word & 255) == 0)
-	{
-		word >>= 8;
-		result += 8;
-	}
-	result += rightmost_one_pos[word & 255];
-
-	return result;
-}
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_rightmost_one64_choose(uint64 word)
-{
-	if (pg_lzcnt_available())
-		pg_rightmost_one64 = pg_rightmost_one64_abm;
-	else
-		pg_rightmost_one64 = pg_rightmost_one64_slow;
-
-	return pg_rightmost_one64(word);
-}
-
-static int
-pg_rightmost_one64_abm(uint64 word)
-{
-#if defined(HAVE_LONG_INT_64)
-	return __builtin_ctzl(word);
-#elif defined(HAVE_LONG_LONG_INT_64)
-	return __builtin_ctzll(word);
-#else
-	/* shouldn't happen */
-#error must have a working 64-bit integer datatype
-#endif
-}
-#endif
-
-/*
- * pg_rightmost_one64_slow
- *		Returns the number of trailing 0-bits in word, starting at the least
- *		significant bit position. word must not be 0.
- */
-static int
-pg_rightmost_one64_slow(uint64 word)
-{
-	int result = 0;
-
-	Assert(word != 0);
-
-	while ((word & 255) == 0)
-	{
-		word >>= 8;
-		result += 8;
-	}
-	result += rightmost_one_pos[word & 255];
-
-	return result;
-}
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_leftmost_one32_choose(uint32 word)
-{
-	if (pg_lzcnt_available())
-		pg_leftmost_one32 = pg_leftmost_one32_abm;
-	else
-		pg_leftmost_one32 = pg_leftmost_one32_slow;
-
-	return pg_leftmost_one32(word);
-}
-
-static int
-pg_leftmost_one32_abm(uint32 word)
-{
-	return 31 - __builtin_clz(word);
-}
-#endif
-
-/*
- * pg_leftmost_one32_slow
- *		Returns the 0-based position of the most significant set bit in word
- *		measured from the least significant bit.  word must not be 0.
- */
-static int
-pg_leftmost_one32_slow(uint32 word)
-{
-	int			shift = 32 - 8;
-
-	Assert(word != 0);
-
-	while ((word >> shift) == 0)
-		shift -= 8;
-
-	return shift + leftmost_one_pos[(word >> shift) & 255];
-}
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_leftmost_one64_choose(uint64 word)
-{
-	if (pg_lzcnt_available())
-		pg_leftmost_one64 = pg_leftmost_one64_abm;
-	else
-		pg_leftmost_one64 = pg_leftmost_one64_slow;
-
-	return pg_leftmost_one64(word);
-}
-
-static int
-pg_leftmost_one64_abm(uint64 word)
-{
-#if defined(HAVE_LONG_INT_64)
-	return 63 - __builtin_clzl(word);
-#elif defined(HAVE_LONG_LONG_INT_64)
-	return 63 - __builtin_clzll(word);
-#else
-	/* shouldn't happen */
-#error must have a working 64-bit integer datatype
-#endif
-
-}
-#endif
-
-/*
- * pg_leftmost_one64_slow
- *		Returns the 0-based position of the most significant set bit in word
- *		measured from the least significant bit.  word must not be 0.
- */
-static int
-pg_leftmost_one64_slow(uint64 word)
-{
-	int			shift = 64 - 8;
-
-	Assert(word != 0);
-
-	while ((word >> shift) == 0)
-		shift -= 8;
-
-	return shift + leftmost_one_pos[(word >> shift) & 255];
-}
diff --git a/src/port/pg_bitutils_sse42.c b/src/port/pg_bitutils_sse42.c
new file mode 100644
index 00000000000..9945e5c103d
--- /dev/null
+++ b/src/port/pg_bitutils_sse42.c
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_bitutils_sse42.c
+ *	  CPU-optimized implementation of pg_popcount
+ *
+ * This file must be compiled with a compiler-specific flag to enable the
+ * POPCNT instruction.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_bitutils_sse42.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "port/pg_bitutils.h"
+
+int
+pg_popcount32_sse42(uint32 word)
+{
+	return __builtin_popcount(word);
+}
+
+int
+pg_popcount64_sse42(uint64 word)
+{
+#if defined(HAVE_LONG_INT_64)
+	return __builtin_popcountl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_popcountll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+}
-- 
2.17.1

#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#42)
Re: Using POPCNT and other advanced bit manipulation instructions

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Ah, I understand it now:
https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701
if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise
SIGILL or anything ... it'll just silently compute the wrong result.
That's certainly not what I call a fallback!

Yeah, that's pretty nasty; it means there's no backstop for whether
your choose function gets it right :-(
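
To make the failure mode concrete: as the linked answer explains, the LZCNT encoding is the BSR encoding with an extra prefix, and a CPU without LZCNT ignores the prefix and runs BSR. The sketch below does not execute either instruction; it emulates in portable C what each one computes for a nonzero 32-bit input, so the mismatch can be seen on any machine. The function names emulate_lzcnt32 and emulate_bsr32 are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* What LZCNT yields for nonzero input: the number of leading zero bits. */
static int
emulate_lzcnt32(uint32_t word)
{
    int         count = 0;

    assert(word != 0);
    while ((word & 0x80000000u) == 0)
    {
        word <<= 1;
        count++;
    }
    return count;
}

/* What BSR yields for nonzero input: the index of the highest set bit. */
static int
emulate_bsr32(uint32_t word)
{
    int         index = 31;

    assert(word != 0);
    while ((word & 0x80000000u) == 0)
    {
        word <<= 1;
        index--;
    }
    return index;
}
```

For a nonzero input the two results always add up to 31. With the input 0x00010000, for example, LZCNT semantics give 15 while BSR semantics give 16, so code that expects a leading-zero count but actually runs BSR gets a plausible-looking wrong answer and no fault is raised.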

Is POPCNT any better in this respect?

regards, tom lane

#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#43)
Re: Using POPCNT and other advanced bit manipulation instructions

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I noticed in Compiler Explorer that some (ancient?) Power cpus
implement instruction "popcntb", and GCC support for those uses
-mpopcntb switch enabling __builtin_popcount() to use it. I added the
switch to configure.in but I'm not sure how well that will work ... I
don't know if this is represented in buildfarm.

I experimented a bit with this on an old Apple laptop. Apple's
compiler rejects -mpopcntb altogether. FreeBSD's compiler
(gcc 4.2.1) recognizes the switch, but I could not get it to
emit the instruction, even when specifying -mcpu=power5,
which ought to enable it according to the gcc docs:

... The `-mpopcntb' option allows GCC to generate the
popcount and double precision FP reciprocal estimate instruction
implemented on the POWER5 processor and other processors that
support the PowerPC V2.02 architecture.

A more recent gcc info file also mentions

The `-mpopcntd' option
allows GCC to generate the popcount instruction implemented on the
POWER7 processor and other processors that support the PowerPC
V2.06 architecture.

but the gcc version I have on this laptop doesn't know that switch.
In any case, I'm pretty sure Apple never shipped a CPU that could
run either instruction.

I suspect that probing for either option may not be worth the
configure cycles it'd consume :-( ... there are just way too
few of those specific POWER variants out there anymore, even
granting that you have a compiler that will play along.

Moreover, you can't turn on -mpopcntb without having some POWER
equivalent to the CPUID test.

However, if you want to leave the option for this open in
future, it really makes the file name pg_bitutils_sse42.c
quite inappropriate. How about pg_bitutils_hwpopcnt.c
or something like that?

regards, tom lane

#46Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#44)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-15, Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Ah, I understand it now:
https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701
if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise
SIGILL or anything ... it'll just silently compute the wrong result.
That's certainly not what I call a fallback!

Yeah, that's pretty nasty; it means there's no backstop for whether
your choose function gets it right :-(

Hopefully other tests will fail in some visible way, though. My fear is
whether we have such systems in buildfarm.

Is POPCNT any better in this respect?

I couldn't find how POPCNT is encoded. https://stackoverflow.com/a/28803917/242383

I did find these articles:
http://danluu.com/assembly-intrinsics/
https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati

This suggests that this is all a largely pointless exercise at least on
Intel and GCC/Clang. It may be better on AMD ... but to get really
better performance we'd need to be coding the popcnt calls in assembly
rather than using the compiler intrinsics, even with -mpopcnt, because
the intrinsics suck.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#47Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#46)
1 attachment(s)
Re: Using POPCNT and other advanced bit manipulation instructions

Here's a final version that I intend to push shortly, to have time
before EOB today to handle any fallout.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v2-0001-Fix-compiler-builtin-usage-in-new-pg_bitutils.c.patch (text/x-diff; charset=us-ascii)
From 085650a174ff080f578bb289d3707173aaf4f07b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 15 Feb 2019 13:07:02 -0300
Subject: [PATCH v2] Fix compiler builtin usage in new pg_bitutils.c

Split out these new functions in three parts: one in a new file that
uses the compiler builtin and gets compiled with the -mpopcnt compiler
option if it exists; another one that uses the compiler builtin but not
the compiler option; and finally the fallback with open-coded
algorithms.

Split out the configure logic: in the original commit, it was selecting
to use the -mpopcnt compiler switch together with deciding whether to
use the compiler builtin, but those two things are really separate.
Split them out.  Also, expose whether the builtin exists to
Makefile.global, so that src/port's Makefile can decide whether to
compile the hw-optimized file.

Remove CPUID test for CTZ/CLZ.  Make pg_{right,left}most_ones use either
the compiler intrinsic or open-coded algo; trying to use the
HW-optimized version is a waste of time.  Make them static inline
functions.

Discussion: https://postgr.es/m/20190213221719.GA15976@alvherre.pgsql
---
 config/c-compiler.m4            |  22 +-
 configure                       |  66 ++++--
 configure.in                    |   6 +-
 src/Makefile.global.in          |   3 +
 src/include/port/pg_bitutils.h  | 176 ++++++++++++++-
 src/port/Makefile               |  16 +-
 src/port/pg_bitutils.c          | 378 ++++----------------------------
 src/port/pg_bitutils_hwpopcnt.c |  36 +++
 8 files changed, 327 insertions(+), 376 deletions(-)
 create mode 100644 src/port/pg_bitutils_hwpopcnt.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 05fa82518f8..7c0d52b515f 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -381,22 +381,16 @@ fi])# PGAC_C_BUILTIN_OP_OVERFLOW
 # PGAC_C_BUILTIN_POPCOUNT
 # -------------------------
 AC_DEFUN([PGAC_C_BUILTIN_POPCOUNT],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_popcount])])dnl
-AC_CACHE_CHECK([for __builtin_popcount], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mpopcnt"
-AC_COMPILE_IFELSE([AC_LANG_SOURCE(
-[static int x = __builtin_popcount(255);])],
-[Ac_cachevar=yes],
-[Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
-if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_POPCNT="-mpopcnt"
+[AC_CACHE_CHECK([for __builtin_popcount], pgac_cv__builtin_popcount,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_popcount(255);]
+)],
+[pgac_cv__builtin_popcount=yes],
+[pgac_cv__builtin_popcount=no])])
+if test x"$pgac_cv__builtin_popcount" = x"yes"; then
 AC_DEFINE(HAVE__BUILTIN_POPCOUNT, 1,
           [Define to 1 if your compiler understands __builtin_popcount.])
-fi
-undefine([Ac_cachevar])dnl
-])# PGAC_C_BUILTIN_POPCOUNT
+fi])# PGAC_C_BUILTIN_POPCOUNT
 
 
 
diff --git a/configure b/configure
index 73e9c235b69..2e3cc372a6e 100755
--- a/configure
+++ b/configure
@@ -651,7 +651,7 @@ CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
 have_win32_dbghelp
 LIBOBJS
-CFLAGS_POPCNT
+have__builtin_popcount
 UUID_LIBS
 LDAP_LIBS_BE
 LDAP_LIBS_FE
@@ -733,6 +733,7 @@ CPP
 BITCODE_CXXFLAGS
 BITCODE_CFLAGS
 CFLAGS_VECTOR
+CFLAGS_POPCNT
 PERMIT_DECLARATION_AFTER_STATEMENT
 LLVM_BINPATH
 LLVM_CXXFLAGS
@@ -6581,6 +6582,48 @@ fi
 
 fi
 
+# Optimization flags and options for bit-twiddling
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -mpopcnt, for CFLAGS_POPCNT" >&5
+$as_echo_n "checking whether ${CC} supports -mpopcnt, for CFLAGS_POPCNT... " >&6; }
+if ${pgac_cv_prog_CC_cflags__mpopcnt+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+pgac_save_CC=$CC
+CC=${CC}
+CFLAGS="${CFLAGS_POPCNT} -mpopcnt"
+ac_save_c_werror_flag=$ac_c_werror_flag
+ac_c_werror_flag=yes
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_prog_CC_cflags__mpopcnt=yes
+else
+  pgac_cv_prog_CC_cflags__mpopcnt=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_c_werror_flag=$ac_save_c_werror_flag
+CFLAGS="$pgac_save_CFLAGS"
+CC="$pgac_save_CC"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__mpopcnt" >&5
+$as_echo "$pgac_cv_prog_CC_cflags__mpopcnt" >&6; }
+if test x"$pgac_cv_prog_CC_cflags__mpopcnt" = x"yes"; then
+  CFLAGS_POPCNT="${CFLAGS_POPCNT} -mpopcnt"
+fi
+
+
+
+
 CFLAGS_VECTOR=$CFLAGS_VECTOR
 
 
@@ -14111,32 +14154,28 @@ $as_echo "#define HAVE__BUILTIN_CTZ 1" >>confdefs.h
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_popcount" >&5
 $as_echo_n "checking for __builtin_popcount... " >&6; }
-if ${pgac_cv_popcount+:} false; then :
+if ${pgac_cv__builtin_popcount+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mpopcnt"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 static int x = __builtin_popcount(255);
+
 _ACEOF
 if ac_fn_c_try_compile "$LINENO"; then :
-  pgac_cv_popcount=yes
+  pgac_cv__builtin_popcount=yes
 else
-  pgac_cv_popcount=no
+  pgac_cv__builtin_popcount=no
 fi
 rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_popcount" >&5
-$as_echo "$pgac_cv_popcount" >&6; }
-if test x"$pgac_cv_popcount" = x"yes"; then
-  CFLAGS_POPCNT="-mpopcnt"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_popcount" >&5
+$as_echo "$pgac_cv__builtin_popcount" >&6; }
+if test x"$pgac_cv__builtin_popcount" = x"yes"; then
 
 $as_echo "#define HAVE__BUILTIN_POPCOUNT 1" >>confdefs.h
 
 fi
-
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_unreachable" >&5
 $as_echo_n "checking for __builtin_unreachable... " >&6; }
 if ${pgac_cv__builtin_unreachable+:} false; then :
@@ -14654,6 +14693,7 @@ $as_echo "#define LOCALE_T_IN_XLOCALE 1" >>confdefs.h
 
 fi
 
+have__builtin_popcount=$pgac_cv__builtin_popcount
 
 
 # MSVC doesn't cope well with defining restrict to __restrict, the
diff --git a/configure.in b/configure.in
index 9c4d5f0691e..e12d5b14f56 100644
--- a/configure.in
+++ b/configure.in
@@ -547,6 +547,10 @@ elif test "$PORTNAME" = "hpux"; then
   PGAC_PROG_CXX_CFLAGS_OPT([+Olibmerrno])
 fi
 
+# Optimization flags and options for bit-twiddling
+PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt])
+AC_SUBST(CFLAGS_POPCNT)
+
 AC_SUBST(CFLAGS_VECTOR, $CFLAGS_VECTOR)
 
 # Determine flags used to emit bitcode for JIT inlining. Need to test
@@ -1506,7 +1510,7 @@ AC_TYPE_LONG_LONG_INT
 
 PGAC_TYPE_LOCALE_T
 
-AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(have__builtin_popcount, $pgac_cv__builtin_popcount)
 
 # MSVC doesn't cope well with defining restrict to __restrict, the
 # spelling it understands, because it conflicts with
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index aa16da3e0f2..0f4dd195845 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -517,6 +517,9 @@ WIN32_STACK_RLIMIT=4194304
 # Set if we have a working win32 crashdump header
 have_win32_dbghelp = @have_win32_dbghelp@
 
+# Set if __builtin_popcount() is supported by $(CC)
+have__builtin_popcount = @have__builtin_popcount@
+
 # Pull in platform-specific magic
 include $(top_builddir)/src/Makefile.port
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 148c5550573..70aae5128fa 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -10,17 +10,177 @@
  *
  *------------------------------------------------------------------------ -
  */
-
 #ifndef PG_BITUTILS_H
 #define PG_BITUTILS_H
 
-extern int (*pg_popcount32) (uint32 word);
-extern int (*pg_popcount64) (uint64 word);
-extern int (*pg_rightmost_one32) (uint32 word);
-extern int (*pg_rightmost_one64) (uint64 word);
-extern int (*pg_leftmost_one32) (uint32 word);
-extern int (*pg_leftmost_one64) (uint64 word);
-
+extern int	(*pg_popcount32) (uint32 word);
+extern int	(*pg_popcount64) (uint64 word);
 extern uint64 pg_popcount(const char *buf, int bytes);
 
+/* in pg_bitutils_hwpopcnt.c */
+extern int	pg_popcount32_hw(uint32 word);
+extern int	pg_popcount64_hw(uint64 word);
+
+
+#ifndef HAVE__BUILTIN_CTZ
+/*
+ * Array marking the position of the right-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 rightmost_one_pos[256] = {
+	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
+	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
+};
+#endif							/* !HAVE__BUILTIN_CTZ */
+
+/*
+ * pg_rightmost_one32
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static inline int
+pg_rightmost_one32(uint32 word)
+{
+	int			result = 0;
+
+	Assert(word != 0);
+
+#ifdef HAVE__BUILTIN_CTZ
+	result = __builtin_ctz(word);
+#else
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+#endif							/* HAVE__BUILTIN_CTZ */
+
+	return result;
+}
+
+/*
+ * pg_rightmost_one64
+ *		Returns the number of trailing 0-bits in word, starting at the least
+ *		significant bit position. word must not be 0.
+ */
+static inline int
+pg_rightmost_one64(uint64 word)
+{
+	int			result = 0;
+
+	Assert(word != 0);
+#ifdef HAVE__BUILTIN_CTZ
+#if defined(HAVE_LONG_INT_64)
+	return __builtin_ctzl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_ctzll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+#else							/* HAVE__BUILTIN_CTZ */
+	while ((word & 255) == 0)
+	{
+		word >>= 8;
+		result += 8;
+	}
+	result += rightmost_one_pos[word & 255];
+#endif
+
+	return result;
+}
+
+#ifndef HAVE__BUILTIN_CLZ
+/*
+ * Array marking the position of the left-most set bit for each value of
+ * 1-255.  We count the right-most position as the 0th bit, and the
+ * left-most the 7th bit.  The 0th index of the array must not be used.
+ */
+static const uint8 leftmost_one_pos[256] = {
+	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
+	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
+	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
+};
+#endif							/* !HAVE__BUILTIN_CLZ */
+
+/*
+ * pg_leftmost_one32
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+static inline int
+pg_leftmost_one32(uint32 word)
+{
+#ifdef HAVE__BUILTIN_CLZ
+	Assert(word != 0);
+
+	return 31 - __builtin_clz(word);
+#else
+	int			shift = 32 - 8;
+
+	Assert(word != 0);
+
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+#endif							/* HAVE__BUILTIN_CLZ */
+}
+
+/*
+ * pg_leftmost_one64
+ *		Returns the 0-based position of the most significant set bit in word
+ *		measured from the least significant bit.  word must not be 0.
+ */
+static inline int
+pg_leftmost_one64(uint64 word)
+{
+#ifdef HAVE__BUILTIN_CLZ
+	Assert(word != 0);
+#if defined(HAVE_LONG_INT_64)
+	return 63 - __builtin_clzl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return 63 - __builtin_clzll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+#else							/* HAVE__BUILTIN_CLZ */
+	int			shift = 64 - 8;
+
+	Assert(word != 0);
+	while ((word >> shift) == 0)
+		shift -= 8;
+
+	return shift + leftmost_one_pos[(word >> shift) & 255];
+#endif							/* !HAVE__BUILTIN_CLZ */
+}
+
 #endif							/* PG_BITUTILS_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 2da73260a13..a7f8fd2e668 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -41,6 +41,14 @@ OBJS = $(LIBOBJS) $(PG_CRC32C_OBJS) chklocale.o erand48.o inet_net_ntop.o \
 	qsort.o qsort_arg.o quotes.o snprintf.o sprompt.o strerror.o \
 	tar.o thread.o
 
+# If the compiler supports a special flag for the POPCOUNT instruction and it
+# has __builtin_popcount, add pg_bitutils_hwpopcnt.o.
+ifneq ($(CFLAGS_POPCNT),)
+ifeq ($(have__builtin_popcount),yes)
+OBJS += pg_bitutils_hwpopcnt.o
+endif
+endif
+
 # libpgport.a, libpgport_shlib.a, and libpgport_srv.a contain the same files
 # foo.o, foo_shlib.o, and foo_srv.o are all built from foo.c
 OBJS_SHLIB = $(OBJS:%.o=%_shlib.o)
@@ -78,10 +86,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_ARMV8_CRC32C)
 
-# pg_bitutils.c needs CFLAGS_POPCNT
-pg_bitutils.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_bitutils_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+# all versions of pg_bitutils_hwpopcnt.c need CFLAGS_POPCNT
+pg_bitutils_hwpopcnt.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_bitutils_hwpopcnt_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_bitutils_hwpopcnt_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
 
 #
 # Shared library versions of object files
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index aac394fe927..97bfcebe4e1 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -10,7 +10,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"
 
 #ifdef HAVE__GET_CPUID
@@ -23,61 +22,21 @@
 
 #include "port/pg_bitutils.h"
 
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
+#ifdef HAVE__BUILTIN_POPCOUNT
 static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount32_sse42(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount64_sse42(uint64 word);
-#endif
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-static bool pg_lzcnt_available(void);
-#endif
-
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-static int pg_rightmost_one32_choose(uint32 word);
-static int pg_rightmost_one32_abm(uint32 word);
-static int pg_rightmost_one64_choose(uint64 word);
-static int pg_rightmost_one64_abm(uint64 word);
-#endif
-static int pg_rightmost_one32_slow(uint32 word);
-static int pg_rightmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-static int pg_leftmost_one32_choose(uint32 word);
-static int pg_leftmost_one32_abm(uint32 word);
-static int pg_leftmost_one64_choose(uint64 word);
-static int pg_leftmost_one64_abm(uint64 word);
-#endif
-static int pg_leftmost_one32_slow(uint32 word);
-static int pg_leftmost_one64_slow(uint64 word);
-
-#if defined(HAVE__BUILTIN_POPCOUNT) && defined(HAVE__GET_CPUID)
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+static int	pg_popcount32_choose(uint32 word);
+static int	pg_popcount32_builtin(uint32 word);
+static int	pg_popcount64_choose(uint64 word);
+static int	pg_popcount64_builtin(uint64 word);
+int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 #else
-int (*pg_popcount32) (uint32 word) = pg_popcount32_slow;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_slow;
-#endif
+static int	pg_popcount32_slow(uint32 word);
+static int	pg_popcount64_slow(uint64 word);
+int			(*pg_popcount32) (uint32 word) = pg_popcount32_slow;
+int			(*pg_popcount64) (uint64 word) = pg_popcount64_slow;
+#endif							/* !HAVE__BUILTIN_POPCOUNT */
 
-#if defined(HAVE__BUILTIN_CTZ) && defined(HAVE__GET_CPUID)
-int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
-int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_choose;
-#else
-int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
-int (*pg_rightmost_one64) (uint64 word) = pg_rightmost_one64_slow;
-#endif
-
-#if defined(HAVE__BUILTIN_CLZ) && defined(HAVE__GET_CPUID)
-int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_choose;
-int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_choose;
-#else
-int (*pg_leftmost_one32) (uint32 word) = pg_leftmost_one32_slow;
-int (*pg_leftmost_one64) (uint64 word) = pg_leftmost_one64_slow;
-#endif
 
 /* Array marking the number of 1-bits for each value of 0-255. */
 static const uint8 number_of_ones[256] = {
@@ -100,96 +59,51 @@ static const uint8 number_of_ones[256] = {
 };
 
 /*
- * Array marking the position of the right-most set bit for each value of
- * 1-255.  We count the right-most position as the 0th bit, and the
- * left-most the 7th bit.  The 0th index of the array must not be used.
+ * Return true iff we have CPUID support and it indicates that the POPCNT
+ * instruction is available.
  */
-static const uint8 rightmost_one_pos[256] = {
-	0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
-	4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
-};
-
-/*
- * Array marking the position of the left-most set bit for each value of
- * 1-255.  We count the right-most position as the 0th bit, and the
- * left-most the 7th bit.  The 0th index of the array must not be used.
- */
-static const uint8 leftmost_one_pos[256] = {
-	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
-	4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
-	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7
-};
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
 static bool
 pg_popcount_available(void)
 {
-	unsigned int exx[4] = { 0, 0, 0, 0 };
+#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
+	unsigned int exx[4] = {0, 0, 0, 0};
 
 #if defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
 #endif
 
 	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
+#else							/* HAVE__GET_CPUID || HAVE__CPUID */
+
+	return false;
 #endif
+}
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * This gets called on the first call to pg_popcount32. It replaces the
+ * function pointer so that subsequent calls are routed directly to the chosen
+ * implementation.
  */
 static int
 pg_popcount32_choose(uint32 word)
 {
 	if (pg_popcount_available())
-		pg_popcount32 = pg_popcount32_sse42;
+		pg_popcount32 = pg_popcount32_hw;
 	else
-		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount32 = pg_popcount32_builtin;
 
 	return pg_popcount32(word);
 }
 
 static int
-pg_popcount32_sse42(uint32 word)
+pg_popcount32_builtin(uint32 word)
 {
 	return __builtin_popcount(word);
 }
-#endif
-
+#else							/* HAVE__BUILTIN_POPCOUNT */
 /*
  * pg_popcount32_slow
  *		Return the number of 1 bits set in word
@@ -197,7 +111,7 @@ pg_popcount32_sse42(uint32 word)
 static int
 pg_popcount32_slow(uint32 word)
 {
-	int result = 0;
+	int			result = 0;
 
 	while (word != 0)
 	{
@@ -207,6 +121,7 @@ pg_popcount32_slow(uint32 word)
 
 	return result;
 }
+#endif
 
 /*
  * pg_popcount
@@ -215,13 +130,13 @@ pg_popcount32_slow(uint32 word)
 uint64
 pg_popcount(const char *buf, int bytes)
 {
-	uint64 popcnt = 0;
+	uint64		popcnt = 0;
 
 #if SIZEOF_VOID_P >= 8
 	/* Process in 64-bit chunks if the buffer is aligned. */
 	if (buf == (char *) TYPEALIGN(8, buf))
 	{
-		uint64 *words = (uint64 *) buf;
+		uint64	   *words = (uint64 *) buf;
 
 		while (bytes >= 8)
 		{
@@ -235,7 +150,7 @@ pg_popcount(const char *buf, int bytes)
 	/* Process in 32-bit chunks if the buffer is aligned. */
 	if (buf == (char *) TYPEALIGN(4, buf))
 	{
-		uint32 *words = (uint32 *) buf;
+		uint32	   *words = (uint32 *) buf;
 
 		while (bytes >= 4)
 		{
@@ -254,38 +169,36 @@ pg_popcount(const char *buf, int bytes)
 	return popcnt;
 }
 
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT)
-
+#ifdef HAVE__BUILTIN_POPCOUNT
 /*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
+ * This gets called on the first call to pg_popcount64. It replaces the
+ * function pointer so that subsequent calls are routed directly to the chosen
+ * implementation.
  */
 static int
 pg_popcount64_choose(uint64 word)
 {
 	if (pg_popcount_available())
-		pg_popcount64 = pg_popcount64_sse42;
+		pg_popcount64 = pg_popcount64_hw;
 	else
-		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount64 = pg_popcount64_builtin;
 
 	return pg_popcount64(word);
 }
 
 static int
-pg_popcount64_sse42(uint64 word)
+pg_popcount64_builtin(uint64 word)
 {
 #if defined(HAVE_LONG_INT_64)
 	return __builtin_popcountl(word);
 #elif defined(HAVE_LONG_LONG_INT_64)
 	return __builtin_popcountll(word);
 #else
-	/* shouldn't happen */
 #error must have a working 64-bit integer datatype
 #endif
 }
 
-#endif
-
+#else							/* HAVE__BUILTIN_POPCOUNT */
 /*
  * pg_popcount64_slow
  *		Return the number of 1 bits set in word
@@ -293,7 +206,7 @@ pg_popcount64_sse42(uint64 word)
 static int
 pg_popcount64_slow(uint64 word)
 {
-	int result = 0;
+	int			result = 0;
 
 	while (word != 0)
 	{
@@ -303,211 +216,4 @@ pg_popcount64_slow(uint64 word)
 
 	return result;
 }
-
-#if defined(HAVE__GET_CPUID) && (defined(HAVE__BUILTIN_CTZ) || defined(HAVE__BUILTIN_CLZ))
-
-static bool
-pg_lzcnt_available(void)
-{
-
-	unsigned int exx[4] = { 0, 0, 0, 0 };
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(0x80000001, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 0x80000001);
-#else
-#error cpuid instruction not available
 #endif
-
-	return (exx[2] & (1 << 5)) != 0;	/* LZCNT */
-}
-#endif
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_rightmost_one32_choose(uint32 word)
-{
-	if (pg_lzcnt_available())
-		pg_rightmost_one32 = pg_rightmost_one32_abm;
-	else
-		pg_rightmost_one32 = pg_rightmost_one32_slow;
-
-	return pg_rightmost_one32(word);
-}
-
-static int
-pg_rightmost_one32_abm(uint32 word)
-{
-	return __builtin_ctz(word);
-}
-
-#endif
-
-/*
- * pg_rightmost_one32_slow
- *		Returns the number of trailing 0-bits in word, starting at the least
- *		significant bit position. word must not be 0.
- */
-static int
-pg_rightmost_one32_slow(uint32 word)
-{
-	int result = 0;
-
-	Assert(word != 0);
-
-	while ((word & 255) == 0)
-	{
-		word >>= 8;
-		result += 8;
-	}
-	result += rightmost_one_pos[word & 255];
-
-	return result;
-}
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CTZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_rightmost_one64_choose(uint64 word)
-{
-	if (pg_lzcnt_available())
-		pg_rightmost_one64 = pg_rightmost_one64_abm;
-	else
-		pg_rightmost_one64 = pg_rightmost_one64_slow;
-
-	return pg_rightmost_one64(word);
-}
-
-static int
-pg_rightmost_one64_abm(uint64 word)
-{
-#if defined(HAVE_LONG_INT_64)
-	return __builtin_ctzl(word);
-#elif defined(HAVE_LONG_LONG_INT_64)
-	return __builtin_ctzll(word);
-#else
-	/* shouldn't happen */
-#error must have a working 64-bit integer datatype
-#endif
-}
-#endif
-
-/*
- * pg_rightmost_one64_slow
- *		Returns the number of trailing 0-bits in word, starting at the least
- *		significant bit position. word must not be 0.
- */
-static int
-pg_rightmost_one64_slow(uint64 word)
-{
-	int result = 0;
-
-	Assert(word != 0);
-
-	while ((word & 255) == 0)
-	{
-		word >>= 8;
-		result += 8;
-	}
-	result += rightmost_one_pos[word & 255];
-
-	return result;
-}
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_leftmost_one32_choose(uint32 word)
-{
-	if (pg_lzcnt_available())
-		pg_leftmost_one32 = pg_leftmost_one32_abm;
-	else
-		pg_leftmost_one32 = pg_leftmost_one32_slow;
-
-	return pg_leftmost_one32(word);
-}
-
-static int
-pg_leftmost_one32_abm(uint32 word)
-{
-	return 31 - __builtin_clz(word);
-}
-#endif
-
-/*
- * pg_leftmost_one32_slow
- *		Returns the 0-based position of the most significant set bit in word
- *		measured from the least significant bit.  word must not be 0.
- */
-static int
-pg_leftmost_one32_slow(uint32 word)
-{
-	int			shift = 32 - 8;
-
-	Assert(word != 0);
-
-	while ((word >> shift) == 0)
-		shift -= 8;
-
-	return shift + leftmost_one_pos[(word >> shift) & 255];
-}
-
-#if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_CLZ)
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static int
-pg_leftmost_one64_choose(uint64 word)
-{
-	if (pg_lzcnt_available())
-		pg_leftmost_one64 = pg_leftmost_one64_abm;
-	else
-		pg_leftmost_one64 = pg_leftmost_one64_slow;
-
-	return pg_leftmost_one64(word);
-}
-
-static int
-pg_leftmost_one64_abm(uint64 word)
-{
-#if defined(HAVE_LONG_INT_64)
-	return 63 - __builtin_clzl(word);
-#elif defined(HAVE_LONG_LONG_INT_64)
-	return 63 - __builtin_clzll(word);
-#else
-	/* shouldn't happen */
-#error must have a working 64-bit integer datatype
-#endif
-
-}
-#endif
-
-/*
- * pg_leftmost_one64_slow
- *		Returns the 0-based position of the most significant set bit in word
- *		measured from the least significant bit.  word must not be 0.
- */
-static int
-pg_leftmost_one64_slow(uint64 word)
-{
-	int			shift = 64 - 8;
-
-	Assert(word != 0);
-
-	while ((word >> shift) == 0)
-		shift -= 8;
-
-	return shift + leftmost_one_pos[(word >> shift) & 255];
-}
diff --git a/src/port/pg_bitutils_hwpopcnt.c b/src/port/pg_bitutils_hwpopcnt.c
new file mode 100644
index 00000000000..516efd586dd
--- /dev/null
+++ b/src/port/pg_bitutils_hwpopcnt.c
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_bitutils_hwpopcnt.c
+ *	  CPU-optimized implementation of pg_popcount variants
+ *
+ * This file must be compiled with a compiler-specific flag to enable the
+ * POPCNT instruction.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_bitutils_hwpopcnt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "port/pg_bitutils.h"
+
+int
+pg_popcount32_hw(uint32 word)
+{
+	return __builtin_popcount(word);
+}
+
+int
+pg_popcount64_hw(uint64 word)
+{
+#if defined(HAVE_LONG_INT_64)
+	return __builtin_popcountl(word);
+#elif defined(HAVE_LONG_LONG_INT_64)
+	return __builtin_popcountll(word);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+}
-- 
2.17.1
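
The first-call dispatch used by pg_leftmost_one64_choose in the hunk above can be sketched in isolation. This is only an illustration under stated assumptions, not the patch's code: fast_path_available() is a hypothetical stand-in for a CPUID-based check such as pg_lzcnt_available(), and the demo_ names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

int demo_leftmost_one64_choose(uint64_t word);
int demo_leftmost_one64_fast(uint64_t word);
int demo_leftmost_one64_slow(uint64_t word);

/* Starts out pointing at the chooser; overwritten on the first call. */
int (*demo_leftmost_one64) (uint64_t word) = demo_leftmost_one64_choose;

/*
 * Hypothetical stand-in for a run-time CPU feature test.  Assumption for
 * this sketch: always report the fast instruction as unavailable.
 */
static int
fast_path_available(void)
{
	return 0;
}

/* Runs once, then routes every later call straight to the chosen function. */
int
demo_leftmost_one64_choose(uint64_t word)
{
	if (fast_path_available())
		demo_leftmost_one64 = demo_leftmost_one64_fast;
	else
		demo_leftmost_one64 = demo_leftmost_one64_slow;

	return demo_leftmost_one64(word);
}

/* Builtin-based variant (GCC/Clang); word must not be 0. */
int
demo_leftmost_one64_fast(uint64_t word)
{
	assert(word != 0);
	return 63 - __builtin_clzll(word);
}

/* Portable variant: count how many right shifts empty the word. */
int
demo_leftmost_one64_slow(uint64_t word)
{
	int			pos = 0;

	assert(word != 0);
	while (word >>= 1)
		pos++;
	return pos;
}
```

Callers always go through the demo_leftmost_one64 pointer, so only the very first call pays for the feature test.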

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#47)
Re: Using POPCNT and other advanced bit manipulation instructions

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Here's a final version that I intend to push shortly, to have time
before EOB today to handle any fallout.

I think this is likely to result in a lot of complaints about
rightmost_one_pos[] being unreferenced, in non-HAVE__BUILTIN_CTZ
builds. Probably that has to be an extern rather than static
in the header. leftmost_one_pos[] likewise.

I might have a go at improving the configure tests later ---
I still don't like that they're compile-time-optimizable.
But that can wait.

regards, tom lane
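
The extern-versus-static point can be illustrated with a reduced example. This is a sketch under assumptions, not the committed layout: it uses a hypothetical 16-entry table, demo_leftmost_one_pos, in place of the 256-entry leftmost_one_pos[], and folds the "header" and ".c" parts into one listing.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Header part: declaration only.  A "static const" definition here would
 * give every including file its own private copy, which is then unused
 * in files that never take the table-lookup path.
 */
extern const uint8_t demo_leftmost_one_pos[16];

/* Exactly one .c file holds the definition.  Entry 0 is a placeholder. */
const uint8_t demo_leftmost_one_pos[16] = {
	0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3
};

/*
 * Table-driven fallback for builds without a suitable builtin.
 * Returns the 0-based position of the most significant set bit;
 * word must not be 0.
 */
int
demo_leftmost_one32(uint32_t word)
{
	int			shift = 32 - 4;

	assert(word != 0);

	/* Skip leading all-zero nibbles, then look up the first nonzero one. */
	while ((word >> shift) == 0)
		shift -= 4;

	return shift + demo_leftmost_one_pos[(word >> shift) & 15];
}
```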

#49Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#35)
Re: Using POPCNT and other advanced bit manipulation instructions

Hi,

On 2019-02-14 16:45:38 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:

Hah, I just realized you have to add -mlzcnt in order for these builtins
to use the lzcnt instructions. It goes from something like

bsrq %rax, %rax
xorq $63, %rax

I'm confused how this is a general count leading zero operation? Did you
use constants or something that allowed it to infer a range in the test? If
so the compiler probably did some optimizations allowing it to do the
above.

No. If you compile

int myclz(unsigned long long x)
{
	return __builtin_clzll(x);
}

at -O2, on just about any x86_64 gcc, you will get

myclz:
.LFB1:
	.cfi_startproc
	bsrq	%rdi, %rax
	xorq	$63, %rax
	ret
	.cfi_endproc

Yea, sorry for the noise. I misremembered the bsrq mnemonic.

bsr has a latency of three cycles, xor of one. lzcnt a latency of
three. So it's mildly faster to use lzcnt (it uses fewer ports, and has
a shorter latency). But I doubt we have code where that's noticeable.

Greetings,

Andres Freund
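
A note on why the bsrq/xorq pair quoted above yields a leading-zero count: bsr produces the index of the most significant set bit, and for any index in 0..63, XOR with 63 gives the same result as subtracting from 63. The sketch below checks that identity in plain C. The demo_ function names are invented for this example, and the bit-scan is emulated with a loop rather than the actual instruction.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Emulates what bsr computes: the 0-based index of the most significant
 * set bit.  x must not be 0 (bsr's result is undefined for 0 as well).
 */
int
demo_msb_index(uint64_t x)
{
	int			idx = 0;

	assert(x != 0);
	while (x >>= 1)
		idx++;
	return idx;
}

/* Leading-zero count in the bsr-then-xor form the compiler emitted. */
int
demo_clz_via_xor(uint64_t x)
{
	return demo_msb_index(x) ^ 63;
}

/* The same count written as a subtraction, for comparison. */
int
demo_clz_via_sub(uint64_t x)
{
	return 63 - demo_msb_index(x);
}
```

Both forms agree with __builtin_clzll() for every nonzero input, because an index in 0..63 fits in six bits and flipping all six bits equals subtracting from 63.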