Optimize Arm64 crc32c implementation in Postgresql
Hi all
Currently PostgreSQL only implements hardware support for CRC32 checksums on the x86_64 architecture.
Some ARMv8 (AArch64) CPUs implement the CRC32 extension, which can be used via inline assembly,
so they can also benefit from hardware acceleration in I/O-intensive workloads.
I would like to propose a patch to optimize crc32c calculation with Arm64-specific instructions.
The hardware-specific code is guarded by #if defined USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK.
Performance is improved on platforms such as Cortex-A57, Cortex-A72, and Cortex-A73.
I'll create a CommitFest entry for this submission.
Any comments or feedback are welcome.
BRs,
Yuqi
Attachments:
0001-Optimize-Arm64-crc32c-implementation-in-Postgresql.patch (application/octet-stream)
On Wed, Jan 10, 2018 at 05:58:19AM +0000, Yuqi Gu wrote:
I would like to propose the patch to optimize crc32c calculation with
Arm64 specific instructions. The hardware-specific code
implementation is used under #if defined
USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK. And the performance is improved
on platforms: cortex-A57, cortex-A72, cortex-A73, etc.
I'll create a CommitFests ticket for this submission.
Any comments or feedback are welcome.
Nice! I have not looked at your patch, but +1. There are not enough
patches for ARM.
--
Michael
Hi,
On 2018-01-10 05:58:19 +0000, Yuqi Gu wrote:
Currently PostgreSQL only implements hardware support for CRC32 checksums for the x86_64 architecture.
Some ARMv8 (AArch64) CPUs implement the CRC32 extension which is implemented by inline assembly,
so they can also benefit from hardware acceleration in IO-intensive workloads.
I would like to propose the patch to optimize crc32c calculation with Arm64 specific instructions.
The hardware-specific code implementation is used under #if defined USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK.
And the performance is improved on platforms: cortex-A57, cortex-A72, cortex-A73, etc.
I'll create a CommitFests ticket for this submission.
Any comments or feedback are welcome.
Could you show whether it actually improves performance? Usually bulk
loading data with parallel COPYs is a good way to hit this codepath.
+
+AC_DEFUN([PGAC_ARM64CE_CRC32_INTRINSICS],
+[AC_CACHE_CHECK([for Arm64ce CRC32], [pgac_cv_arm64ce_crc32_intrinsics],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+  [unsigned int arm_flag = 0;
+#if defined(__ARM_ARCH) && (__ARM_ARCH > 7)
+  arm_flag = 1;
+#endif
+  return arm_flag == 1;])],
+  [pgac_cv_arm64ce_crc32_intrinsics="yes"],
+  [pgac_cv_arm64ce_crc32_intrinsics="no"])])
+if test x"$pgac_cv_arm64ce_crc32_intrinsics" = x"yes"; then
+  pgac_arm64ce_crc32_intrinsics=yes
+fi
+])# PGAC_ARM64CE_CRC32_INTRINSICS
I don't understand what this check is supposed to be doing?
AC_LINK_IFELSE doesn't run the program, so I don't see this test
achieving anything at all?
* Use slicing-by-8 algorithm.

diff --git a/src/port/pg_crc32c_choose.c b/src/port/pg_crc32c_choose.c
index 40bee67..d3682ad 100644
--- a/src/port/pg_crc32c_choose.c
+++ b/src/port/pg_crc32c_choose.c
@@ -29,6 +29,20 @@
 #include "port/pg_crc32c.h"
+#ifdef USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK
+#include <sys/auxv.h>
+#include <asm/hwcap.h>
+#ifndef HWCAP_CRC32
+#define HWCAP_CRC32 (1 << 7)
+#endif
+static bool
+pg_crc32c_arm64ce_available(void) {
+    unsigned long auxv = getauxval(AT_HWCAP);
+    return (auxv & HWCAP_CRC32) != 0;
+}
+
+#else
What's the availability of these headers and functions on non-linux platforms?
+#if defined(USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK)
+asm(".arch_extension crc");
So this is emitted globally?
+#define LDP(x,y,p) asm("ldp %x[a], %x[b], [%x[c]], #16" : [a]"=r"(x),[b]"=r"(y),[c]"+r"(p))
+/* CRC32C: Castagnoli polynomial 0x1EDC6F41 */
+#define CRC32CX(crc,value) asm("crc32cx %w[c], %w[c], %x[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
+#define CRC32CW(crc,value) asm("crc32cw %w[c], %w[c], %w[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
+#define CRC32CH(crc,value) asm("crc32ch %w[c], %w[c], %w[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
+#define CRC32CB(crc,value) asm("crc32cb %w[c], %w[c], %w[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
+
+pg_crc32c
+pg_comp_crc32c_arm64(pg_crc32c crc, const void* data, size_t len) {
+    uint64 p0, p1;
+    pg_crc32c crc32_c = crc;
+    long length = len;
+    const unsigned char *p_buf = data;
+
+    /* Allow crc instructions in asm */
+    asm(".cpu generic+crc");
Hm, this switches it for the rest of the function, program, ...?
Greetings,
Andres Freund
Hi,
On 2018-01-10 15:09:16 +0900, Michael Paquier wrote:
There are not enough patches for ARM.
Nah, not needing arch specific patches is good ;)
Greetings,
Andres Freund
On Fri, Mar 2, 2018 at 10:36 AM, Andres Freund <andres@anarazel.de> wrote:
On 2018-01-10 05:58:19 +0000, Yuqi Gu wrote:
+#ifdef USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK
+#include <sys/auxv.h>
+#include <asm/hwcap.h>
+#ifndef HWCAP_CRC32
+#define HWCAP_CRC32 (1 << 7)
+#endif
+static bool
+pg_crc32c_arm64ce_available(void) {
+    unsigned long auxv = getauxval(AT_HWCAP);
+    return (auxv & HWCAP_CRC32) != 0;
+}
+
+#else
What's the availability of these headers and functions on non-linux platforms?
FWIW I don't think that'll work on FreeBSD. I don't have an arm64
system to test on right now, but I can see that there is no
getauxval() like glibc's. FreeBSD *might* provide the same sort of
information via procstat_getauxv() from libprocstat, but I think maybe
not because I don't see any trace of HWCAP_CRC32 in the tree and I see
a different approach to testing the CPU ID registers in eg
libkern/crc32.c.
So... that stuff probably needs either a configure check for the
getauxval function and/or those headers, or an OS check?
While I'm looking at this:
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
Why? Doesn't something << 62 have the same value and type as
(something << 31) << 31?
+ --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run]
What is this for?
+ if (length & sizeof(uint16)) {
+ CRC32CH(crc32_c, *(uint16*)p_buf);
+ p_buf += sizeof(uint16);
+ }
+
+ if (length & sizeof(uint8)) {
+ CRC32CB(crc32_c, *p_buf);
+ }
From the department of trivialities, our coding style has braces like this:
if (length & sizeof(uint16))
{
CRC32CH(crc32_c, *(uint16*)p_buf);
p_buf += sizeof(uint16);
}
if (length & sizeof(uint8))
CRC32CB(crc32_c, *p_buf);
--
Thomas Munro
http://www.enterprisedb.com
On 2018-03-02 11:37:52 +1300, Thomas Munro wrote:
So... that stuff probably needs either a configure check for the
getauxval function and/or those headers, or an OS check?
It'd probably be better to not rely on os specific headers, and instead
directly access the capabilities.
While I'm looking at this:
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
Why? Doesn't something << 62 have the same value and type as
(something << 31) << 31?
+ --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run]
What is this for?
Those probably are damage from using a distribution autoconf rather than
stock autoconf.
Greetings,
Andres Freund
On Thu, Mar 01, 2018 at 01:36:23PM -0800, Andres Freund wrote:
On 2018-01-10 15:09:16 +0900, Michael Paquier wrote:
There are not enough patches for ARM.
Nah, not needing arch specific patches is good ;)
You know already that I am always impressed by your skills in
normalizing things where needed.
--
Michael
Hi,
Thanks for all your comments.
Could you show whether it actually improves performance? Usually bulk loading data with parallel COPYs is a good way to hit this code path.
The mini benchmark code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>
#include "port/pg_crc32c.h"
long int GetTickCount() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000000 + tv.tv_usec;
}
int main() {
static const uint64_t kSize = 1024 * 1024 + 29;
uint8_t* buf = (uint8_t *)malloc(sizeof(uint8_t) * kSize);
uint32_t i;
srand(0);
for (i = 0; i < kSize; i++) {
buf[i] = (uint8_t)(rand() % 256u);
}
uint32_t kLoop = 1024;
long int start, end;
uint32_t crc = 0xFFFFFFFF;
start = GetTickCount();
for (i = 0; i < kLoop; i++) {
COMP_CRC32C(crc, buf, kSize);
}
end = GetTickCount();
crc ^= 0xFFFFFFFF;
if (kSize < 20) {
for (i = 0; i < kSize; i++) {
printf("%3u,", (uint32_t)buf[i]);
}
printf("\n");
}
printf("crc result = %x, time cost per loop:%f ms\n", crc, (double)(end - start) / kLoop);
free(buf);
return 0;
}
The result shows that the optimization gets a 4.5x speedup on Cortex-A72.
What's the availability of these headers and functions on non-linux platforms?
This Arm64 optimization code only supports Linux so far.
asm(".arch_extension crc");
#define CRC32CB(crc,value) asm("crc32cb %w[c], %w[c], %w[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
It just enables the crc extension for the GNU assembler.
'crc32cb' might not be recognized on Windows or the BSDs.
So, is it reasonable to add an OS check for the Arm64 crc32 optimization?
That would mean the Arm64 crc32 interface is only used on Linux-based OSes.
From the department of trivialities, our coding style has braces like this:
if (length & sizeof(uint16))
{
CRC32CH(crc32_c, *(uint16*)p_buf);
p_buf += sizeof(uint16);
}
if (length & sizeof(uint8))
CRC32CB(crc32_c, *p_buf);
Got it. I'll modify it.
+ --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run]
What is this for?
Those probably are damage from using a distribution autoconf rather than stock autoconf.
What does "stock autoconf" mean?
Should the configure script be generated by a specific version of autoconf?
Thanks!
BRs,
Yuqi
Hi,
On 2018-03-02 09:42:22 +0000, Yuqi Gu wrote:
Could you show whether it actually improves performance? Usually bulk loading data with parallel COPYs is a good way to hit this code path.
The mini benchmark code:
I'd be more interested in a benchmark using postgres itself...
What's the availability of these headers and functions on non-linux platforms?
This Arm64 optimization code only supports Linux so far.
That's not ok for postgres, unfortunately...
What does it mean "stock autoconf" ?
Should the configure script be made by specific version autoconf ?
Yes, it's the version from the autoconf project, rather than with
modifications by distributions. Don't worry about it, the committer can
take care of that.
- Andres
On 02/03/18 06:42, Andres Freund wrote:
On 2018-03-02 11:37:52 +1300, Thomas Munro wrote:
So... that stuff probably needs either a configure check for the
getauxval function and/or those headers, or an OS check?
It'd probably be better to not rely on os specific headers, and instead
directly access the capabilities.
Anyone got an idea on how to do that? I googled around a bit, but
couldn't find any examples. All the examples I could find were very
Linux-specific, and used getauxval(), except for this in the FreeBSD
kernel itself:
https://github.com/freebsd/freebsd/blob/master/sys/libkern/crc32.c#L775.
I'm no expert on FreeBSD, but that doesn't seem suitable for use in a
user program.
In any case, I reworked this patch to follow the example of the existing
code more closely. Notable changes:
* Use compiler intrinsics instead of inline assembly.
* If the target architecture has them, use the CRC instructions without
a runtime check. You'll get that if you use "-march=armv8.1-a", for
example, as the CRC Extension was made mandatory in ARM v8.1. This
should work even on FreeBSD or other non-Linux systems, where
getauxval() is not available.
* I removed the loop to handle two uint64's at a time, using the LDP
instruction. I couldn't find a compiler intrinsic for that, and it was
actually slower, at least on the system I have access to, than a
straightforward loop that processes 8 bytes at a time.
* I tested this on Linux, with gcc and clang, on an ARM64 virtual
machine that I had available (not an emulator, but a VM on a shared
ARM64 server).
- Heikki
Attachments:
arm64ce-crc32c-1.patch (text/x-patch)
Hi,
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
On 02/03/18 06:42, Andres Freund wrote:
On 2018-03-02 11:37:52 +1300, Thomas Munro wrote:
So... that stuff probably needs either a configure check for the
getauxval function and/or those headers, or an OS check?
It'd probably be better to not rely on os specific headers, and instead
directly access the capabilities.
Anyone got an idea on how to do that? I googled around a bit, but couldn't
find any examples.
Similar...
* Use compiler intrinsics instead of inline assembly.
+many
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
- Andres
On 01/04/18 20:32, Andres Freund wrote:
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
I just ran a small test with pg_waldump, similar to what Abhijit
Menon-Sen ran with the Slicing-by-8 and Intel SSE patches, when we added
those
(/messages/by-id/20141119155811.GA32492@toroid.org).
I ran pgbench, with scale factor 5, until it had generated about 1 GB of
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it
took about 7 s, and with the special CPU instructions, about 5 s. 'perf'
showed that the CRC computation took about 30% of the CPU time before,
and about 12% after, which sounds about right. That's not as big a
speedup as we saw with the corresponding Intel SSE instructions back in
2014, but still quite worthwhile.
- Heikki
Hi,
On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote:
On 01/04/18 20:32, Andres Freund wrote:
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen
ran with the Slicing-by-8 and Intel SSE patches, when we added those
(/messages/by-id/20141119155811.GA32492@toroid.org).
I ran pgbench, with scale factor 5, until it had generated about 1 GB of
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took
about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed
that the CRC computation took about 30% of the CPU time before, and about
12% after, which sounds about right. That's not as big a speedup as we saw
with the corresponding Intel SSE instructions back in 2014, but still quite
worthwhile.
Cool. Based on a skim the patch looks reasonable. It's a bit sad that
it's effectively Linux-specific. But I'm not sure we can do anything
about that atm, given the state of the "discovery" APIs on various
platforms.
Greetings,
Andres Freund
On 03/04/18 19:09, Andres Freund wrote:
Hi,
On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote:
On 01/04/18 20:32, Andres Freund wrote:
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen
ran with the Slicing-by-8 and Intel SSE patches, when we added those
(/messages/by-id/20141119155811.GA32492@toroid.org).
I ran pgbench, with scale factor 5, until it had generated about 1 GB of
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took
about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed
that the CRC computation took about 30% of the CPU time before, and about
12% after, which sounds about right. That's not as big a speedup as we saw
with the corresponding Intel SSE instructions back in 2014, but still quite
worthwhile.Cool. Based on a skim the patch looks reasonable.
Thanks.
I bikeshedded with myself on the naming of things, and decided to use
"ARMv8" in the variable and file names, instead of ARM64 or ARMCE or
ARM64CE. The CRC instructions were introduced in ARM v8 (as an optional
feature); they're not really related to the 64-bitness, even though the
64-bit instruction set was also introduced in ARM v8. Other than that,
and some comment fixes, this is the same as the previous patch version.
I was just about to commit this, when I started to wonder: Do we need to
worry about alignment? As the patch stands, it will merrily do unaligned
8-byte loads. Is that OK on ARM? It seems to work on the system I've
been testing on, but I don't know. And even if it's OK, would it perform
better if we did 1-byte loads in the beginning, until we reach the
8-byte boundary?
- Heikki
Attachments:
armv8-crc32c-2.patch (text/x-patch)
Hi,
On 2018-04-03 19:38:42 +0300, Heikki Linnakangas wrote:
I was just about to commit this, when I started to wonder: Do we need to
worry about alignment? As the patch stands, it will merrily do unaligned
8-byte loads. Is that OK on ARM? It seems to work on the system I've been
testing on, but I don't know. And even if it's OK, would it perform better
if we did 1-byte loads in the beginning, until we reach the 8-byte boundary?
Architecture manual time? They're available freely IIRC and should
answer this.
- Andres
Heikki Linnakangas <hlinnaka@iki.fi> writes:
I was just about to commit this, when I started to wonder: Do we need to
worry about alignment? As the patch stands, it will merrily do unaligned
8-byte loads. Is that OK on ARM? It seems to work on the system I've
been testing on, but I don't know. And even if it's OK, would it perform
better if we did 1-byte loads in the beginning, until we reach the
8-byte boundary?
I'm pretty sure that some ARM platforms emulate unaligned access through
kernel trap handlers, which would certainly make this a lot slower than
handling the unaligned bytes manually. Maybe that doesn't apply to any
ARM CPU that has this instruction ... but as you said, it'd be better
to consider the presence of the instruction as orthogonal to other
CPU features.
regards, tom lane
On 03 Apr 2018, at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03/04/18 19:09, Andres Freund wrote:
Hi,
On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote:
On 01/04/18 20:32, Andres Freund wrote:
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen
ran with the Slicing-by-8 and Intel SSE patches, when we added those
(/messages/by-id/20141119155811.GA32492@toroid.org).
I ran pgbench, with scale factor 5, until it had generated about 1 GB of
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took
about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed
that the CRC computation took about 30% of the CPU time before, and about
12% after, which sounds about right. That's not as big a speedup as we saw
with the corresponding Intel SSE instructions back in 2014, but still quite
worthwhile.
Cool. Based on a skim the patch looks reasonable.
Thanks.
I bikeshedded with myself on the naming of things, and decided to use "ARMv8" in the variable and file names, instead of ARM64 or ARMCE or ARM64CE. The CRC instructions were introduced in ARM v8 (as an optional feature), it's not really related to the 64-bitness, even though the 64-bit instruction set was also introduced in ARM v8. Other than that, and some comment fixes, this is the same as the previous patch version.
Noticed two minor documentation issues in a quick skim of the attached patch:
The following line should say SSE and not SSSE (and the same typo is in
src/include/pg_config.h.in and src/include/pg_config.h.win32). While not
introduced in this patch, it’s adjacent to the patched codepath and on topic so
may well be fixed while in there.
AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSSE 4.2 CRC instructions with a runtime check.])
The documentation in configure.in uses "runtime" rather than "run time" in all
lines except this one:
+# uses the CRC instructions, compile both, and select at run time.
cheers ./daniel
On Wed, Apr 4, 2018 at 4:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
[armv8-crc32c-2.patch]
This tests OK on my Debian buster aarch64 system (the machine that
runs "eelpout" in the buildfarm), configure tests as expected and I
confirmed that pg_comp_crc32c_armv8 is reached at runtime.
I hope we can figure out a more portable way to detect the
instructions, or failing that a way to detect them on FreeBSD in
userspace. I'll try to send a patch for PG12 if I get a chance.
No opinion on the unaligned memory access question.
--
Thomas Munro
http://www.enterprisedb.com
On 03/04/18 21:56, Daniel Gustafsson wrote:
The following line should say SSE and not SSSE (and the same typo is in
src/include/pg_config.h.in and src/include/pg_config.h.win32). While not
introduced in this patch, it’s adjacent to the patched codepath and on topic so
may well be fixed while in there.
AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSSE 4.2 CRC instructions with a runtime check.])
I pushed that as a separate commit, as that goes back to 9.5. Also, I
noticed that the description of USE_SLICING_BY_8_CRC32C was completely
wrong, fixed that too. Thanks!
- Heikki
On 03/04/18 19:43, Andres Freund wrote:
Architecture manual time? They're available freely IIRC and should
answer this.
Yeah. The best reference I could find was "ARM Cortex-A Series
Programmer’s Guide for ARMv8-A"
(http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/ch08s01.html).
In the "Porting to A64" section, it says:
Data and code must be aligned to appropriate boundaries. The
alignment of accesses can affect performance on ARM cores and can
represent a portability problem when moving code from an earlier
architecture to ARMv8-A. It is worth being aware of alignment issues
for performance reasons, or when porting code that makes assumptions
about pointers or 32-bit and 64-bit integer variables.
I was a bit surprised by the "must be aligned to appropriate boundaries"
statement. Googling around, the strict alignment requirement was removed
in ARMv7, and since then, unaligned access works similarly to Intel. I
think there are some special instructions, like atomic ops, that require
alignment though. Perhaps that's what that sentence refers to.
On 03/04/18 20:47, Tom Lane wrote:
I'm pretty sure that some ARM platforms emulate unaligned access through
kernel trap handlers, which would certainly make this a lot slower than
handling the unaligned bytes manually. Maybe that doesn't apply to any
ARM CPU that has this instruction ... but as you said, it'd be better
to consider the presence of the instruction as orthogonal to other
CPU features.
I did some quick testing, and found that unaligned access is about 2x
slower than aligned. I don't think it's being trapped by the kernel, I
think that would be even slower, but clearly there is an effect there.
So I added code to process the first 1-7 bytes separately, so that the
main loop runs on 8-byte aligned addresses.
Pushed, thanks everyone!
- Heikki