CRC32C Parallel Computation Optimization on ARM
Hi all
This patch uses a parallel computation optimization to improve CRC32C performance on ARM. The algorithm comes from the Intel whitepaper "crc-iscsi-polynomial-crc32-instruction-paper". Input data is divided into three equal-sized blocks: three parallel blocks (crc0, crc1, crc2) per 1024 bytes, where one block is 42 (BLK_LENGTH) * 8 (the step length of crc32c_u64) bytes.
Crc32c unit test: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
Crc32c benchmark: https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
It gets a ~2x speedup compared to serial use of the ARM CRC32C instructions.
I'll create a Commitfest ticket for this submission.
Any comments or feedback are welcome.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Attachments:
0001-crc32c-parallel-computation-optimization-on-arm.patch (application/octet-stream, +259 −13)
On Fri, Oct 20, 2023 at 07:08:58AM +0000, Xiang Gao wrote:
This patch uses a parallel computing optimization algorithm to
improve crc32c computing performance on ARM. The algorithm comes
from Intel whitepaper: crc-iscsi-polynomial-crc32-instruction-paper.
Input data is divided into three equal-sized blocks. Three parallel
blocks (crc0, crc1, crc2) for 1024 bytes. One block: 42 (BLK_LENGTH) *
8 (step length: crc32c_u64) bytes.
Crc32c unit test: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
Crc32c benchmark: https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
It gets ~2x speedup compared to linear Arm crc32c instructions.
Interesting. Could you attach to this thread the test files you
used and the results obtained, please? If this data gets deleted from
GitHub, it would not be possible to refer back to what you did and
the related benchmark results.
Note that your patch is forgetting about meson; it just patches
./configure.
--
Michael
On Fri, Oct 20, 2023 at 05:18:56PM +0900, Michael Paquier wrote:
On Fri, Oct 20, 2023 at 07:08:58AM +0000, Xiang Gao wrote:
This patch uses a parallel computing optimization algorithm to
improve crc32c computing performance on ARM. The algorithm comes
from Intel whitepaper: crc-iscsi-polynomial-crc32-instruction-paper.
Input data is divided into three equal-sized blocks. Three parallel
blocks (crc0, crc1, crc2) for 1024 bytes. One block: 42 (BLK_LENGTH) *
8 (step length: crc32c_u64) bytes.
Crc32c unit test: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
Crc32c benchmark: https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
It gets ~2x speedup compared to linear Arm crc32c instructions.
Interesting. Could you attach to this thread the test files you
used and the results obtained, please? If this data gets deleted from
GitHub, it would not be possible to refer back to what you did and
the related benchmark results.
Note that your patch is forgetting about meson; it just patches
./configure.
I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions). There was almost
no change for the 512-byte case, but there was a ~60% speedup for the
4096-byte case.
However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
benchmark [0]. I haven't had a chance to dig further, unfortunately.
Assuming I'm not doing something wrong, I don't think such a result should
necessarily disqualify this optimization, though.
[0]: /messages/by-id/ec487192-f6aa-509a-cacb-6642dad14209@iki.fi
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:
I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions). There was almost
no change for the 512-byte case, but there was a ~60% speedup for the
4096-byte case.
However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
benchmark [0]. I haven't had a chance to dig further, unfortunately.
Assuming I'm not doing something wrong, I don't think such a result should
necessarily disqualify this optimization, though.
Actually, since the pg_waldump benchmark likely only involves very small
WAL records, it would make sense that there isn't much difference.
*facepalm*
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 25/10/2023 00:18, Nathan Bossart wrote:
On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:
I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions). There was almost
no change for the 512-byte case, but there was a ~60% speedup for the
4096-byte case.
However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
benchmark [0]. I haven't had a chance to dig further, unfortunately.
Assuming I'm not doing something wrong, I don't think such a result should
necessarily disqualify this optimization, though.
Actually, since the pg_waldump benchmark likely only involves very small
WAL records, it would make sense that there isn't much difference.
*facepalm*
No need to guess, pg_waldump -z will tell you what the record size is.
And you can vary it by changing the checkpoint interval and/or pgbench
scale factor: if you checkpoint frequently or if the database is larger,
you get more full-page images which makes the records larger on average,
and vice versa.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Wed, Oct 25, 2023 at 12:37:45AM +0300, Heikki Linnakangas wrote:
On 25/10/2023 00:18, Nathan Bossart wrote:
Actually, since the pg_waldump benchmark likely only involves very small
WAL records, it would make sense that there isn't much difference.
*facepalm*
No need to guess, pg_waldump -z will tell you what the record size is. And
you can vary it by changing the checkpoint interval and/or pgbench scale
factor: if you checkpoint frequently or if the database is larger, you get
more full-page images which makes the records larger on average, and vice
versa.
If you are looking at computing the CRC of records with arbitrary
sizes, why not just generate a series with
pg_logical_emit_message() before doing a comparison with pg_waldump or
a custom replay loop to go through the records? At least that would
make the results more predictable.
--
Michael
On Wed, Oct 25, 2023 at 07:17:55AM +0900, Michael Paquier wrote:
If you are looking at computing the CRC of records with arbitrary
sizes, why not just generate a series with
pg_logical_emit_message() before doing a comparison with pg_waldump or
a custom replay loop to go through the records? At least that would
make the results more predictable.
I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).
Given the performance characteristics and relative simplicity of the patch,
I think this could be worth doing. I suspect we'll want to do something
similar for x86, too.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Thanks for your suggestion, this is the modified patch and two test files.
-----Original Message-----
From: Michael Paquier <michael@paquier.xyz>
Sent: Friday, October 20, 2023 4:19 PM
To: Xiang Gao <Xiang.Gao@arm.com>
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: CRC32C Parallel Computation Optimization on ARM
On Fri, Oct 20, 2023 at 07:08:58AM +0000, Xiang Gao wrote:
This patch uses a parallel computing optimization algorithm to improve
crc32c computing performance on ARM. The algorithm comes from Intel
whitepaper:
crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided
into three equal-sized blocks. Three parallel blocks (crc0, crc1,
crc2) for 1024 bytes. One block: 42 (BLK_LENGTH) * 8 (step length:
crc32c_u64) bytes.
Crc32c unit test:
https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
Crc32c benchmark:
https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
It gets ~2x speedup compared to linear Arm crc32c instructions.
Interesting. Could you attach to this thread the test files you used and the results obtained, please? If this data gets deleted from GitHub, it would not be possible to refer back to what you did and the related benchmark results.
Note that your patch is forgetting about meson; it just patches ./configure.
--
Michael
+pg_crc32c
+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)
It looks like most of this function is duplicated from
pg_comp_crc32c_armv8(). I understand that we probably need a separate
function because of the runtime check, but perhaps we could create a common
static inline helper function with a branch for when vmull_p64() can be
used. Its callers would then just provide a boolean to indicate which
branch to take.
+# Use ARM VMULL if available and ARM CRC32C intrinsic is available too.
+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+ if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+ USE_ARMV8_VMULL=1
+ fi
+fi
Hm. I wonder if we need to switch to a runtime check in some cases. For
example, what happens if the ARMv8 intrinsics used today are found with the
default compiler flags, but vmull_p64() is only available if
-march=armv8-a+crypto is added? It looks like the precedent is to use a
runtime check if we need extra CFLAGS to produce code that uses the
intrinsics.
Separately, I wonder if we should just always do runtime checks for the CRC
stuff whenever we can produce code with the intrinsics, regardless of
whether we need extra CFLAGS. The check doesn't look terribly expensive,
and it might allow us to simplify the code a bit (especially now that we
support a few different architectures).
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, 25 Oct, 2023 at 10:43:25 -0500, Nathan Bossart wrote:
+pg_crc32c
+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)
It looks like most of this function is duplicated from
pg_comp_crc32c_armv8(). I understand that we probably need a separate
function because of the runtime check, but perhaps we could create a common
static inline helper function with a branch for when vmull_p64() can be
used. Its callers would then just provide a boolean to indicate which
branch to take.
I have modified and remade the patch.
+# Use ARM VMULL if available and ARM CRC32C intrinsic is available too.
+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+    USE_ARMV8_VMULL=1
+  fi
+fi
Hm. I wonder if we need to switch to a runtime check in some cases. For
example, what happens if the ARMv8 intrinsics used today are found with the
default compiler flags, but vmull_p64() is only available if
-march=armv8-a+crypto is added? It looks like the precedent is to use a
runtime check if we need extra CFLAGS to produce code that uses the
intrinsics.
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can succeed.
A runtime check will be done when choosing which algorithm to use.
You can think of it as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Separately, I wonder if we should just always do runtime checks for the CRC
stuff whenever we can produce code with the intrinsics, regardless of
whether we need extra CFLAGS. The check doesn't look terribly expensive,
and it might allow us to simplify the code a bit (especially now that we
support a few different architectures).
Yes, I think so. USE_ARMV8_CRC32C only means that the compilation succeeded;
it does not guarantee that the code can run correctly on the local machine.
Therefore, a runtime check is required during actual operation.
Based on the principle of minimal changes, we plan to address that in the next patch.
If the community agrees, we will continue to improve it later, for example by merging the x86 and ARM code.
Attachments:
0003-crc32c-parallel-computation-optimization-on-arm.patch (application/octet-stream, +246 −15)
On Tue, 24 Oct 2023 20:45:39 -0500, Nathan Bossart wrote:
I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).
Could you please provide details on how to generate the ~8kB and ~16kB records? Thanks!
On Thu, Oct 26, 2023 at 2:23 PM Xiang Gao <Xiang.Gao@arm.com> wrote:
On Tue, 24 Oct 2023 20:45:39 -0500, Nathan Bossart wrote:
I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).
Could you please provide details on how to generate the ~8kB and ~16kB records? Thanks!
Here's a script that I use to generate WAL records of various sizes,
change it to taste if useful:
for m in 16 64 256 1024 4096 8192 16384
do
    echo "Start of run with WAL size $m bytes at:"
    date
    echo "SELECT pg_logical_emit_message(true, 'mymessage', repeat('d', $m));" \
        > "$JUMBO/scripts/dumbo$m.sql"
    for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096
    do
        "$PGWORKSPACE/pgbench" -n postgres -c"$c" -j"$c" -T60 \
            -f "$JUMBO/scripts/dumbo$m.sql" > "$JUMBO/results/dumbo$m:$c.out"
    done
    echo "End of run with WAL size $m bytes at:"
    date
    echo
done
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 26, 2023 at 07:28:35AM +0000, Xiang Gao wrote:
On Wed, 25 Oct, 2023 at 10:43:25 -0500, Nathan Bossart wrote:
+# Use ARM VMULL if available and ARM CRC32C intrinsic is available too.
+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+    USE_ARMV8_VMULL=1
+  fi
+fi
Hm. I wonder if we need to switch to a runtime check in some cases. For
example, what happens if the ARMv8 intrinsics used today are found with the
default compiler flags, but vmull_p64() is only available if
-march=armv8-a+crypto is added? It looks like the precedent is to use a
runtime check if we need extra CFLAGS to produce code that uses the
intrinsics.
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can succeed.
A runtime check will be done when choosing which algorithm to use.
You can think of it as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Oh. Looking again, I see that we are using a runtime check for ARM in all
cases with this patch. If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check). I suspect there are other
opportunities to simplify things, too.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 26, 2023 at 08:53:31AM +0000, Xiang Gao wrote:
On Tue, 24 Oct 2023 20:45:39 -0500, Nathan Bossart wrote:
I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).
Could you please provide details on how to generate the ~8kB and ~16kB records? Thanks!
I did something like
do $$
begin
for i in 1..1000000
loop
perform pg_logical_emit_message(false, 'test', repeat('0123456789', 800));
end loop;
end;
$$;
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, 26 Oct 2023 11:37:52 -0500, Nathan Bossart wrote:
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can be successful.
A runtime check will be done when choosing which algorithm.
You can think of us as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Oh. Looking again, I see that we are using a runtime check for ARM in all
cases with this patch. If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check). I suspect there are other
opportunities to simplify things, too.
Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.
Attachments:
0004-crc32c-parallel-computation-optimization-on-arm.patch (application/octet-stream, +288 −148)
On Fri, Oct 27, 2023 at 07:01:10AM +0000, Xiang Gao wrote:
On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can be successful.
A runtime check will be done when choosing which algorithm.
You can think of it as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Oh. Looking again, I see that we are using a runtime check for ARM in all
cases with this patch. If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check). I suspect there are other
opportunities to simplify things, too.
Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.
Thanks. I went ahead and split this prerequisite part out to a separate
thread [0] since it's sort-of unrelated to your proposal here. It's not
really a prerequisite, but I do think it will simplify things a bit.
[0]: /messages/by-id/20231030161706.GA3011@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Oct 30, 2023 at 11:21:43AM -0500, Nathan Bossart wrote:
On Fri, Oct 27, 2023 at 07:01:10AM +0000, Xiang Gao wrote:
On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can be successful.
A runtime check will be done when choosing which algorithm.
You can think of it as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Oh. Looking again, I see that we are using a runtime check for ARM in all
cases with this patch. If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check). I suspect there are other
opportunities to simplify things, too.
Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.
Thanks. I went ahead and split this prerequisite part out to a separate
thread [0] since it's sort-of unrelated to your proposal here. It's not
really a prerequisite, but I do think it will simplify things a bit.
Per the other thread [0], we should try to avoid the runtime check when
possible, as it seems to produce a small regression. This means that if
the ARMv8 CRC instructions are found with the default compiler flags, we
can only use vmull_p64() if it can also be used with the default flags.
Otherwise, we can just do the runtime check.
[0]: /messages/by-id/2620794.1698783160@sss.pgh.pa.us
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, 31 Oct 2023 15:48:21 -0500, Nathan Bossart wrote:
Thanks. I went ahead and split this prerequisite part out to a separate
thread [0] since it's sort-of unrelated to your proposal here. It's not
really a prerequisite, but I do think it will simplify things a bit.
Per the other thread [0], we should try to avoid the runtime check when
possible, as it seems to produce a small regression. This means that if
the ARMv8 CRC instructions are found with the default compiler flags, we
can only use vmull_p64() if it can also be used with the default flags.
Otherwise, we can just do the runtime check.
After reading the discussion, I understand that in order to avoid performance
regressions in some instances, we need to try our best to avoid runtime checks.
I don't know if I understand it correctly.
If so, we need to check whether to use the ARM CRC32C and VMULL instructions
directly or with a runtime check. There will be many scenarios here and the code
will be more complicated.
Could you please give me some suggestions about how to refine this patch?
Thanks very much!
On Thu, Nov 02, 2023 at 06:17:20AM +0000, Xiang Gao wrote:
After reading the discussion, I understand that in order to avoid performance
regression in some instances, we need to try our best to avoid runtime checks.
I don't know if I understand it correctly.
The idea is that we don't want to start forcing runtime checks on builds
where we aren't already doing runtime checks. IOW if the compiler can use
the ARMv8 CRC instructions with the default compiler flags, we should only
use vmull_p64() if it can also be used with the default compiler flags. I
suspect this limitation sounds worse than it actually is in practice. The
vast majority of the buildfarm uses runtime checks, and at least some of
the platforms that don't, such as the Apple M-series machines, seem to
include +crypto by default.
Of course, if a compiler picks up +crc but not +crypto in its defaults, we
could lose the vmull_p64() optimization on that platform. But as noted in
the other thread, we can revisit if these kinds of hypothetical situations
become reality.
Could you please give me some suggestions about how to refine this patch?
Of course. I think we'll ultimately want to independently check for the
availability of the new instruction like we do for the other sets of
intrinsics:
PGAC_ARMV8_VMULL_INTRINSICS([])
if test x"$pgac_armv8_vmull_intrinsics" != x"yes"; then
PGAC_ARMV8_VMULL_INTRINSICS([-march=armv8-a+crypto])
fi
My current thinking is that we'll want to add USE_ARMV8_VMULL and
USE_ARMV8_VMULL_WITH_RUNTIME_CHECK and use those to decide exactly what to
compile. I'll admit I haven't fully thought through every detail yet, but
I'm cautiously optimistic that we can avoid too much complexity in the
autoconf/meson scripts.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Nov 02, 2023 at 09:35:50AM -0500, Nathan Bossart wrote:
On Thu, Nov 02, 2023 at 06:17:20AM +0000, Xiang Gao wrote:
After reading the discussion, I understand that in order to avoid performance
regression in some instances, we need to try our best to avoid runtime checks.
I don't know if I understand it correctly.
The idea is that we don't want to start forcing runtime checks on builds
where we aren't already doing runtime checks. IOW if the compiler can use
the ARMv8 CRC instructions with the default compiler flags, we should only
use vmull_p64() if it can also be used with the default compiler flags. I
suspect this limitation sounds worse than it actually is in practice. The
vast majority of the buildfarm uses runtime checks, and at least some of
the platforms that don't, such as the Apple M-series machines, seem to
include +crypto by default.
Of course, if a compiler picks up +crc but not +crypto in its defaults, we
could lose the vmull_p64() optimization on that platform. But as noted in
the other thread, we can revisit if these kinds of hypothetical situations
become reality.
Could you please give me some suggestions about how to refine this patch?
Of course. I think we'll ultimately want to independently check for the
availability of the new instruction like we do for the other sets of
intrinsics:
PGAC_ARMV8_VMULL_INTRINSICS([])
if test x"$pgac_armv8_vmull_intrinsics" != x"yes"; then
  PGAC_ARMV8_VMULL_INTRINSICS([-march=armv8-a+crypto])
fi
My current thinking is that we'll want to add USE_ARMV8_VMULL and
USE_ARMV8_VMULL_WITH_RUNTIME_CHECK and use those to decide exactly what to
compile. I'll admit I haven't fully thought through every detail yet, but
I'm cautiously optimistic that we can avoid too much complexity in the
autoconf/meson scripts.
Thank you so much!
This is the newest patch. I think the code for choosing which CRC algorithm to use is a bit complicated. Maybe we could use only USE_ARMV8_VMULL and always do a runtime check for the vmull_p64 instruction. This would not affect existing builds, because this is a new instruction and new logic. In addition, it would also reduce the complexity of the code.
Very much looking forward to receiving your suggestions, thank you!