Proposal for Updating CRC32C with AVX-512 Algorithm.

Started by Amonson, Paul D, over 1 year ago; 63 messages
#1Amonson, Paul D
paul.d.amonson@intel.com

Hi,

Comparing the current SSE4.2 implementation of the CRC32C algorithm in Postgres to an optimized AVX-512 algorithm [0], we observed significant gains. The result was a ~6.6X average speedup for buffers of 256 bytes or more, measured on 3 different Intel products. Details below. The AVX-512 algorithm in C is a port of the ISA-L library [1] assembler code.

Workload call size distribution details (write heavy):
* Average was approximately 1,010 bytes per call
* ~80% of the calls were under 256 bytes
* ~20% of the calls were greater than or equal to 256 bytes up to the max buffer size of 8192

The 256-byte threshold is important because, if the buffer is smaller, it makes sense to fall back to the existing implementation. This is because the AVX-512 algorithm needs a minimum of 256 bytes to operate.
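
As a rough illustration of that fallback (a sketch only, not the posted patch), the dispatcher below sends buffers shorter than 256 bytes to the existing path and everything else to the new one. Both paths are stood in for by a portable bitwise CRC32C, and every function name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold taken from the post: the AVX-512 path needs at least 256 bytes. */
#define AVX512_MIN_LEN 256

/* Portable bit-at-a-time CRC32C (reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hypothetical stand-in for the existing SSE 4.2 implementation. */
static uint32_t
crc32c_small_path(uint32_t crc, const void *data, size_t len)
{
    return crc32c_bitwise(crc, data, len);
}

/* Hypothetical stand-in for the AVX-512 implementation. */
static uint32_t
crc32c_large_path(uint32_t crc, const void *data, size_t len)
{
    assert(len >= AVX512_MIN_LEN);  /* precondition stated in the post */
    return crc32c_bitwise(crc, data, len);
}

/* Route each call by length; the running CRC is passed in and returned. */
uint32_t
crc32c_dispatch(uint32_t crc, const void *data, size_t len)
{
    if (len < AVX512_MIN_LEN)
        return crc32c_small_path(crc, data, len);
    return crc32c_large_path(crc, data, len);
}
```

A caller would start from 0xFFFFFFFF and XOR the result with 0xFFFFFFFF at the end, as PostgreSQL's INIT_CRC32C and FIN_CRC32C macros do.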

Using the above workload data distribution,
at 0% calls < 256 bytes, an 841% improvement on average for crc32c functionality was observed.
at 50% calls < 256 bytes, a 758% improvement on average for crc32c functionality was observed.
at 90% calls < 256 bytes, a 44% improvement on average for crc32c functionality was observed.
at 97.6% calls < 256 bytes, the workload's crc32c performance breaks even.
at 100% calls < 256 bytes, a 14% regression is seen when using the AVX-512 implementation.

The results above are averages over 3 machines, measured on Intel Sapphire Rapids bare metal and on two AWS EC2 instances: Intel Sapphire Rapids (m7i.2xlarge) and Intel Ice Lake (m6i.2xlarge).

Summary Data (Sapphire Rapids bare metal, AWS m7i-2xl, and AWS m6i-2xl):
+---------------------+-------------------+-------------------+-------------------+--------------------+
| Rates in Bytes/us   |     Bare Metal    |    AWS m6i-2xl    |   AWS m7i-2xl     |                    |
| (Larger is Better)  +---------+---------+---------+---------+---------+---------+ Overall Multiplier |
|                     | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 |                    |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Numbers 256-8192    |  12,046 |  83,196 |   7,471 |  39,965 |  11,867 |  84,589 |        6.62        |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Numbers 64 - 255    |  16,865 |  15,909 |   9,209 |   7,363 |  12,496 |  10,046 |        0.86        |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
                                                    |  Weighted Multiplier [*]    |        1.44        |
                                                    +-----------------------------+--------------------+
There was no evidence of AVX-512 frequency throttling from perf data, which stayed steady during the test.
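
The weighted multiplier, the break-even point and the 100% case quoted above are consistent with a simple linear blend of the two per-range multipliers in the table (0.86 and 6.62). The sketch below works through that arithmetic. The function names are hypothetical, and the linear model is an assumption that reproduces only those three figures.

```c
#include <assert.h>

/*
 * Blended multiplier when a fraction p_small of calls is shorter than
 * 256 bytes (multiplier m_small) and the rest are 256 bytes or longer
 * (multiplier m_large).
 */
double
weighted_multiplier(double p_small, double m_small, double m_large)
{
    assert(p_small >= 0.0 && p_small <= 1.0);
    return p_small * m_small + (1.0 - p_small) * m_large;
}

/*
 * Fraction of small calls at which the blend equals 1.0, that is,
 * solve p * m_small + (1 - p) * m_large = 1 for p.
 */
double
break_even_fraction(double m_small, double m_large)
{
    assert(m_large != m_small);
    return (m_large - 1.0) / (m_large - m_small);
}
```

With the table's values, 0.90 * 0.86 + 0.10 * 6.62 = 1.436, which rounds to the 1.44 weighted multiplier, and (6.62 - 1) / (6.62 - 0.86) is about 0.976, the 97.6% break-even point.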

Feedback on this proposed improvement is appreciated. Some questions:
1) This AVX-512 ISA-L derived code uses the BSD 3-Clause license [2]. Is this compatible with the PostgreSQL License [3]? Both appear to be very permissive licenses, but I am not an expert on licenses.
2) Is there a preferred benchmark I should run to test this change?

If licensing is a non-issue, I can post the initial patch along with my Postgres benchmark function patch for further review.

Thanks,
Paul

[0]: https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
[1]: https://github.com/intel/isa-l
[2]: https://opensource.org/license/bsd-3-clause
[3]: https://opensource.org/license/postgresql

[*] Weights used were 90% of requests less than 256 bytes, 10% greater than or equal to 256 bytes.

#2Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#1)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi, forgive the top-post but I have not seen any response to this post?

Thanks,
Paul

-----Original Message-----
From: Amonson, Paul D
Sent: Wednesday, May 1, 2024 8:56 AM
To: pgsql-hackers@lists.postgresql.org
Cc: Nathan Bossart <nathandbossart@gmail.com>; Shankaran, Akash
<akash.shankaran@intel.com>
Subject: Proposal for Updating CRC32C with AVX-512 Algorithm.

#3Daniel Gustafsson
daniel@yesql.se
In reply to: Amonson, Paul D (#2)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On 17 May 2024, at 18:21, Amonson, Paul D <paul.d.amonson@intel.com> wrote:

Hi, forgive the top-post but I have not seen any response to this post?

The project is currently in feature-freeze in preparation for the next major
release so new development and ideas are not the top priority right now.
Additionally there is a large developer meeting shortly which many are busy
preparing for. Exercise some patience, and I'm sure there will be follow-ups
to this once development of postgres v18 picks up.

--
Daniel Gustafsson

#4Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Daniel Gustafsson (#3)
2 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

The project is currently in feature-freeze in preparation for the next major
release so new development and ideas are not the top priority right now.
Additionally there is a large developer meeting shortly which many are busy
preparing for. Exercise some patience, and I'm sure there will be follow-ups
to this once development of postgres v18 picks up.

Thanks, understood.

I had our internal OSS team, who are experts in OSS licensing, review possible conflicts between the PostgreSQL license and the BSD 3-Clause-like license for the CRC32C AVX-512 code, and they found no issues. Therefore, including the new license in the PostgreSQL codebase should be acceptable.

I am attaching the first official patches. The second patch is a simple test function in PostgreSQL SQL, which I used for testing and benchmarking. It will not be merged.

Code Structure Question: While working on this code, I noticed overlaps with the runtime CPU checks done in the previously merged POPCNT code. I think these checks should perhaps be formalized and consolidated into a single source/header file pair. If this is desirable, where should I place these files? Should they go in "src/port", where they are used, or in "src/common", where they are available to all (not just the "src/port" tree)?
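
To make the question concrete, here is a minimal sketch of what one consolidated runtime check could look like. It is not code from the attached patch: the function names are hypothetical, and it assumes GCC or Clang on x86 (for <cpuid.h> and inline assembly). It checks CPUID for SSE 4.2, AVX512F, AVX512VL and VPCLMULQDQ, and checks XCR0 to confirm that the OS saves the ZMM register state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Hypothetical helper: true if the AVX-512 CRC32C path can be used. */
bool
cpu_supports_avx512_crc32c(void)
{
#ifdef HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    uint32_t    xcr0_lo, xcr0_hi;

    /* Leaf 1: SSE 4.2 is ECX bit 20, OSXSAVE is ECX bit 27. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    if (!(ecx & (1u << 20)) || !(ecx & (1u << 27)))
        return false;

    /* XCR0 bits 1, 2, 5, 6, 7: SSE, AVX, opmask and both ZMM state parts. */
    __asm__ volatile("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
    (void) xcr0_hi;
    if ((xcr0_lo & 0xE6u) != 0xE6u)
        return false;

    /* Leaf 7, subleaf 0: AVX512F (EBX 16), AVX512VL (EBX 31), VPCLMULQDQ (ECX 10). */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx & (1u << 16)) && (ebx & (1u << 31)) && (ecx & (1u << 10));
#else
    return false;               /* not x86: the AVX-512 path never applies */
#endif
}

/* Hypothetical cached wrapper so CPUID runs only once per process. */
bool
cpu_supports_avx512_crc32c_cached(void)
{
    static int  cached = -1;

    if (cached < 0)
        cached = cpu_supports_avx512_crc32c() ? 1 : 0;
    assert(cached == 0 || cached == 1);
    return cached == 1;
}
```

A shared file pair could expose one such function per feature set (POPCNT, CRC32C and so on), with each "choose" routine calling the cached form.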

Thanks,
Paul

Attachments:

0001-v2-Feat-Add-AVX512-crc32c-algorithm-to-postgres.patch (application/octet-stream)
From 067c488c4355ba89f2d5460e72ae44d99229119b Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 21 May 2024 13:23:39 -0700
Subject: [PATCH] [Feat] Add-AVX512 crc32c algorithm to postgres

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4               |  48 +++++++
 configure                          | 223 +++++++++++++++++++++++------
 configure.ac                       | 106 +++++++++-----
 meson.build                        |  41 +++++-
 src/include/pg_config.h.in         |   3 +
 src/include/port/pg_crc32c.h       |  24 +++-
 src/port/Makefile                  |  10 ++
 src/port/meson.build               |   4 +
 src/port/pg_crc32c_avx512.c        | 222 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_avx512_choose.c | 202 ++++++++++++++++++++++++++
 10 files changed, 797 insertions(+), 86 deletions(-)
 create mode 100644 src/port/pg_crc32c_avx512.c
 create mode 100644 src/port/pg_crc32c_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..1d33932cb5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -628,6 +628,54 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ----------------------------
+# Check if the compiler supports the AVX-512 and VPCLMULQDQ intrinsics used
+# by the AVX-512 CRC32C implementation, such as:
+#
+# _mm512_clmulepi64_epi128, _mm512_ternarylogic_epi64, _mm_xor_epi64,
+# and the SSE 4.2 _mm_crc32_u64 (so this check is x86-64 only).
+#
+# Optional compiler flags can be passed as the argument (e.g. -msse4.2
+# -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
+# pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const unsigned long k1k2[[8]] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[[512]];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
 
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
diff --git a/configure b/configure
index 7b03db56a6..45cd755867 100755
--- a/configure
+++ b/configure
@@ -14898,7 +14898,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14944,7 +14944,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14968,7 +14968,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15013,7 +15013,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15037,7 +15037,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17774,6 +17774,123 @@ fi
 
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if _mm512_clmulepi64_epi128 and the related intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to "-msse4.2 -mavx512vl -mvpclmulqdq" if that's required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics_=yes
+else
+  pgac_cv_avx512_crc32_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
+  CFLAGS_CRC=""
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+else
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
+  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17946,31 +18063,42 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+   # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -17989,44 +18117,53 @@ $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
 $as_echo "SSE 4.2" >&6; }
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
+$as_echo "AVX 512 with runtime check" >&6; }
+  else
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
 $as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+    else
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      else
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+        else
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+          else
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
+          fi
         fi
       fi
     fi
diff --git a/configure.ac b/configure.ac
index 63e7be3847..73ea4d95dd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2124,6 +2124,17 @@ if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
   PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if _mm512_clmulepi64_epi128 and the related intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to "-msse4.2 -mavx512vl -mvpclmulqdq" if that's required.
+PGAC_AVX512_CRC32_INTRINSICS([])
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2169,31 +2180,42 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+   # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -2208,29 +2230,35 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
   PG_CRC32C_OBJS="pg_crc32c_sse42.o"
   AC_MSG_RESULT(SSE 4.2)
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    AC_MSG_RESULT(AVX 512 with runtime check)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      AC_MSG_RESULT(SSE 4.2 with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        AC_MSG_RESULT(ARMv8 CRC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
         else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            AC_MSG_RESULT(LoongArch CRCC instructions)
+          else
+            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            AC_MSG_RESULT(slicing-by-8)
+          fi
         fi
       fi
     fi
diff --git a/meson.build b/meson.build
index f5ca5cfed4..f7e9eb5ecb 100644
--- a/meson.build
+++ b/meson.build
@@ -2109,6 +2109,34 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
+    avx_prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+  const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+}
+'''
 
     prog = '''
 #include <nmmintrin.h>
@@ -2122,13 +2150,20 @@ int main(void)
     return crc == 0;
 }
 '''
-
-    if cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(avx_prog,
+          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
+          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
+      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
+      cdata.set('USE_AVX512_CRC32C', false)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
+    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
+    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
           args: test_c_args + ['-msse4.2'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f8d3e3b6b8..6e08f1c6c7 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -738,6 +738,9 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..b632ac7d59 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -49,6 +49,14 @@ typedef uint32 pg_crc32c;
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C)
+/* Use Intel AVX512 instructions. */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_ARMV8_CRC32C)
 /* Use ARMv8 CRC Extension instructions. */
 
@@ -67,6 +75,21 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+/*
+ * Use Intel AVX-512 instructions, but perform a runtime check first to check that
+ * they are available.
+ */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
@@ -86,7 +109,6 @@ extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t le
 #ifdef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 #endif
-
 #else
 /*
  * Use slicing-by-8 algorithm.
diff --git a/src/port/Makefile b/src/port/Makefile
index db7c02117b..7ae632c6fc 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -88,11 +88,21 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
+pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
+
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512_choose.o need CFLAGS_XSAVE
+pg_crc32c_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_crc32c_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_crc32c_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
 # all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
 pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
 pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
diff --git a/src/port/meson.build b/src/port/meson.build
index fd9ee199d1..d635913e9b 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -83,6 +83,10 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'xsave'],
+  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..085c8d99a8
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,222 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * Portions Copyright (c) 2024, Intel(r) Corporation
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+/*
+ * Compute CRC-32C using the SSE 4.2 CRC32 instructions.
+ *
+ * This handles buffers shorter than 256 bytes and whatever tail is left
+ * after the AVX-512 folding loops in pg_comp_crc32c_avx512(). The access
+ * pattern is described in the comment inside the function.
+ */
+pg_attribute_no_sanitize_alignment()
+inline
+static
+pg_crc32c
+crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
+{
+	const unsigned char *pend = p + length;
+
+	/*
+	 * Process eight bytes of data at a time.
+	 *
+	 * NB: We do unaligned accesses here. The Intel architecture allows that,
+	 * and performance testing didn't show any performance gain from aligning
+	 * the begin address.
+	 */
+	while (p + 8 <= pend)
+	{
+		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
+		p += 8;
+	}
+
+	/* Process remaining full four bytes if any */
+	if (p + 4 <= pend)
+	{
+		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
+		p += 4;
+	}
+
+	/* Process any remaining bytes one at a time. */
+	while (p < pend)
+	{
+		crc = _mm_crc32_u8(crc, *p);
+		p++;
+	}
+
+	return crc;
+}
+
+/*******************************************************************
+ * pg_comp_crc32c_avx512(): compute the crc32c of the buffer. The AVX-512
+ * path needs at least 256 bytes and folds 64 bytes at a time; shorter
+ * buffers and any tail go through crc32c_fallback(). Based on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009,
+ *  https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
+ *
+ * This Function:
+ * Copyright 2017 The Chromium Authors
+ * Copyright (c) 2024, Intel(r) Corporation
+ *
+ * Use of this source code is governed by a BSD-style license that can be
+ * found in the Chromium source repository LICENSE file.
+ * https://chromium.googlesource.com/chromium/src/+/refs/heads/main/LICENSE
+ */
+pg_attribute_no_sanitize_alignment()
+inline
+pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+		/*
+		 * There's at least one block of 256.
+		 */
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_loadu_si512((__m512i *)k1k2);	/* table is not guaranteed 64-byte aligned */
+
+		input += 256;
+		length -= 256;
+
+		/*
+		 * Parallel fold blocks of 256, if any.
+		 */
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k9k10);	/* table is not guaranteed 64-byte aligned */
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_loadu_si512((__m512i *)k3k4);	/* table is not guaranteed 64-byte aligned */
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes.
+	 */
+	return crc32c_fallback(crc, input, length);
+}
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
new file mode 100644
index 0000000000..d5ccb69d10
--- /dev/null
+++ b/src/port/pg_crc32c_avx512_choose.c
@@ -0,0 +1,202 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512_choose.c
+ *	  Choose between Intel AVX-512 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel AVX-
+ * 512. If it does, use the special AVX-512 instructions for CRC-32C
+ * computation. Otherwise, fall back to the pure software implementation
+ * (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * Portions Copyright (c) 2024, Intel(r) Corp.
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_crc32c.h"
+
+typedef unsigned int exx_t;
+
+/*
+ * Helper function.
+ * Test for a bit being set in a exx_t field.
+ */
+inline
+static
+bool
+is_bit_set(exx_t reg, int bit)
+{
+	return (reg & (1U << bit)) != 0;	/* unsigned shift: bit 31 is tested too */
+}
+
+/*
+ * Intel Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline
+static
+void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid((int *) exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Intel Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline
+static
+void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex((int *) exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline
+static
+bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set(exx[2], 20); /* sse4.2 */
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline
+static
+bool
+osxsave_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set(exx[2], 27); /* osxsave */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline
+static
+bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set(exx[1], 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline
+static
+bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set(exx[1], 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline
+static
+bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set(exx[1], 31); /* avx512-vl */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+inline
+static
+bool
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
+}
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static
+pg_crc32c
+pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
+{
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+	else
+		pg_comp_crc32c = pg_comp_crc32c_sb8;
+
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
-- 
2.34.1

Attachment: 0002-Test-Add-a-Postgres-SQL-function-for-crc32c-testing.patch (application/octet-stream)
From 13e19b131002a89b36ea8533f3e05a77c9f6de22 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH] [Test] Add a Postgres SQL function for crc32c testing.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 src/test/modules/test_crc32c/Makefile         | 20 ++++++++++
 .../modules/test_crc32c/test_crc32c--1.0.sql  |  1 +
 src/test/modules/test_crc32c/test_crc32c.c    | 39 +++++++++++++++++++
 .../modules/test_crc32c/test_crc32c.control   |  4 ++
 4 files changed, 64 insertions(+)
 create mode 100644 src/test/modules/test_crc32c/Makefile
 create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

diff --git a/src/test/modules/test_crc32c/Makefile b/src/test/modules/test_crc32c/Makefile
new file mode 100644
index 0000000000..5b747c6184
--- /dev/null
+++ b/src/test/modules/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o:	CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_crc32c/test_crc32c--1.0.sql b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 0000000000..32f8f0fb2e
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c  (count int, num int) RETURNS bigint AS 'test_crc32c.so' LANGUAGE C;
diff --git a/src/test/modules/test_crc32c/test_crc32c.c b/src/test/modules/test_crc32c/test_crc32c.c
new file mode 100644
index 0000000000..477f198316
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.c
@@ -0,0 +1,41 @@
+/* select drive_crc32c(1000000, 1024); */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+
+#include "port/pg_crc32c.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64			count	= PG_GETARG_INT32(0);	/* SQL signature declares int */
+	int64			num		= PG_GETARG_INT32(1);	/* SQL signature declares int */
+	pg_crc32c		crc		= 0xFFFFFFFF;
+	const char*		data	= malloc((size_t)num);
+
+	INIT_CRC32C(crc);
+
+	while(count--)
+	{
+		memset((void*)data, count, (size_t)Min(16,num));
+		crc = COMP_CRC32C(crc, data, num);
+	}
+
+	FIN_CRC32C(crc);
+
+	free((void *)data);
+
+	PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/src/test/modules/test_crc32c/test_crc32c.control b/src/test/modules/test_crc32c/test_crc32c.control
new file mode 100644
index 0000000000..878a077ee1
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_crc32c'
+relocatable = true
-- 
2.34.1

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amonson, Paul D (#4)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

"Amonson, Paul D" <paul.d.amonson@intel.com> writes:

I had our OSS internal team, who are experts in OSS licensing, review possible conflicts between the PostgreSQL license and the BSD-Clause 3-like license for the CRC32C AVX-512 code, and they found no issues. Therefore, including the new license into the PostgreSQL codebase should be acceptable.

Maybe you should get some actual lawyers to answer this type of
question. The Chromium license this code cites is 3-clause-BSD
style, which is NOT compatible: the "advertising" clause is
significant.

In any case, writing copyright notices that are pointers to
external web pages is not how it's done around here. We generally
operate on the assumption that the Postgres source code will
outlive any specific web site. Dead links to incidental material
might be okay, but legally relevant stuff not so much.

regards, tom lane

#6Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#5)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Wed, Jun 12, 2024 at 02:08:02PM -0400, Tom Lane wrote:

"Amonson, Paul D" <paul.d.amonson@intel.com> writes:

I had our OSS internal team, who are experts in OSS licensing, review possible conflicts between the PostgreSQL license and the BSD-Clause 3-like license for the CRC32C AVX-512 code, and they found no issues. Therefore, including the new license into the PostgreSQL codebase should be acceptable.

Maybe you should get some actual lawyers to answer this type of
question. The Chromium license this code cites is 3-clause-BSD
style, which is NOT compatible: the "advertising" clause is
significant.

In any case, writing copyright notices that are pointers to
external web pages is not how it's done around here. We generally
operate on the assumption that the Postgres source code will
outlive any specific web site. Dead links to incidental material
might be okay, but legally relevant stuff not so much.

Agreed. The licenses are compatible in the sense that they can be
combined to create a unified work, but they cannot be combined without
modifying the license of the combined work. You would need to combine
the Postgres and Chrome licenses for this, and I highly doubt we are
going to be modifying the Postgres license for this.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#7Andres Freund
andres@anarazel.de
In reply to: Amonson, Paul D (#4)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi,

I wonder if this isn't going in the wrong direction. We're using CRCs for
something they're not well suited for in my understanding - and are paying a
reasonably high price for it, given that even hardware accelerated CRCs aren't
blazingly fast.

CRCs are used for things like ethernet, iSCSI because they are good at
detecting the kinds of errors encountered, namely short bursts of
bitflips. And the covered data is limited to a fairly small limit.

Which imo makes CRCs a bad choice for WAL. For one, we don't actually expect a
short burst of bitflips, the most likely case is all bits after some point
changing (because only one part of the record made it to disk). For another,
WAL records are *not* limited to a small size, and if anything error detection
becomes more important with longer records (they're likely to span more
pages / segments).

It's hard to understand, but a nonetheless helpful page is
https://users.ece.cmu.edu/~koopman/crc/crc32.html which lists properties for
crc32c:
https://users.ece.cmu.edu/~koopman/crc/c32/0x8f6e37a0_len.txt
which lists
(0x8f6e37a0; 0x11edc6f41) <=> (0x82f63b78; 0x105ec76f1) {2147483615,2147483615,5243,5243,177,177,47,47,20,20,8,8,6,6,1,1} | gold | (*op) iSCSI; CRC-32C; CRC-32/4

This cryptic notation AFAIU indicates that for our polynomial we can detect 2bit
errors up to a length of 2147483615 bytes, 3 bit errors up to 2147483615, 3
and 4 bit errors up to 5243, 5 and 6 bit errors up to 177, 7/8 bit errors up
to 47.

IMO for our purposes just about all errors are going to be at least at sector
boundaries, i.e. 512 bytes and thus are at least 8 bit large. At that point we
are only guaranteed to find a single-byte error (it'll be common to have
much more) up to a length of 47 bits. Which isn't a useful guarantee.

With that I perhaps have established that CRC guarantees aren't useful for us.
But not yet why we should use something else: Given that we already aren't
relying on hard guarantees, we could instead just use a fast hash like xxh3.
https://github.com/Cyan4973/xxHash which is fast both for large and small
amounts of data.

Greetings,

Andres Freund

#8Andres Freund
andres@anarazel.de
In reply to: Amonson, Paul D (#1)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi,

On 2024-05-01 15:56:08 +0000, Amonson, Paul D wrote:

Comparing the current SSE4.2 implementation of the CRC32C algorithm in
Postgres, to an optimized AVX-512 algorithm [0] we observed significant
gains. The result was a ~6.6X average multiplier of increased performance
measured on 3 different Intel products. Details below. The AVX-512 algorithm
in C is a port of the ISA-L library [1] assembler code.

Workload call size distribution details (write heavy):
* Average was approximately around 1,010 bytes per call
* ~80% of the calls were under 256 bytes
* ~20% of the calls were greater than or equal to 256 bytes up to the max buffer size of 8192

This is extremely workload dependent, it's not hard to find workloads with
lots of very small records and very few big ones... What you observed might
have "just" been the warmup behaviour where more full page writes have to be
written.

There's a very frequent call computing COMP_CRC32C over just 20 bytes, while
holding a crucial lock. If we were to introduce something like this
AVX-512 algorithm, it'd probably be worth dispatching differently in case of
compile-time known small lengths.
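
One way such a dispatch could look, as a rough sketch only: a macro that uses the GCC/Clang builtin __builtin_constant_p to send compile-time-constant small lengths straight to an inlinable short path and everything else to the general routine. The names crc32c_small, crc32c_large, CRC32C_SMALL_LIMIT and COMP_CRC32C_DISPATCH are hypothetical and not part of the patch, and both paths are backed here by a plain bitwise CRC-32C so the sketch runs on any machine; in a real build they would be the SSE 4.2 and AVX-512 routines.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32C (reflected polynomial 0x82F63B78). It stands in for the
 * hardware-accelerated routines so that the sketch is portable.
 */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = (const unsigned char *) data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Counters only exist to make the chosen path observable in this sketch. */
static int	small_calls = 0;
static int	large_calls = 0;

/* Hypothetical short path: would be the inlined CRC32 instruction loop. */
static inline uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
	small_calls++;
	return crc32c_bitwise(crc, data, len);
}

/* Hypothetical general path: would be the runtime-selected implementation. */
static uint32_t
crc32c_large(uint32_t crc, const void *data, size_t len)
{
	large_calls++;
	return crc32c_bitwise(crc, data, len);
}

/* Assumed threshold, taken from the 256-byte minimum discussed upthread. */
#define CRC32C_SMALL_LIMIT 256

#if defined(__GNUC__) || defined(__clang__)
#define LEN_IS_SMALL_CONSTANT(len) \
	(__builtin_constant_p(len) && (len) < CRC32C_SMALL_LIMIT)
#else
#define LEN_IS_SMALL_CONSTANT(len) 0	/* no builtin: always general path */
#endif

/* Same shape as COMP_CRC32C: updates crc in place. */
#define COMP_CRC32C_DISPATCH(crc, data, len) \
	((crc) = LEN_IS_SMALL_CONSTANT(len) \
		? crc32c_small((crc), (data), (len)) \
		: crc32c_large((crc), (data), (len)))
```

With something like this, a call site such as COMP_CRC32C_DISPATCH(crc, hdr, 20) would never touch the function pointer, while variable-length callers keep the existing behaviour. Whether the compiler actually folds the branch away depends on optimization level, so this would need measuring.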

How does the latency of the AVX-512 algorithm compare to just using the CRC32C
instruction?

FWIW, I tried the v2 patch on my Xeon Gold 5215 workstation, and it dies early on
with SIGILL:

Program terminated with signal SIGILL, Illegal instruction.
#0 0x0000000000d5946c in _mm512_clmulepi64_epi128 (__A=..., __B=..., __C=0)
at /home/andres/build/gcc/master/install/lib/gcc/x86_64-pc-linux-gnu/15/include/vpclmulqdqintrin.h:42
42 return (__m512i) __builtin_ia32_vpclmulqdq_v8di ((__v8di)__A,
(gdb) bt
#0 0x0000000000d5946c in _mm512_clmulepi64_epi128 (__A=..., __B=..., __C=0)
at /home/andres/build/gcc/master/install/lib/gcc/x86_64-pc-linux-gnu/15/include/vpclmulqdqintrin.h:42
#1 pg_comp_crc32c_avx512 (crc=<optimized out>, data=<optimized out>, length=<optimized out>)
at ../../../../../home/andres/src/postgresql/src/port/pg_crc32c_avx512.c:163
#2 0x0000000000819343 in ReadControlFile () at ../../../../../home/andres/src/postgresql/src/backend/access/transam/xlog.c:4375
#3 0x000000000081c4ac in LocalProcessControlFile (reset=<optimized out>) at ../../../../../home/andres/src/postgresql/src/backend/access/transam/xlog.c:4817
#4 0x0000000000a8131d in PostmasterMain (argc=argc@entry=85, argv=argv@entry=0x341b08f0)
at ../../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:902
#5 0x00000000009b53fe in main (argc=85, argv=0x341b08f0) at ../../../../../home/andres/src/postgresql/src/backend/main/main.c:197

Cascade Lake doesn't have vpclmulqdq, so we shouldn't be getting here...

This is on an optimized build with meson, with -march=native included in
c_flags.

Relevant configure output:

Checking if "XSAVE intrinsics without -mxsave" : links: NO (cached)
Checking if "XSAVE intrinsics with -mxsave" : links: YES (cached)
Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO (cached)
Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES (cached)
Checking if "_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq" : links: YES
Checking if "x86_64: popcntq instruction" compiles: YES (cached)

Greetings,

Andres Freund

#9Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Andres Freund (#8)
1 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

-----Original Message-----
From: Andres Freund <andres@anarazel.de>
Sent: Wednesday, June 12, 2024 1:12 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>

FWIW, I tried the v2 patch on my Xeon Gold 5215 workstation, and it dies early
on with SIGILL:

Nice catch! I was testing the vpclmulqdq bit in EBX instead of the correct ECX register. New patch attached. I added defines in place of the bare index numbers to make this type of bug easier to spot, and I double-checked the other feature bits as well.
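
For readers following along, the register mix-up can be illustrated with a minimal sketch. It is not the code from the attached patch, and the macro and function names are made up for illustration. It assumes GCC or Clang on x86, where <cpuid.h> provides __get_cpuid_count. In CPUID leaf 7, sub-leaf 0, AVX512F and AVX512VL are reported in EBX (bits 16 and 31), while VPCLMULQDQ is reported in ECX (bit 10). A complete runtime check would also confirm via OSXSAVE and XGETBV that the OS saves ZMM state, which is omitted here.

```c
#include <cpuid.h>
#include <stdbool.h>

/* CPUID.(EAX=7,ECX=0) feature bits, named so the register is explicit. */
#define CPUID7_EBX_AVX512F		(1u << 16)
#define CPUID7_EBX_AVX512VL		(1u << 31)
#define CPUID7_ECX_VPCLMULQDQ	(1u << 10)

/* Pure helper: decide from already-fetched leaf 7 register values. */
static inline bool
avx512_crc_bits_ok(unsigned int ebx, unsigned int ecx)
{
	return (ebx & CPUID7_EBX_AVX512F) != 0 &&
		(ebx & CPUID7_EBX_AVX512VL) != 0 &&
		(ecx & CPUID7_ECX_VPCLMULQDQ) != 0;	/* ECX, not EBX */
}

/* Query the running CPU; false if leaf 7 is not supported at all. */
static inline bool
cpu_has_avx512_vpclmulqdq(void)
{
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return avx512_crc_bits_ok(ebx, ecx);
}
```

On a CPU like the Cascade Lake machine above, which reports AVX512F and AVX512VL but not VPCLMULQDQ, a check along these lines would be expected to return false, so the AVX-512 CRC path would not be selected.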

Paul

Attachments:

0001-v3-Feat-Add-AVX512-crc32c-algorithm-to-postgres.patch (application/octet-stream)
From 0b8f9851f38444dbe29009120d98cd38e93efe7f Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 21 May 2024 13:23:39 -0700
Subject: [PATCH] [Feat] Add-AVX512 crc32c algorithm to postgres

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4               |  48 +++++++
 configure                          | 223 +++++++++++++++++++++++------
 configure.ac                       | 106 +++++++++-----
 meson.build                        |  41 +++++-
 src/include/pg_config.h.in         |   3 +
 src/include/port/pg_crc32c.h       |  24 +++-
 src/port/Makefile                  |  10 ++
 src/port/meson.build               |   4 +
 src/port/pg_crc32c_avx512.c        | 222 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_avx512_choose.c | 202 ++++++++++++++++++++++++++
 10 files changed, 797 insertions(+), 86 deletions(-)
 create mode 100644 src/port/pg_crc32c_avx512.c
 create mode 100644 src/port/pg_crc32c_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..1d33932cb5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -628,6 +628,54 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86 CRC instructions added in AVX-512,
+# using the intrinsic functions:
+
+# (We don't test the 8-byte variant, _mm_crc32_u64, but it is assumed to
+# be present if the other ones are, on x86-64 platforms)
+#
+# An optional compiler flag can be passed as arguments (e.g. -msse4.2
+# -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
+# pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const unsigned long k1k2[[8]] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[[512]];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
 
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
diff --git a/configure b/configure
index 7b03db56a6..45cd755867 100755
--- a/configure
+++ b/configure
@@ -14898,7 +14898,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14944,7 +14944,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14968,7 +14968,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15013,7 +15013,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15037,7 +15037,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17774,6 +17774,123 @@ fi
 
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2, -mavx512vl and -mvpclmulqdq if that's required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics_=yes
+else
+  pgac_cv_avx512_crc32_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
+  CFLAGS_CRC=""
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+else
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
+  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17946,31 +18063,42 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+   # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -17989,44 +18117,53 @@ $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
 $as_echo "SSE 4.2" >&6; }
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
+$as_echo "AVX 512 with runtime check" >&6; }
+  else
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
 $as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+    else
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      else
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+        else
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+          else
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
+          fi
         fi
       fi
     fi
diff --git a/configure.ac b/configure.ac
index 63e7be3847..73ea4d95dd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2124,6 +2124,17 @@ if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
   PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2, -mavx512vl and -mvpclmulqdq if that's required.
+PGAC_AVX512_CRC32_INTRINSICS([])
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2169,31 +2180,42 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+   # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -2208,29 +2230,35 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
   PG_CRC32C_OBJS="pg_crc32c_sse42.o"
   AC_MSG_RESULT(SSE 4.2)
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    AC_MSG_RESULT(AVX 512 with runtime check)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      AC_MSG_RESULT(SSE 4.2 with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        AC_MSG_RESULT(ARMv8 CRC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
         else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            AC_MSG_RESULT(LoongArch CRCC instructions)
+          else
+            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            AC_MSG_RESULT(slicing-by-8)
+          fi
         fi
       fi
     fi
diff --git a/meson.build b/meson.build
index f9279c837d..a2b087d561 100644
--- a/meson.build
+++ b/meson.build
@@ -2144,6 +2144,34 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
+    avx_prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+  const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+}
+'''
 
     prog = '''
 #include <nmmintrin.h>
@@ -2157,13 +2185,20 @@ int main(void)
     return crc == 0;
 }
 '''
-
-    if cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(avx_prog,
+          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
+          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
+      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
+      cdata.set('USE_AVX512_CRC32C', false)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
+    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
+    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
           args: test_c_args + ['-msse4.2'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f8d3e3b6b8..6e08f1c6c7 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -738,6 +738,9 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..b632ac7d59 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -49,6 +49,14 @@ typedef uint32 pg_crc32c;
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined (USE_AVX512_CRC32)
+/* Use Intel AVX512 instructions. */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_ARMV8_CRC32C)
 /* Use ARMv8 CRC Extension instructions. */
 
@@ -67,6 +75,21 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+/*
+ * Use Intel AVX-512 instructions, but perform a runtime check first to check that
+ * they are available.
+ */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
@@ -86,7 +109,6 @@ extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t le
 #ifdef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 #endif
-
 #else
 /*
  * Use slicing-by-8 algorithm.
diff --git a/src/port/Makefile b/src/port/Makefile
index db7c02117b..7ae632c6fc 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -88,11 +88,21 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
+pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
+
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512_choose.o need CFLAGS_XSAVE
+pg_crc32c_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_crc32c_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_crc32c_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
 # all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
 pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
 pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
diff --git a/src/port/meson.build b/src/port/meson.build
index fd9ee199d1..d635913e9b 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -83,6 +83,10 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'xsave'],
+  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..085c8d99a8
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,222 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * Portions Copyright (c) 2024, Intel(r) Corporation
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+/*
+ * Compute CRC-32C over a short buffer, or over the tail left by the
+ * AVX-512 path, using the SSE 4.2 CRC32 instructions.
+ *
+ * This handles the bytes that pg_comp_crc32c_avx512() does not fold:
+ * inputs shorter than 256 bytes and any remainder shorter than 64 bytes.
+ */
+pg_attribute_no_sanitize_alignment()
+inline
+static
+pg_crc32c
+crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
+{
+	const unsigned char *pend = p + length;
+
+	/*
+	 * Process eight bytes of data at a time.
+	 *
+	 * NB: We do unaligned accesses here. The Intel architecture allows that,
+	 * and performance testing didn't show any performance gain from aligning
+	 * the begin address.
+	 */
+	while (p + 8 <= pend)
+	{
+		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
+		p += 8;
+	}
+
+	/* Process remaining full four bytes if any */
+	if (p + 4 <= pend)
+	{
+		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
+		p += 4;
+	}
+
+	/* Process any remaining bytes one at a time. */
+	while (p < pend)
+	{
+		crc = _mm_crc32_u8(crc, *p);
+		p++;
+	}
+
+	return crc;
+}
+
+/*******************************************************************
+ * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
+ * buffer length must be at least 256, and a multiple of 64. Based
+ * on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009,
+ *  https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
+ *
+ * This Function:
+ * Copyright 2017 The Chromium Authors
+ * Copyright (c) 2024, Intel(r) Corporation
+ *
+ * Use of this source code is governed by a BSD-style license that can be
+ * found in the Chromium source repository LICENSE file.
+ * https://chromium.googlesource.com/chromium/src/+/refs/heads/main/LICENSE
+ */
+pg_attribute_no_sanitize_alignment()
+inline
+pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+		/*
+		 * There's at least one block of 256.
+		 */
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_load_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		 * Parallel fold blocks of 256, if any.
+		 */
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_load_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_load_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes.
+	 */
+	return crc32c_fallback(crc, input, length);
+}
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
new file mode 100644
index 0000000000..d5ccb69d10
--- /dev/null
+++ b/src/port/pg_crc32c_avx512_choose.c
@@ -0,0 +1,202 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512_choose.c
+ *	  Choose between Intel AVX-512 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel AVX-
+ * 512. If it does, use the special AVX-512 instructions for CRC-32C
+ * computation. Otherwise, fall back to the pure software implementation
+ * (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * Portions Copyright (c) 2024, Intel(r) Corp.
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_crc32c.h"
+
+typedef unsigned int exx_t;
+
+/*
+ * Helper function.
+ * Test for a bit being set in a exx_t field.
+ */
+inline
+static
+bool
+is_bit_set(exx_t reg, int bit)
+{
+	return (reg & ((exx_t) 1 << bit)) != 0;
+}
+
+/*
+ * Intel Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline
+static
+void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid((int *) exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Intel Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline
+static
+void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex((int *) exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline
+static
+bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set(exx[2], 20); /* sse4.2 */
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline
+static
+bool
+osxsave_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set(exx[2], 27); /* osxsave */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline
+static
+bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set(exx[1], 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline
+static
+bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set(exx[1], 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline
+static
+bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set(exx[1], 31); /* avx512-vl */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns
+ * true before calling this.
+ */
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+inline
+static
+bool
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
+}
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static
+pg_crc32c
+pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
+{
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+	else
+		pg_comp_crc32c = pg_comp_crc32c_sb8;
+
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
-- 
2.34.1

#10Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Andres Freund (#8)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

This is extremely workload dependent; it's not hard to find workloads with
lots of very small records and very few big ones... What you observed might
have "just" been the warmup behaviour where more full page writes have to
be written.

Can you tell me how to avoid capturing this "warm-up" so that the numbers are more accurate?

There's a very frequent call computing COMP_CRC32C over just 20 bytes, while
holding a crucial lock. If we were to introduce something like this
AVX-512 algorithm, it'd probably be worth dispatching differently in case of
compile-time known small lengths.

So are you suggesting that we call the 64/32-bit based algorithm directly from these known small-length cases in the code? I think we can do that by exposing a separate API.
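To make the dispatch idea concrete, here is a minimal sketch. It is not code from the patch: `crc32c_small`, `crc32c_indirect`, `CRC32C_DISPATCH`, and the 64-byte `SMALL_LEN_LIMIT` are hypothetical names and values, a bitwise CRC-32C loop stands in for both the SSE 4.2 and the AVX-512 paths, and `__builtin_constant_p` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff below which the indirect call is skipped. */
#define SMALL_LEN_LIMIT 64

/*
 * Bitwise CRC-32C (reflected polynomial 0x82F63B78). A stand-in for the
 * inlinable SSE 4.2 loop that a real build would use for small inputs.
 */
static inline uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	assert(data != NULL || len == 0);
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Stand-in for whatever the runtime check selected (AVX-512, SSE 4.2, sb8). */
static uint32_t
crc32c_runtime_choice(uint32_t crc, const void *data, size_t len)
{
	return crc32c_small(crc, data, len);
}

/* Plays the role of the pg_comp_crc32c function pointer in the patch. */
static uint32_t (*crc32c_indirect) (uint32_t, const void *, size_t) =
	crc32c_runtime_choice;

/*
 * If the length is a compile-time constant below the cutoff, call the small
 * routine directly; otherwise go through the function pointer. Note that
 * "len" may be evaluated more than once, so it must have no side effects.
 */
#if defined(__GNUC__) || defined(__clang__)
#define CRC32C_DISPATCH(crc, data, len) \
	((__builtin_constant_p(len) && (len) < SMALL_LEN_LIMIT) \
	 ? crc32c_small((crc), (data), (len)) \
	 : crc32c_indirect((crc), (data), (len)))
#else
#define CRC32C_DISPATCH(crc, data, len) \
	crc32c_indirect((crc), (data), (len))
#endif
```

A caller with a fixed 20-byte header would write `crc = CRC32C_DISPATCH(crc, hdr, 20);` and never pay for the indirect call, while callers with variable lengths behave as before.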

How does the latency of the AVX-512 algorithm compare to just using the
CRC32C instruction?

I think I need more information on this one, as I am not sure I understand the use case. Isn't the same function-pointer indirection used with or without the AVX-512 algorithm?

Paul

#11Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Amonson, Paul D (#9)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On 2024-Jun-12, Amonson, Paul D wrote:

+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * Portions Copyright (c) 2024, Intel(r) Corporation
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */

Hmm, I wonder if the "(c) 2024 Intel" line is going to bring us trouble.
(I bet it's not really necessary anyway.)

+/*******************************************************************
+ * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
+ * buffer length must be at least 256, and a multiple of 64. Based
+ * on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009,
+ *  https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
+ *
+ * This Function:
+ * Copyright 2017 The Chromium Authors
+ * Copyright (c) 2024, Intel(r) Corporation
+ *
+ * Use of this source code is governed by a BSD-style license that can be
+ * found in the Chromium source repository LICENSE file.
+ * https://chromium.googlesource.com/chromium/src/+/refs/heads/main/LICENSE
+ */

And this bit doesn't look good. The LICENSE file says:

// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
// * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following disclaimer
// in the documentation and/or other materials provided with the
// distribution.
// * Neither the name of Google LLC nor the names of its
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.

The second clause essentially says we would have to add a page to our
"documentation and/or other materials" with the contents of the license
file.

There are good reasons for UCB to have stopped using the old BSD license,
but apparently Google (or more precisely the Chromium authors) didn't
get the memo.

Our fork distributors spent a lot of time scouring our source, cleaning
up copyrights, a decade or two ago. I bet they won't be happy to see
this sort of thing crop up now.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)

#12Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Alvaro Herrera (#11)
2 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hmm, I wonder if the "(c) 2024 Intel" line is going to bring us trouble.
(I bet it's not really necessary anyway.)

Our lawyer agrees: as a contributor, our copyright is covered by the "PostgreSQL Global Development Group" copyright line.

And this bit doesn't look good. The LICENSE file says:

...

// * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following disclaimer
// in the documentation and/or other materials provided with the
// distribution.

...

The second clause essentially says we would have to add a page to our
"documentation and/or other materials" with the contents of the license file.

According to one of Intel's lawyers, a search of the PostgreSQL repository found 55 instances of this clause. Since the second BSD clause already appears in the PostgreSQL source tree, I assume this obligation has either been satisfied or been determined not to apply. I might have misunderstood the concern, but the lawyer believes this is a non-issue. Could you please clarify the concern?

Thanks,
Paul

Attachments:

0002-v4-Fix-Copyright-and-Licensing-issues.patch (application/octet-stream)
From 76b5c7ce6b0c7ddb6aa4ac1c2b8c05a6702a1975 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 18 Jun 2024 09:00:53 -0700
Subject: [PATCH 2/2] [Fix] Copyright and Licensing issues.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 src/port/pg_crc32c_avx512.c        | 113 +++++++++++++++++------------
 src/port/pg_crc32c_avx512_choose.c |  15 ++--
 2 files changed, 75 insertions(+), 53 deletions(-)

diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
index 085c8d99a8..da1a01b974 100644
--- a/src/port/pg_crc32c_avx512.c
+++ b/src/port/pg_crc32c_avx512.c
@@ -5,7 +5,6 @@
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
- * Portions Copyright (c) 2024, Intel(r) Corporation
  *
  * IDENTIFICATION
  *	  src/port/pg_crc32c_avx512.c
@@ -71,16 +70,36 @@ crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
  *
  * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
  * Instruction"
- *  V. Gopal, E. Ozturk, et al., 2009,
- *  https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
+ *  V. Gopal, E. Ozturk, et al., 2009
  *
- * This Function:
- * Copyright 2017 The Chromium Authors
- * Copyright (c) 2024, Intel(r) Corporation
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
  *
- * Use of this source code is governed by a BSD-style license that can be
- * found in the Chromium source repository LICENSE file.
- * https://chromium.googlesource.com/chromium/src/+/refs/heads/main/LICENSE
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 pg_attribute_no_sanitize_alignment()
 inline
@@ -112,48 +131,48 @@ pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
 		 * to 32 bytes.
 		 * >>> BEGIN
 		 */
-		/*
-		 * There's at least one block of 256.
-		 */
-		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
-		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
-		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
-		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+/*
+ * There's at least one block of 256.
+ */
+x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
 
-		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
 
-		x0 = _mm512_load_si512((__m512i *)k1k2);
+x0 = _mm512_load_si512((__m512i *)k1k2);
 
-		input += 256;
-		length -= 256;
+input += 256;
+length -= 256;
 
-		/*
-		 * Parallel fold blocks of 256, if any.
-		 */
-		while (length >= 256)
-		{
-			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
-			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
-			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
-			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
-
-			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
-			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
-			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
-			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
-
-			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
-			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
-			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
-			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
-
-			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
-			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
-			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
-			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
-
-			input += 256;
-			length -= 256;
+/*
+ * Parallel fold blocks of 256, if any.
+ */
+while (length >= 256)
+{
+	x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+	x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+	x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+	x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+	x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+	x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+	x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+	x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+	y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+	y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+	y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+	y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+	x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+	x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+	x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+	x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+	input += 256;
+	length -= 256;
 		}
 
 		/*
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
index d5ccb69d10..f774522715 100644
--- a/src/port/pg_crc32c_avx512_choose.c
+++ b/src/port/pg_crc32c_avx512_choose.c
@@ -10,7 +10,6 @@
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
- * Portions Copyright (c) 2024, Intel(r) Corp.
  *
  *
  * IDENTIFICATION
@@ -36,6 +35,10 @@
 #include "port/pg_crc32c.h"
 
 typedef unsigned int exx_t;
+#define EAX 0
+#define EBX 1
+#define ECX 2
+#define EDX 3
 
 /*
  * Helper function.
@@ -94,7 +97,7 @@ sse42_available(void)
 	exx_t exx[4] = {0, 0, 0, 0};
 
 	pg_getcpuid(1, exx);
-	return is_bit_set(exx[2], 20); /* sse4.2 */
+	return is_bit_set(exx[ECX], 20); /* sse4.2 */
 }
 
 /*
@@ -108,7 +111,7 @@ osxsave_available(void)
 	exx_t exx[4] = {0, 0, 0, 0};
 
 	pg_getcpuid(1, exx);
-	return is_bit_set(exx[2], 27); /* osxsave */
+	return is_bit_set(exx[ECX], 27); /* osxsave */
 }
 
 /*
@@ -122,7 +125,7 @@ avx512f_available(void)
 	exx_t exx[4] = {0, 0, 0, 0};
 
 	pg_getcpuidex(7, 0, exx);
-	return is_bit_set(exx[1], 16); /* avx512-f */
+	return is_bit_set(exx[EBX], 16); /* avx512-f */
 }
 
 /*
@@ -136,7 +139,7 @@ vpclmulqdq_available(void)
 	exx_t exx[4] = {0, 0, 0, 0};
 
 	pg_getcpuidex(7, 0, exx);
-	return is_bit_set(exx[1], 10); /* vpclmulqdq */
+	return is_bit_set(exx[ECX], 10); /* vpclmulqdq */
 }
 
 /*
@@ -150,7 +153,7 @@ avx512vl_available(void)
 	exx_t exx[4] = {0, 0, 0, 0};
 
 	pg_getcpuidex(7, 0, exx);
-	return is_bit_set(exx[1], 31); /* avx512-vl */
+	return is_bit_set(exx[EBX], 31); /* avx512-vl */
 }
 
 /*
-- 
2.34.1

0001-v4-Feat-Add-AVX512-crc32c-algorithm-to-postgres.patch (application/octet-stream)
From be762f9ad9910e0c9aeaf0f7ed4ee71b2fe8e220 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 21 May 2024 13:23:39 -0700
Subject: [PATCH 1/2] [Feat] Add-AVX512 crc32c algorithm to postgres

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4               |  48 +++++++
 configure                          | 223 +++++++++++++++++++++++------
 configure.ac                       | 106 +++++++++-----
 meson.build                        |  41 +++++-
 src/include/pg_config.h.in         |   3 +
 src/include/port/pg_crc32c.h       |  24 +++-
 src/port/Makefile                  |  10 ++
 src/port/meson.build               |   4 +
 src/port/pg_crc32c_avx512.c        | 222 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_avx512_choose.c | 202 ++++++++++++++++++++++++++
 10 files changed, 797 insertions(+), 86 deletions(-)
 create mode 100644 src/port/pg_crc32c_avx512.c
 create mode 100644 src/port/pg_crc32c_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..1d33932cb5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -628,6 +628,54 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86 CRC instructions added in AVX-512,
+# using the intrinsic functions:
+
+# (We don't test the 8-byte variant, _mm_crc32_u64, but it is assumed to
+# be present if the other ones are, on x86-64 platforms)
+#
+# An optional compiler flag can be passed as arguments (e.g. -msse4.2
+# -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
+# pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const unsigned long k1k2[[8]] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[[512]];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
 
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
diff --git a/configure b/configure
index 7b03db56a6..45cd755867 100755
--- a/configure
+++ b/configure
@@ -14898,7 +14898,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14944,7 +14944,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14968,7 +14968,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15013,7 +15013,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15037,7 +15037,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17774,6 +17774,123 @@ fi
 
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to "-msse4.2 -mavx512vl -mvpclmulqdq" if those are required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics_=yes
+else
+  pgac_cv_avx512_crc32_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
+  CFLAGS_CRC=""
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+else
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
+  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17946,31 +18063,42 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -17989,44 +18117,53 @@ $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
 $as_echo "SSE 4.2" >&6; }
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
+$as_echo "AVX 512 with runtime check" >&6; }
+  else
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
 $as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+    else
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      else
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+        else
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+          else
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
+          fi
         fi
       fi
     fi
diff --git a/configure.ac b/configure.ac
index 63e7be3847..73ea4d95dd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2124,6 +2124,17 @@ if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
   PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to "-msse4.2 -mavx512vl -mvpclmulqdq" if those are required.
+PGAC_AVX512_CRC32_INTRINSICS([])
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2169,31 +2180,42 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -2208,29 +2230,35 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
   PG_CRC32C_OBJS="pg_crc32c_sse42.o"
   AC_MSG_RESULT(SSE 4.2)
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    AC_MSG_RESULT(AVX 512 with runtime check)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      AC_MSG_RESULT(SSE 4.2 with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        AC_MSG_RESULT(ARMv8 CRC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
         else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            AC_MSG_RESULT(LoongArch CRCC instructions)
+          else
+            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            AC_MSG_RESULT(slicing-by-8)
+          fi
         fi
       fi
     fi
diff --git a/meson.build b/meson.build
index 2767abd19e..a1c09cb1e6 100644
--- a/meson.build
+++ b/meson.build
@@ -2144,6 +2144,34 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
+    avx_prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+  const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+}
+'''
 
     prog = '''
 #include <nmmintrin.h>
@@ -2157,13 +2185,20 @@ int main(void)
     return crc == 0;
 }
 '''
-
-    if cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(avx_prog,
+          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
+          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
+      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
+      cdata.set('USE_AVX512_CRC32C', false)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
+    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
+    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
           args: test_c_args + ['-msse4.2'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f8d3e3b6b8..6e08f1c6c7 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -738,6 +738,9 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..b632ac7d59 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -49,6 +49,14 @@ typedef uint32 pg_crc32c;
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C)
+/* Use Intel AVX512 instructions. */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_ARMV8_CRC32C)
 /* Use ARMv8 CRC Extension instructions. */
 
@@ -67,6 +75,21 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+/*
+ * Use Intel AVX-512 instructions, but perform a runtime check first to check that
+ * they are available.
+ */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
@@ -86,7 +109,6 @@ extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t le
 #ifdef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 #endif
-
 #else
 /*
  * Use slicing-by-8 algorithm.
diff --git a/src/port/Makefile b/src/port/Makefile
index db7c02117b..7ae632c6fc 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -88,11 +88,21 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
+pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
+
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512_choose.o need CFLAGS_XSAVE
+pg_crc32c_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_crc32c_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_crc32c_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
 # all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
 pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
 pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
diff --git a/src/port/meson.build b/src/port/meson.build
index fd9ee199d1..d635913e9b 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -83,6 +83,10 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'xsave'],
+  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
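
When reviewing the new file below, it can help to have a slow but obviously correct CRC-32C to compare against. The following is a sketch of such a reference, not code from the patch. The names crc32c_reference and crc32c_reference_full are hypothetical. It uses the reflected Castagnoli polynomial 0x82F63B78, with the same 0xFFFFFFFF initial value and final XOR that FIN_CRC32C applies in pg_crc32c.h.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: update a running CRC-32C one
 * bit at a time. Like the pg_comp_crc32c_* functions, it neither applies the
 * initial value nor the final XOR.
 */
static uint32_t
crc32c_reference(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
		{
			/* Shift right; XOR in the polynomial if the low bit was set. */
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
		}
	}
	return crc;
}

/* Hypothetical convenience wrapper: init, update, finalize in one call. */
static uint32_t
crc32c_reference_full(const void *data, size_t len)
{
	return crc32c_reference(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

The standard check string "123456789" yields 0xE3069283 under CRC-32C, so crc32c_reference_full("123456789", 9) should return that value. The same comparison can be repeated on buffers of 256 bytes and more to exercise the folding loops.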
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..085c8d99a8
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,222 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * Portions Copyright (c) 2024, Intel(r) Corporation
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+/*
+ * crc32c_fallback(): compute CRC-32C with the SSE 4.2 CRC32 instructions.
+ *
+ * Intended for input that the AVX-512 folding code does not cover, such
+ * as buffers shorter than 256 bytes. The comments inside the function
+ * describe the access pattern.
+ */
+pg_attribute_no_sanitize_alignment()
+inline
+static
+pg_crc32c
+crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
+{
+	const unsigned char *pend = p + length;
+
+	/*
+	 * Process eight bytes of data at a time.
+	 *
+	 * NB: We do unaligned accesses here. The Intel architecture allows that,
+	 * and performance testing didn't show any performance gain from aligning
+	 * the begin address.
+	 */
+	while (p + 8 <= pend)
+	{
+		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
+		p += 8;
+	}
+
+	/* Process remaining full four bytes if any */
+	if (p + 4 <= pend)
+	{
+		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
+		p += 4;
+	}
+
+	/* Process any remaining bytes one at a time. */
+	while (p < pend)
+	{
+		crc = _mm_crc32_u8(crc, *p);
+		p++;
+	}
+
+	return crc;
+}
+
+/*******************************************************************
+ * pg_comp_crc32c_avx512(): compute the crc32c of the buffer, where
+ * the buffer length must be at least 256, and a multiple of 64.
+ * Based on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009,
+ *  https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
+ *
+ * This Function:
+ * Copyright 2017 The Chromium Authors
+ * Copyright (c) 2024, Intel(r) Corporation
+ *
+ * Use of this source code is governed by a BSD-style license that can be
+ * found in the Chromium source repository LICENSE file.
+ * https://chromium.googlesource.com/chromium/src/+/refs/heads/main/LICENSE
+ */
+pg_attribute_no_sanitize_alignment()
+inline
+pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+		/*
+		 * There's at least one block of 256.
+		 */
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_loadu_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		 * Parallel fold blocks of 256, if any.
+		 */
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k9k10);	/* array is not 64-byte aligned */
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_loadu_si512((__m512i *)k3k4);	/* array is not 64-byte aligned */
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes.
+	 */
+	return crc32c_fallback(crc, input, length);
+}
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
new file mode 100644
index 0000000000..d5ccb69d10
--- /dev/null
+++ b/src/port/pg_crc32c_avx512_choose.c
@@ -0,0 +1,202 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512_choose.c
+ *	  Choose between Intel AVX-512 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel
+ * AVX-512. If it does, use the special AVX-512 instructions for CRC-32C
+ * computation. Otherwise, fall back to the pure software implementation
+ * (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * Portions Copyright (c) 2024, Intel(r) Corp.
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_crc32c.h"
+
+typedef unsigned int exx_t;
+
+/*
+ * Helper function.
+ * Test for a bit being set in an exx_t field.
+ */
+inline
+static
+bool
+is_bit_set(exx_t reg, int bit)
+{
+	return (reg & (1U << bit)) != 0;	/* 1U: shifting a signed 1 by 31 is undefined */
+}
+
+/*
+ * Intel Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline
+static
+void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Intel Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline
+static
+void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline
+static
+bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set(exx[2], 20); /* sse4.2 */
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline
+static
+bool
+osxsave_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set(exx[2], 27); /* osxsave */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline
+static
+bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set(exx[1], 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline
+static
+bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set(exx[2], 10); /* vpclmulqdq: leaf 7 ECX bit 10 */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline
+static
+bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set(exx[1], 31); /* avx512-vl */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+inline
+static
+bool
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
+}
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static
+pg_crc32c
+pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
+{
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+	else
+		pg_comp_crc32c = pg_comp_crc32c_sb8;
+
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
-- 
2.34.1

#13Bruce Momjian
bruce@momjian.us
In reply to: Amonson, Paul D (#12)
1 attachment(s)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Tue, Jun 18, 2024 at 05:14:08PM +0000, Amonson, Paul D wrote:

And this bit doesn't look good. The LICENSE file says:

...

// * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following
// disclaimer in the documentation and/or other materials provided
// with the distribution.

...

The second clause essentially says we would have to add a page to our
"documentation and/or other materials" with the contents of the license file.

According to one of Intel’s lawyers, 55 instances of this clause were found when they searched the PostgreSQL repository. Therefore, I assume that this obligation has either been satisfied or determined not to apply, given that the second BSD clause already appears in the PostgreSQL source tree. I might have misunderstood the concern, but the lawyer believes this is a non-issue. Could you please provide more clarifying details about the concern?

Yes, I can confirm that:

grep -Rl 'Redistributions in binary form must reproduce' . | wc -l

reports 54; file list attached.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

Attachments:

files.txt (text/plain; charset=us-ascii)
#14Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#13)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Tue, Jun 18, 2024 at 01:20:50PM -0400, Bruce Momjian wrote:

On Tue, Jun 18, 2024 at 05:14:08PM +0000, Amonson, Paul D wrote:

And this bit doesn't look good. The LICENSE file says:

...

// * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following
// disclaimer in the documentation and/or other materials provided
// with the distribution.

...

The second clause essentially says we would have to add a page to our
"documentation and/or other materials" with the contents of the license file.

According to one of Intel’s lawyers, 55 instances of this clause were found when they searched the PostgreSQL repository. Therefore, I assume that this obligation has either been satisfied or determined not to apply, given that the second BSD clause already appears in the PostgreSQL source tree. I might have misunderstood the concern, but the lawyer believes this is a non-issue. Could you please provide more clarifying details about the concern?

Yes, I can confirm that:

grep -Rl 'Redistributions in binary form must reproduce' . | wc -l

reports 54; file list attached.

I am somewhat embarrassed by this since we made the Intel lawyers find
something that was in our own source code.

First, there is the "advertising clause" in the 4-clause license:

3. All advertising materials mentioning features or use of this
software must display the following acknowledgement: This product
includes software developed by the University of California,
Berkeley and its contributors.

which was disavowed by Berkeley on July 22nd, 1999:

https://elrc-share.eu/static/metashare/licences/BSD-3-Clause.pdf

While the license we are concerned about does not have this clause, it
does have:

2. Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.

I assume that must also include the name of the copyright holder.

I think that means we need to mention The Regents of the University of
California in our copyright notice, which we do. However, our source
tree contains several licenses whose copyright holders are not the
Regents of the University of California, and accepting this AVX-512
patch would add another one. Specifically, I see existing entries for:

Aaron D. Gifford
Board of Trustees of the University of Illinois
David Burren
Eric P. Allman
Jens Schweikhardt
Marko Kreen
Sun Microsystems, Inc.
WIDE Project

Now, some of these are these names plus Berkeley, and some are just the
names above.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#15Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#14)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Tue, Jun 18, 2024 at 02:00:34PM -0400, Bruce Momjian wrote:

While the license we are concerned about does not have this clause, it
does have:

2. Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.

I assume that must also include the name of the copyright holder.

I think that means we need to mention The Regents of the University of
California in our copyright notice, which we do. However, our source
tree contains several licenses whose copyright holders are not the
Regents of the University of California, and accepting this AVX-512
patch would add another one. Specifically, I see existing entries for:

Aaron D. Gifford
Board of Trustees of the University of Illinois
David Burren
Eric P. Allman
Jens Schweikhardt
Marko Kreen
Sun Microsystems, Inc.
WIDE Project

Now, some of these are these names plus Berkeley, and some are just the
names above.

In summary, either we are doing something wrong in how we list
copyrights in our documentation, or we don't need to make any changes for
this Intel patch.

Our license is at:

https://www.postgresql.org/about/licence/

The Intel copyright in the source code is:

* Copyright 2017 The Chromium Authors
* Copyright (c) 2024, Intel(r) Corporation
*
* Use of this source code is governed by a BSD-style license that can be
* found in the Chromium source repository LICENSE file.
* https://chromium.googlesource.com/chromium/src/+/refs/heads/main/LICENSE

and the URL contents are:

// Copyright 2015 The Chromium Authors
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
// * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following disclaimer
// in the documentation and/or other materials provided with the
// distribution.
// * Neither the name of Google LLC nor the names of its
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
// OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Google LLC is added to clause three, and I assume Intel is also covered
by this because it is considered "the names of its contributors", maybe?

It would be good to know exactly what, if any, changes the Intel lawyers
want us to make to our license if we accept this patch.

There are also different versions of clause three in our source tree.
The Postgres license only lists the University of California in our
equivalent of clause three, meaning that there are three-clause BSD
licenses in our source tree that reference entities that we don't
reference in the Postgres license. Oddly, the Postgres license doesn't
even disclaim warranties for the PostgreSQL Global Development Group,
only for Berkeley.

An even bigger issue is that we are distributing 3-clause BSD licensed
software under the Postgres license, which is not the 3-clause BSD
license. I think we were functioning under the assumption that the
licenses are compatible, so can be combined, which is true, but I don't
think we can assume the individual licenses can be covered by our one
license, can we?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#16Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Bruce Momjian (#15)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

It would be good to know exactly what, if any, changes the Intel lawyers want
us to make to our license if we accept this patch.

I asked about this, and there is nothing Intel requires here license-wise. They believe that there is nothing wrong with including 3-clause BSD-like licenses under the PostgreSQL license. They only specified that, for the source file, the applicable license needs to be present either as a link (which was previously discouraged in this thread) or as the full text. Please note that I checked, and there is no SPDX identifier for this specific Chromium license, so the entire text is required.

Thanks,
Paul

#17Bruce Momjian
bruce@momjian.us
In reply to: Amonson, Paul D (#16)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Tue, Jun 25, 2024 at 05:41:12PM +0000, Amonson, Paul D wrote:

It would be good to know exactly what, if any, changes the Intel
lawyers want us to make to our license if we accept this patch.

I asked about this, and there is nothing Intel requires here
license-wise. They believe that there is nothing wrong with including
3-clause BSD-like licenses under the PostgreSQL license. They only
specified that, for the source file, the applicable license needs to be
present either as a link (which was previously discouraged in this
thread) or as the full text. Please note that I checked, and there is
no SPDX identifier for this specific Chromium license, so the entire
text is required.

Okay, that is very interesting. Yes, we will have no problem
reproducing the exact license text in the source code. I think we can
remove the license issue as a blocker for this patch.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#18Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Bruce Momjian (#17)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Okay, that is very interesting. Yes, we will have no problem reproducing the
exact license text in the source code. I think we can remove the license issue
as a blocker for this patch.

Hi,

I was wondering if I could get a review, please. I am interested in the refactor question for the HW capability tests as well as an actual implementation review. I created a commitfest entry for this thread.

Thanks,
Paul

#19Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#7)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Wed, Jun 12, 2024 at 12:37:46PM -0700, Andres Freund wrote:

I wonder if this isn't going in the wrong direction. We're using CRCs for
something they're not well suited for in my understanding - and are paying a
reasonably high price for it, given that even hardware accelerated CRCs aren't
blazingly fast.

I tend to agree, especially that we should be more concerned about all
bytes after a certain point being garbage than bit flips. (I think we
should also care about bit flips, but I hope those are much less common
than half-written WAL records.)

With that I perhaps have established that CRC guarantees aren't useful for us.
But not yet why we should use something else: Given that we already aren't
relying on hard guarantees, we could instead just use a fast hash like xxh3.
https://github.com/Cyan4973/xxHash which is fast both for large and small
amounts of data.

Would it be out of the question to reuse the page checksum code (i.e., an
FNV-1a derivative)? The chart in your link claims that xxh3 is
substantially faster than "FNV64", but I wonder if the latter was
vectorized. I don't know how our CRC-32C implementations (and proposed
implementations) compare, either.
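
For readers unfamiliar with the hash family mentioned above, here is a minimal sketch of plain 32-bit FNV-1a in C. It is not PostgreSQL's page checksum code (which is described above only as a derivative of FNV-1a); the function name `fnv1a_32` and the macro names are made up for this illustration. It only shows the xor-then-multiply step that makes the scalar form cheap but inherently serial.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV parameters. */
#define FNV1A_32_OFFSET_BASIS 0x811c9dc5u
#define FNV1A_32_PRIME        0x01000193u

/*
 * Plain byte-at-a-time 32-bit FNV-1a.  Illustrative only: this is NOT
 * PostgreSQL's page checksum, and the name fnv1a_32 is hypothetical.
 */
uint32_t
fnv1a_32(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	hash = FNV1A_32_OFFSET_BASIS;

	assert(data != NULL || len == 0);

	for (size_t i = 0; i < len; i++)
	{
		hash ^= p[i];				/* mix in one input byte */
		hash *= FNV1A_32_PRIME;		/* multiply modulo 2^32 */
	}
	return hash;
}
```

Each step depends on the previous hash value, so a loop like this cannot be vectorized as written; a fair speed comparison would need a variant that processes several independent lanes.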

--
nathan

#20Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#18)
2 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi,

Here are the latest patches for the accelerated CRC32c algorithm. I did the following to create these refactored patches:

1) From the main branch I moved all x86_64 hardware checks from the various locations into a single location. I did not move any ARM tests as I would have no way to test them for validity. However, an ARM section could be added to my consolidated source files.

Once I had this working and verified that there were no regressions....

2) I ported the AVX-512 crc32c code as a second patch adding the new HW checks into the previously created file for HW checks from patch 0001.
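
As a rough illustration of what consolidated HW checks can look like, the sketch below tests feature bits in an already-fetched CPUID result. The names `cpuid_reg`, `cpuid_bit_set`, and `leaf7_has_avx512_crc_features` are hypothetical and are not taken from the patches. The bit positions are those documented for CPUID leaf 7, subleaf 0 (AVX512F in EBX bit 16, AVX512VL in EBX bit 31, VPCLMULQDQ in ECX bit 10). Taking the register values as a parameter keeps the logic testable on any machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Indexes into a four-element CPUID result array (EAX, EBX, ECX, EDX). */
enum cpuid_reg
{
	REG_EAX = 0,
	REG_EBX = 1,
	REG_ECX = 2,
	REG_EDX = 3
};

/* Test one bit; an unsigned 1 keeps the shift well defined for bit 31. */
bool
cpuid_bit_set(const uint32_t regs[4], enum cpuid_reg reg, int bit)
{
	assert(bit >= 0 && bit < 32);
	return (regs[reg] & (UINT32_C(1) << bit)) != 0;
}

/*
 * Hypothetical combined check over the output of CPUID leaf 7, subleaf 0.
 * A real check would also confirm OSXSAVE and the XGETBV ZMM state bits.
 */
bool
leaf7_has_avx512_crc_features(const uint32_t leaf7[4])
{
	return cpuid_bit_set(leaf7, REG_EBX, 16) &&		/* AVX512F */
		cpuid_bit_set(leaf7, REG_EBX, 31) &&		/* AVX512VL */
		cpuid_bit_set(leaf7, REG_ECX, 10);			/* VPCLMULQDQ */
}
```

Naming the registers instead of using bare array indexes makes it harder to test the right bit in the wrong register.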

I reran all the basic tests to make sure that the performance numbers were within the margin of error when compared to my original findings. This step showed similar numbers (see the original post), around 1.45X on average. I also made sure that, when compiled with the AVX-512 features and run on HW without these features, the Postgres server still worked without throwing illegal instruction exceptions.
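
The runtime fallback described above relies on selecting the implementation through a function pointer on first use. The sketch below shows that pattern in a simplified, portable form. All names (`crc32c_fn`, `crc32c_bitwise`, `crc32c_wide_standin`, `wide_path_supported`, `crc32c_choose`, `crc32c_impl`) are hypothetical; the "wide" path is only a stand-in that reuses the portable code, and the support flag is hard-coded where a real build would probe CPUID and XGETBV.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

/*
 * Portable bit-at-a-time CRC-32C (reflected polynomial 0x82F63B78).
 * Operates on the raw register: the caller supplies the 0xFFFFFFFF
 * initial value and applies the final inversion.
 */
uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	assert(data != NULL || len == 0);

	while (len-- > 0)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Stand-in for a vectorized path; here it simply reuses the portable code. */
uint32_t
crc32c_wide_standin(uint32_t crc, const void *data, size_t len)
{
	return crc32c_bitwise(crc, data, len);
}

/* Assumption: a real build would set this from a CPUID/XGETBV probe. */
static bool wide_path_supported = false;

uint32_t	crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Starts out pointing at the chooser, which replaces it on first call. */
crc32c_fn	crc32c_impl = crc32c_choose;

uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
	crc32c_impl = wide_path_supported ? crc32c_wide_standin : crc32c_bitwise;
	return crc32c_impl(crc, data, len);
}
```

Because the wide path is never installed on hardware that lacks it, no AVX-512 instruction can execute there, provided the compiler flags for those instructions are applied only to the file containing the wide implementation.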

Please review the attached patches.

Thanks,
Paul

Attachments:

0001-v2-Refactor-Move-all-HW-checks-to-common-file.patch (application/octet-stream)
From 16693caca491f9d52cff463dfc85bbbd54df9064 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH] [Refactor] Move all HW checks to common file.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 configure                            |  12 +-
 configure.ac                         |   2 +-
 src/include/port/pg_bitutils.h       |   1 -
 src/include/port/pg_hw_feat_check.h  |  33 ++++++
 src/port/Makefile                    |   9 +-
 src/port/meson.build                 |   2 +-
 src/port/pg_bitutils.c               |  22 +---
 src/port/pg_crc32c_sse42_choose.c    |  27 +----
 src/port/pg_hw_feat_check.c          | 159 +++++++++++++++++++++++++++
 src/port/pg_popcount_avx512_choose.c | 102 -----------------
 10 files changed, 208 insertions(+), 161 deletions(-)
 create mode 100644 src/include/port/pg_hw_feat_check.h
 create mode 100644 src/port/pg_hw_feat_check.c
 delete mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/configure b/configure
index 2abbeb2794..5be6fb4d5f 100755
--- a/configure
+++ b/configure
@@ -14868,7 +14868,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14914,7 +14914,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14938,7 +14938,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14983,7 +14983,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15007,7 +15007,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17674,7 +17674,7 @@ fi
 
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
 
 $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
diff --git a/configure.ac b/configure.ac
index c46ed2c591..2e64f53898 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2090,7 +2090,7 @@ if test x"$host_cpu" = x"x86_64"; then
     PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
     AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
   fi
 fi
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 4d88478c9c..263f27930d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -312,7 +312,6 @@ extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int
  * files.
  */
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
new file mode 100644
index 0000000000..58be900b54
--- /dev/null
+++ b/src/include/port/pg_hw_feat_check.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.h
+ *	  Miscellaneous functions for checking for hardware features at runtime.
+ *
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_hw_feat_check.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_HW_FEAT_CHECK_H
+#define PG_HW_FEAT_CHECK_H
+
+/*
+ * Test to see if all hardware features required by SSE 4.2 crc32c (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_sse42_available(void);
+
+/*
+ * Test to see if all hardware features required by SSE 4.1 POPCNT (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_popcount_available(void);
+
+/*
+ * Test to see if all hardware features required by AVX-512 POPCNT are
+ * available.
+ */
+extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+#endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index db7c02117b..b18710eeef 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_hw_feat_check.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
@@ -93,10 +94,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
-# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
-pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+# all versions of pg_hw_feat_check.o need CFLAGS_XSAVE
+pg_hw_feat_check.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_shlib.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_srv.o:	CFLAGS+=$(CFLAGS_XSAVE)
 
 # all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
 pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
diff --git a/src/port/meson.build b/src/port/meson.build
index ff54b7b53e..f8cafc4bd4 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -86,7 +86,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
-  ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
+  ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
 
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 87f56e82b8..b2823d5732 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -20,7 +20,7 @@
 #endif
 
 #include "port/pg_bitutils.h"
-
+#include "port/pg_hw_feat_check.h"
 
 /*
  * Array giving the position of the left-most set bit for each possible
@@ -109,7 +109,6 @@ static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
@@ -127,25 +126,6 @@ uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask)
 
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
-
 /*
  * These functions get called on the first call to pg_popcount32 etc.
  * They detect whether we can use the asm implementations, and replace
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 56d600f3a9..36e6949362 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,31 +20,8 @@
 
 #include "c.h"
 
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
 #include "port/pg_crc32c.h"
-
-static bool
-pg_crc32c_sse42_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
-}
+#include "port/pg_hw_feat_check.h"
 
 /*
  * This gets called on the first call. It replaces the function pointer
@@ -61,4 +38,4 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	return pg_comp_crc32c(crc, data, len);
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
new file mode 100644
index 0000000000..455005add5
--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,159 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;
+
+/*
+ * Helper function.
+ * Test for a bit being set in an exx_t register.
+ */
+inline static bool is_bit_set_in_exx(exx_t* regs, reg_name ex, int bit)
+{
+	return ((regs[ex] & (1U << bit)) != 0);
+}
+
+/*
+ * x86_64 Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * x86_64 Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline static bool
+osxsave_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 27); /* osxsave */
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+inline static bool
+zmm_regs_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+inline static bool
+avx512_popcnt_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 14) && is_bit_set_in_exx(exx, EBX, 30);
+}
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+bool PGDLLIMPORT pg_popcount_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 23);
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ *
+ * PA: The call to 'osxsave_available' MUST precede the call to
+ *     'zmm_regs_available' function per NB above.
+ */
+bool PGDLLIMPORT pg_popcount_avx512_available(void)
+{
+	return osxsave_available() &&
+		zmm_regs_available() &&
+		avx512_popcnt_available();
+}
+
+/*
+ * Does CPUID say there's support for SSE 4.2?
+ */
+bool PGDLLIMPORT pg_crc32c_sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 20);
+}
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
deleted file mode 100644
index b37107803a..0000000000
--- a/src/port/pg_popcount_avx512_choose.c
+++ /dev/null
@@ -1,102 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_popcount_avx512_choose.c
- *    Test whether we can use the AVX-512 pg_popcount() implementation.
- *
- * Copyright (c) 2024, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *    src/port/pg_popcount_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-#include "c.h"
-
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE_XSAVE_INTRINSICS
-#include <immintrin.h>
-#endif
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
-#include "port/pg_bitutils.h"
-
-/*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
- * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
- */
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
-#endif							/* TRY_POPCNT_FAST */
-- 
2.34.1
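
For readers following the refactoring in the patch above: pg_hw_feat_check.c essentially decodes individual bits of the CPUID result registers and of XCR0. The sketch below is not part of the patch. It is a minimal standalone illustration, assuming GCC or Clang on x86 (for <cpuid.h> and __get_cpuid_count). All helper names (reg_bit_set, leaf1_has_sse42, query_cpuid, and so on) are made up for this example; the bit positions and the 0xe6 mask are the ones the patch uses.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>				/* GCC/Clang only; MSVC would use <intrin.h> */
#endif

/* Indexes into the four-register CPUID result, as in the patch's enum. */
enum { REG_EAX = 0, REG_EBX = 1, REG_ECX = 2, REG_EDX = 3 };

/* Test one bit of one result register. 1U keeps bit 31 well defined. */
bool
reg_bit_set(const unsigned int regs[4], int reg, int bit)
{
	assert(bit >= 0 && bit < 32);
	return (regs[reg] & (1U << bit)) != 0;
}

/* Leaf 1: ECX bit 20 = SSE 4.2, bit 23 = POPCNT, bit 27 = OSXSAVE. */
bool
leaf1_has_sse42(const unsigned int regs[4])
{
	return reg_bit_set(regs, REG_ECX, 20);
}

bool
leaf1_has_popcnt(const unsigned int regs[4])
{
	return reg_bit_set(regs, REG_ECX, 23);
}

bool
leaf1_has_osxsave(const unsigned int regs[4])
{
	return reg_bit_set(regs, REG_ECX, 27);
}

/* Leaf 7, subleaf 0: ECX bit 14 = AVX512-VPOPCNTDQ, EBX bit 30 = AVX512-BW. */
bool
leaf7_has_avx512_popcnt(const unsigned int regs[4])
{
	return reg_bit_set(regs, REG_ECX, 14) && reg_bit_set(regs, REG_EBX, 30);
}

/* XCR0 must have the SSE, AVX, opmask and both ZMM state bits set (0xe6). */
bool
xcr0_enables_zmm(unsigned long long xcr0)
{
	return (xcr0 & 0xe6) == 0xe6;
}

/*
 * Hypothetical query wrapper: fills regs and returns true if the leaf is
 * supported; on non-x86 builds it zeroes regs and returns false.
 */
bool
query_cpuid(unsigned int leaf, unsigned int subleaf, unsigned int regs[4])
{
#if defined(__x86_64__) || defined(__i386__)
	return __get_cpuid_count(leaf, subleaf,
							 &regs[0], &regs[1], &regs[2], &regs[3]) != 0;
#else
	regs[0] = regs[1] = regs[2] = regs[3] = 0;
	return false;
#endif
}
```

Keeping the bit decoding separate from the CPUID query means the decoding can be exercised with hand-built register values, without depending on the CPU the test happens to run on. A caller would do something like query_cpuid(1, 0, regs) followed by leaf1_has_sse42(regs).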

0002-v2-Feat-Add-support-for-the-SIMD-AVX-512-crc32c-algorit.patch (application/octet-stream)
From 0ea0c15d4e8c63fa129595d287d6935175e999a2 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Fri, 9 Aug 2024 08:00:09 -0700
Subject: [PATCH] [Feat] Add support for the SIMD AVX-512 crc32c algorithm.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4                |  48 ++++++
 configure                           | 213 ++++++++++++++++++++-----
 configure.ac                        | 106 ++++++++-----
 meson.build                         |  40 ++++-
 src/include/pg_config.h.in          |   3 +
 src/include/port/pg_crc32c.h        |  23 +++
 src/include/port/pg_hw_feat_check.h |   9 +-
 src/port/Makefile                   |   5 +
 src/port/meson.build                |   4 +
 src/port/pg_crc32c_avx512.c         | 238 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_avx512_choose.c  |  42 +++++
 src/port/pg_hw_feat_check.c         |  71 ++++++++-
 12 files changed, 716 insertions(+), 86 deletions(-)
 create mode 100644 src/port/pg_crc32c_avx512.c
 create mode 100644 src/port/pg_crc32c_avx512_choose.c
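
As background for the *_choose.c files this patch adds and modifies: the runtime-check builds call CRC32C through a function pointer that initially points at a "choose" routine. On the first call that routine picks an implementation, overwrites the pointer, and forwards the call. The following is a rough sketch of that pattern, not code from the patch. fast_path_available and crc32c_fast_stub are hypothetical placeholders for the hardware check and the accelerated routine, and a slow bitwise CRC-32C stands in for the real implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t crc32c_t;
typedef crc32c_t (*crc32c_fn) (crc32c_t crc, const void *data, size_t len);

crc32c_t	crc32c_choose(crc32c_t crc, const void *data, size_t len);

/* The dispatch pointer starts out aimed at the chooser. */
crc32c_fn	crc32c_impl = crc32c_choose;

/* Counts chooser invocations, only to make the one-time behavior visible. */
int			choose_calls = 0;

/* Reference CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
crc32c_t
crc32c_bitwise(crc32c_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	assert(data != NULL || len == 0);
	while (len-- > 0)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
	}
	return crc;
}

/* Hypothetical: a real build would run the CPUID/XGETBV checks here. */
int
fast_path_available(void)
{
	return 0;
}

/* Hypothetical stand-in for an accelerated routine; same result, by design. */
crc32c_t
crc32c_fast_stub(crc32c_t crc, const void *data, size_t len)
{
	return crc32c_bitwise(crc, data, len);
}

/* Runs once: select an implementation, repoint the pointer, forward the call. */
crc32c_t
crc32c_choose(crc32c_t crc, const void *data, size_t len)
{
	choose_calls++;
	crc32c_impl = fast_path_available() ? crc32c_fast_stub : crc32c_bitwise;
	return crc32c_impl(crc, data, len);
}
```

Callers start from 0xFFFFFFFF and XOR the result with 0xFFFFFFFF at the end, which is what the FIN_CRC32C macro in this patch does. With that convention the nine ASCII bytes "123456789" yield the standard CRC-32C check value 0xE3069283, and every call after the first goes straight to the selected routine without re-running the check.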

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..1d33932cb5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -628,6 +628,54 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ----------------------------
+# Check if the compiler supports the AVX-512 and VPCLMULQDQ intrinsics used
+# by the AVX-512 CRC32C implementation, e.g.:
+#
+# _mm512_clmulepi64_epi128, _mm512_ternarylogic_epi64 and _mm_crc32_u64
+# (the test program below exercises a representative subset).
+#
+# Optional compiler flags can be passed as the argument (e.g. -msse4.2
+# -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
+# pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const unsigned long k1k2[[8]] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[[512]];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
 
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
diff --git a/configure b/configure
index 5be6fb4d5f..fca02db11d 100755
--- a/configure
+++ b/configure
@@ -17767,6 +17767,123 @@ fi
 
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and related intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2, -mavx512vl and -mvpclmulqdq if that's required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics_=yes
+else
+  pgac_cv_avx512_crc32_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
+  CFLAGS_CRC=""
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+else
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
+  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17939,31 +18056,42 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -17982,44 +18110,53 @@ $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
 $as_echo "SSE 4.2" >&6; }
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
+$as_echo "AVX 512 with runtime check" >&6; }
+  else
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
 $as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+    else
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      else
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+        else
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+          else
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
+          fi
         fi
       fi
     fi
diff --git a/configure.ac b/configure.ac
index 2e64f53898..ce68dce9d2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2107,6 +2107,17 @@ if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
   PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and related intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2, -mavx512vl and -mvpclmulqdq if that's required.
+PGAC_AVX512_CRC32_INTRINSICS([])
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2152,31 +2163,42 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -2191,29 +2213,35 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
   PG_CRC32C_OBJS="pg_crc32c_sse42.o"
   AC_MSG_RESULT(SSE 4.2)
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    AC_MSG_RESULT(AVX 512 with runtime check)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      AC_MSG_RESULT(SSE 4.2 with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        AC_MSG_RESULT(ARMv8 CRC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
         else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            AC_MSG_RESULT(LoongArch CRCC instructions)
+          else
+            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            AC_MSG_RESULT(slicing-by-8)
+          fi
         fi
       fi
     fi
diff --git a/meson.build b/meson.build
index cd711c6d01..1ddd1bed40 100644
--- a/meson.build
+++ b/meson.build
@@ -2245,6 +2245,34 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
+    avx_prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+  const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+}
+'''
 
     prog = '''
 #include <nmmintrin.h>
@@ -2259,12 +2287,20 @@ int main(void)
 }
 '''
 
-    if cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(avx_prog,
+          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
+          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
+      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
+      cdata.set('USE_AVX512_CRC32C', false)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
+    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
+    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
           args: test_c_args + ['-msse4.2'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 979925cc2e..ea797f13f3 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -739,6 +739,9 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..ade06dbcab 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -49,6 +49,14 @@ typedef uint32 pg_crc32c;
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C)
+/* Use Intel AVX512 instructions. */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_ARMV8_CRC32C)
 /* Use ARMv8 CRC Extension instructions. */
 
@@ -67,6 +75,21 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+/*
+ * Use Intel AVX-512 instructions, but perform a runtime check first to check that
+ * they are available.
+ */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
index 58be900b54..21ee8615e1 100644
--- a/src/include/port/pg_hw_feat_check.h
+++ b/src/include/port/pg_hw_feat_check.h
@@ -30,4 +30,11 @@ extern PGDLLIMPORT bool pg_popcount_available(void);
  * available.
  */
 extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
-#endif							/* PG_HW_FEAT_CHECK_H */
+
+/*
+ * Test to see if all hardware features required by the AVX-512 SIMD
+ * algorithm are available.
+ */
+extern bool pg_crc32c_avx512_available(void);
+
+#endif						/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index b18710eeef..35445d88f1 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -89,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
+pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
+
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index f8cafc4bd4..31d50a7a3b 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,10 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..3815e52ffc
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,238 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+/*
+ * Scalar fallback using the SSE 4.2 CRC32 instructions.
+ *
+ * Handles buffers shorter than 256 bytes, and whatever tail remains
+ * after the AVX-512 folding loops in pg_comp_crc32c_avx512() have
+ * consumed all full 64-byte blocks.
+ */
+pg_attribute_no_sanitize_alignment()
+inline static pg_crc32c
+crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
+{
+	const unsigned char *pend = p + length;
+
+	/*
+	 * Process eight bytes of data at a time.
+	 *
+	 * NB: We do unaligned accesses here. The Intel architecture allows that,
+	 * and performance testing didn't show any performance gain from aligning
+	 * the begin address.
+	 */
+	while (p + 8 <= pend)
+	{
+		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
+		p += 8;
+	}
+
+	/* Process remaining full four bytes if any */
+	if (p + 4 <= pend)
+	{
+		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
+		p += 4;
+	}
+
+	/* Process any remaining bytes one at a time. */
+	while (p < pend)
+	{
+		crc = _mm_crc32_u8(crc, *p);
+		p++;
+	}
+
+	return crc;
+}
+
+/*******************************************************************
+ * pg_comp_crc32c_avx512(): compute the crc32c of the buffer. Buffers of
+ * at least 256 bytes use the AVX-512 path; any remainder, and shorter
+ * buffers, go through crc32c_fallback(). Based on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009
+ *
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+pg_attribute_no_sanitize_alignment()
+inline pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+		/*
+		 * There's at least one block of 256.
+		 */
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_loadu_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		 * Parallel fold blocks of 256, if any.
+		 */
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_loadu_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes.
+	 */
+	return crc32c_fallback(crc, input, length);
+}
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
new file mode 100644
index 0000000000..4f11c278be
--- /dev/null
+++ b/src/port/pg_crc32c_avx512_choose.c
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512_choose.c
+ *	  Choose between Intel AVX-512 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel
+ * AVX-512. If it does, use the special AVX-512 instructions for CRC-32C
+ * computation. Otherwise, fall back to the pure software implementation
+ * (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include "port/pg_crc32c.h"
+#include "port/pg_hw_feat_check.h"
+
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static pg_crc32c
+pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
+{
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+	else
+		pg_comp_crc32c = pg_comp_crc32c_sb8;
+
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 455005add5..35d6f9cdb1 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -132,9 +132,60 @@ bool PGDLLIMPORT pg_popcount_available(void)
 	return is_bit_set_in_exx(exx, ECX, 23);
  }
 
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline static bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline static bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, ECX, 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline static bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 31); /* avx512-vl */
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline static bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set_in_exx(exx, ECX, 20); /* sse4.2 */
+}
+
+/****************************************************************************/
+/*                               Public API                                 */
+/****************************************************************************/
  /*
-  * Returns true if the CPU supports the instructions required for the AVX-512
-  * pg_popcount() implementation.
+  * Returns true if the CPU supports the instructions required for the
+  * AVX-512 pg_popcount() implementation.
   *
   * PA: The call to 'osxsave_available' MUST preceed the call to
   *     'zmm_regs_available' function per NB above.
@@ -151,9 +202,17 @@ bool PGDLLIMPORT pg_popcount_avx512_available(void)
  */
 bool PGDLLIMPORT pg_crc32c_sse42_available(void)
 {
-	exx_t exx[4] = {0, 0, 0, 0};
-
-	pg_getcpuid(1, exx);
+	return sse42_available();
+}
 
-	return is_bit_set_in_exx(exx, ECX, 20);
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+inline bool
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
 }
-- 
2.34.1
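
For readers following the runtime check in the patch above, here is a minimal, portable sketch of how the individual CPUID and XGETBV bits combine into one yes/no answer. The bit positions and the 0xe6 mask are taken from the patch; the helper names (bit_is_set, crc32c_avx512_usable) and the idea of passing in already-fetched register values are assumptions made for illustration only, so the logic can be exercised without executing CPUID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Register order as returned by CPUID: EAX, EBX, ECX, EDX. */
enum { REG_EAX = 0, REG_EBX = 1, REG_ECX = 2, REG_EDX = 3 };

/* Unsigned shift, so that testing bit 31 is well defined. */
static bool
bit_is_set(const uint32_t regs[4], int reg, int bit)
{
	return (regs[reg] & (UINT32_C(1) << bit)) != 0;
}

/*
 * Hypothetical helper, not part of the patch.
 * leaf1 = CPUID(1), leaf7 = CPUID(7, 0), xcr0 = XGETBV(0).
 */
static bool
crc32c_avx512_usable(const uint32_t leaf1[4], const uint32_t leaf7[4],
					 uint64_t xcr0)
{
	if (!bit_is_set(leaf1, REG_ECX, 20))	/* SSE 4.2 */
		return false;
	if (!bit_is_set(leaf1, REG_ECX, 27))	/* OSXSAVE: needed before xcr0 is trusted */
		return false;
	if ((xcr0 & 0xe6) != 0xe6)				/* SSE, AVX, opmask and ZMM state enabled */
		return false;
	return bit_is_set(leaf7, REG_EBX, 16) &&	/* AVX512F */
		bit_is_set(leaf7, REG_EBX, 31) &&		/* AVX512VL */
		bit_is_set(leaf7, REG_ECX, 10);			/* VPCLMULQDQ */
}
```

Keeping the decision separate from the CPUID calls is only a convenience for this sketch; the patch performs the same checks inline in pg_crc32c_avx512_available().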

#21Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#20)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

Thanks for the new patches.

On Thu, Aug 22, 2024 at 03:14:32PM +0000, Amonson, Paul D wrote:

I reran all the basic tests again to make sure that the performance
numbers were within the margin of error when compared to my original
finding. This step showed similar numbers (see original post) around 1.45X
on average. I also made sure that if compiled with the AVX-512 features
and ran on HW without these features the Postgres server still worked
without throwing illegal instruction exceptions.

Upthread [0]/messages/by-id/20240612201135.kk77tiqcux77lgev@awork3.anarazel.de, Andres suggested dispatching to a different implementation
for compile-time-known small lengths. Have you looked into that? In your
original post, you noted a 14% regression for records smaller than 256
bytes, which is not an uncommon case for Postgres. IMO we should try to
mitigate that as much as possible.

[0]: /messages/by-id/20240612201135.kk77tiqcux77lgev@awork3.anarazel.de

--
nathan
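
To make the suggestion concrete, here is a rough sketch of what dispatching on a compile-time-known small length could look like. It is not taken from any posted patch. The macro names (COMP_CRC32C_SKETCH, LEN_IS_CONSTANT, CRC32C_VECTOR_MIN) and the two stand-in functions are hypothetical, and a plain bitwise CRC-32C replaces the real SSE 4.2 and AVX-512 kernels so the example compiles anywhere. It assumes a GCC-compatible compiler for __builtin_constant_p and simply never takes the small path elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (reflected polynomial 0x82F63B78); stand-in for the real kernels. */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Counters exist only so the chosen path can be observed in this sketch. */
static int	small_calls;
static int	large_calls;

/* Hypothetical stand-in for a direct call to the scalar implementation. */
static uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
	small_calls++;
	return crc32c_bitwise(crc, data, len);
}

/* Hypothetical stand-in for the runtime-selected (function pointer) path. */
static uint32_t
crc32c_large(uint32_t crc, const void *data, size_t len)
{
	large_calls++;
	return crc32c_bitwise(crc, data, len);
}

#define CRC32C_VECTOR_MIN 256

#if defined(__GNUC__)
#define LEN_IS_CONSTANT(len) __builtin_constant_p(len)
#else
#define LEN_IS_CONSTANT(len) 0
#endif

/* When the compiler can prove the length is small, the test folds away. */
#define COMP_CRC32C_SKETCH(crc, data, len) \
	((crc) = (LEN_IS_CONSTANT(len) && (len) < CRC32C_VECTOR_MIN) \
		? crc32c_small((crc), (data), (len)) \
		: crc32c_large((crc), (data), (len)))
```

Because the condition is resolved at compile time when the length is a constant, callers with fixed small sizes would pay no extra branch, while variable-length callers keep the existing indirect call.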

#22Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#21)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Upthread [0], Andres suggested dispatching to a different implementation for
compile-time-known small lengths. Have you looked into that? In your
original post, you noted a 14% regression for records smaller than 256 bytes,
which is not an uncommon case for Postgres. IMO we should try to mitigate
that as much as possible.

So, without adding even more conditional tests (causing more latency), I can expose a new macro called COMP_CRC32C_SMALL that can be called from locations where the size is known to be 20 bytes or less (or any fixed size less than 256). Other than that, I know of no method to pre-decide which function to call based on input size. Are there any concrete thoughts on this?

Paul

#23Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#21)
3 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Upthread [0], Andres suggested dispatching to a different implementation for
compile-time-known small lengths. Have you looked into that? In your
original post, you noted a 14% regression for records smaller than 256 bytes,
which is not an uncommon case for Postgres. IMO we should try to mitigate
that as much as possible.

Hi,

Ok, I added a patch that exposes a new macro CRC32C_COMP_SMALL for targeted fixed-size (< 256 bytes) use cases in Postgres. As for mitigating the regression in general, I have not been able to work up a fallback (i.e. < 256 bytes) that doesn't involve runtime checks, which cause latency. I also attempted to change the AVX-512 fallback from the current algorithm in the AVX-512 implementation to the original SSE implementation, but I am not seeing any real performance difference for this use case.

I am open to any other suggestions.

Paul
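
As a reading aid for the fallback discussion, the following sketch counts how the posted pg_comp_crc32c_avx512() would divide a buffer of a given length between the 256-byte fold loop, the 64-byte fold loop, and the scalar fallback. The loop structure mirrors the patch; the crc_split struct and split_length() are hypothetical names used only here, and no CRC is actually computed.

```c
#include <stddef.h>

typedef struct
{
	size_t		blocks256;	/* 256-byte blocks: initial load plus 4-way fold loop */
	size_t		blocks64;	/* 64-byte blocks taken by the single-fold loop */
	size_t		tail8;		/* 8-byte words taken by the scalar fallback */
	size_t		tail4;		/* at most one 4-byte word */
	size_t		tail1;		/* remaining single bytes */
} crc_split;

/* Hypothetical helper: mirrors the length handling, computes no CRC. */
static crc_split
split_length(size_t length)
{
	crc_split	s = {0, 0, 0, 0, 0};

	if (length >= 256)
	{
		/* The vector path is entered only with at least one full 256-byte block. */
		s.blocks256 = length / 256;
		length %= 256;
		s.blocks64 = length / 64;
		length %= 64;
	}

	/* Everything else goes through the scalar fallback. */
	s.tail8 = length / 8;
	length %= 8;
	s.tail4 = length / 4;
	s.tail1 = length % 4;
	return s;
}
```

For example, an 8192-byte page is covered entirely by 32 blocks of 256 bytes, while a 255-byte record never reaches the vector path and is processed as 31 eight-byte words, one four-byte word and three single bytes.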

Attachments:

0001-v3-Refactor-Move-all-HW-checks-to-common-file.patch (application/octet-stream)
From 16693caca491f9d52cff463dfc85bbbd54df9064 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH] [Refactor] Move all HW checks to common file.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 configure                            |  12 +-
 configure.ac                         |   2 +-
 src/include/port/pg_bitutils.h       |   1 -
 src/include/port/pg_hw_feat_check.h  |  33 ++++++
 src/port/Makefile                    |   9 +-
 src/port/meson.build                 |   2 +-
 src/port/pg_bitutils.c               |  22 +---
 src/port/pg_crc32c_sse42_choose.c    |  27 +----
 src/port/pg_hw_feat_check.c          | 159 +++++++++++++++++++++++++++
 src/port/pg_popcount_avx512_choose.c | 102 -----------------
 10 files changed, 208 insertions(+), 161 deletions(-)
 create mode 100644 src/include/port/pg_hw_feat_check.h
 create mode 100644 src/port/pg_hw_feat_check.c
 delete mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/configure b/configure
index 2abbeb2794..5be6fb4d5f 100755
--- a/configure
+++ b/configure
@@ -14868,7 +14868,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14914,7 +14914,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14938,7 +14938,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14983,7 +14983,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15007,7 +15007,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17674,7 +17674,7 @@ fi
 
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
 
 $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
diff --git a/configure.ac b/configure.ac
index c46ed2c591..2e64f53898 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2090,7 +2090,7 @@ if test x"$host_cpu" = x"x86_64"; then
     PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
     AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
   fi
 fi
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 4d88478c9c..263f27930d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -312,7 +312,6 @@ extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int
  * files.
  */
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
new file mode 100644
index 0000000000..58be900b54
--- /dev/null
+++ b/src/include/port/pg_hw_feat_check.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.h
+ *	  Miscellaneous functions for checking for hardware features at runtime.
+ *
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_hw_feat_check.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_HW_FEAT_CHECK_H
+#define PG_HW_FEAT_CHECK_H
+
+/*
+ * Test to see if all hardware features required by SSE 4.2 crc32c (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_sse42_available(void);
+
+/*
+ * Test to see if all hardware features required by SSE 4.1 POPCNT (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_popcount_available(void);
+
+/*
+ * Test to see if all hardware features required by AVX-512 POPCNT are
+ * available.
+ */
+extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+#endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index db7c02117b..b18710eeef 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_hw_feat_check.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
@@ -93,10 +94,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
-# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
-pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+# all versions of pg_hw_feat_check.o need CFLAGS_XSAVE
+pg_hw_feat_check.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_shlib.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_srv.o:	CFLAGS+=$(CFLAGS_XSAVE)
 
 # all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
 pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
diff --git a/src/port/meson.build b/src/port/meson.build
index ff54b7b53e..f8cafc4bd4 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -86,7 +86,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
-  ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
+  ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
 
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 87f56e82b8..b2823d5732 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -20,7 +20,7 @@
 #endif
 
 #include "port/pg_bitutils.h"
-
+#include "port/pg_hw_feat_check.h"
 
 /*
  * Array giving the position of the left-most set bit for each possible
@@ -109,7 +109,6 @@ static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
@@ -127,25 +126,6 @@ uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask)
 
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
-
 /*
  * These functions get called on the first call to pg_popcount32 etc.
  * They detect whether we can use the asm implementations, and replace
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 56d600f3a9..36e6949362 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,31 +20,8 @@
 
 #include "c.h"
 
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
 #include "port/pg_crc32c.h"
-
-static bool
-pg_crc32c_sse42_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
-}
+#include "port/pg_hw_feat_check.h"
 
 /*
  * This gets called on the first call. It replaces the function pointer
@@ -61,4 +38,4 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	return pg_comp_crc32c(crc, data, len);
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
new file mode 100644
index 0000000000..455005add5
--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,159 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;
+
+/*
+ * Helper function.
+ * Test for a bit being set in a exx_t register.
+ */
+inline static bool is_bit_set_in_exx(exx_t* regs, reg_name ex, int bit)
+{
+	return ((regs[ex] & (1U << bit)) != 0);
+}
+
+/*
+ * x86_64 Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * x86_64 Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline static bool
+osxsave_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 27); /* osxsave */
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+inline static bool
+zmm_regs_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+inline static bool
+avx512_popcnt_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 14) && is_bit_set_in_exx(exx, EBX, 30);
+}
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+bool PGDLLIMPORT pg_popcount_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 23);
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ *
+ * PA: The call to 'osxsave_available' MUST precede the call to the
+ *     'zmm_regs_available' function per NB above.
+ */
+bool PGDLLIMPORT pg_popcount_avx512_available(void)
+{
+	return osxsave_available() &&
+			zmm_regs_available() &&
+			avx512_popcnt_available();
+}
+
+/*
+ * Does CPUID say there's support for SSE 4.2?
+ */
+bool PGDLLIMPORT pg_crc32c_sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 20);
+}
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
deleted file mode 100644
index b37107803a..0000000000
--- a/src/port/pg_popcount_avx512_choose.c
+++ /dev/null
@@ -1,102 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_popcount_avx512_choose.c
- *    Test whether we can use the AVX-512 pg_popcount() implementation.
- *
- * Copyright (c) 2024, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *    src/port/pg_popcount_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-#include "c.h"
-
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE_XSAVE_INTRINSICS
-#include <immintrin.h>
-#endif
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
-#include "port/pg_bitutils.h"
-
-/*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
- * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
- */
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
-#endif							/* TRY_POPCNT_FAST */
-- 
2.34.1
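
For readers following the refactoring in the patch above: the three-step check in pg_popcount_avx512_available() (OSXSAVE, then XCR0, then the CPUID leaf 7 feature bits) can be sketched as a pure function. This is only a minimal illustration, not code from the patch. The names cpu_regs, reg_bit_set and avx512_popcnt_usable are hypothetical, and the register values are passed in as arguments instead of being read with CPUID/XGETBV, so the sketch compiles on any platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the four CPUID output registers. */
typedef struct
{
	uint32_t	eax;
	uint32_t	ebx;
	uint32_t	ecx;
	uint32_t	edx;
} cpu_regs;

/* Test a single bit of a 32-bit register value. */
static bool
reg_bit_set(uint32_t reg, unsigned bit)
{
	assert(bit < 32);
	return (reg & (UINT32_C(1) << bit)) != 0;
}

/*
 * Same ordering as in the patch: OSXSAVE must be confirmed before the XCR0
 * value is trusted, because XGETBV is only usable when OSXSAVE is set.
 * A real caller would read xcr0 only after the first test has passed.
 *
 * leaf1 = CPUID leaf 1, leaf7 = CPUID leaf 7 subleaf 0.
 */
static bool
avx512_popcnt_usable(cpu_regs leaf1, uint64_t xcr0, cpu_regs leaf7)
{
	if (!reg_bit_set(leaf1.ecx, 27))	/* OSXSAVE */
		return false;
	if ((xcr0 & 0xe6) != 0xe6)	/* SSE, AVX, opmask and ZMM state enabled */
		return false;
	return reg_bit_set(leaf7.ecx, 14) &&	/* AVX512-VPOPCNTDQ */
		reg_bit_set(leaf7.ebx, 30);	/* AVX512-BW */
}
```

Keeping the bit tests separate from the hardware queries makes the ordering requirement easy to exercise with made-up register values.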

Attachment: 0002-v3-Feat-Add-support-for-the-SIMD-AVX-512-crc32c-algorit.patch (application/octet-stream)
From 6751e8a6114ce5ca9920c4e18ec2d2a48278bdde Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Fri, 9 Aug 2024 08:00:09 -0700
Subject: [PATCH] [Feat] Add support for the SIMD AVX-512 crc32c algorithm.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4                |  48 ++++++
 configure                           | 213 ++++++++++++++++++++-----
 configure.ac                        | 106 +++++++-----
 meson.build                         |  40 ++++-
 src/include/pg_config.h.in          |   3 +
 src/include/port/pg_crc32c.h        |  23 +++
 src/include/port/pg_hw_feat_check.h |   9 +-
 src/port/Makefile                   |   5 +
 src/port/meson.build                |   4 +
 src/port/pg_crc32c_avx512.c         | 239 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_avx512_choose.c  |  42 +++++
 src/port/pg_hw_feat_check.c         |  71 ++++++++-
 12 files changed, 717 insertions(+), 86 deletions(-)
 create mode 100644 src/port/pg_crc32c_avx512.c
 create mode 100644 src/port/pg_crc32c_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..1d33932cb5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -628,6 +628,54 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ----------------------------
+# Check if the compiler supports the AVX-512 and VPCLMULQDQ intrinsics used
+# by the AVX-512 CRC-32C implementation, for example
+# _mm512_clmulepi64_epi128 and _mm512_ternarylogic_epi64.
+#
+# The test program also uses the SSE 4.2 intrinsic _mm_crc32_u64, so the
+# 8-byte CRC variant is checked here as well.
+#
+# Optional compiler flags can be passed as the argument (e.g. -msse4.2
+# -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
+# pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const unsigned long k1k2[[8]] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[[512]];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
 
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
diff --git a/configure b/configure
index 5be6fb4d5f..fca02db11d 100755
--- a/configure
+++ b/configure
@@ -17767,6 +17767,123 @@ fi
 
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2 -mavx512vl -mvpclmulqdq if that's required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics_=yes
+else
+  pgac_cv_avx512_crc32_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
+  CFLAGS_CRC=""
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+else
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
+  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17939,31 +18056,42 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -17982,44 +18110,53 @@ $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
 $as_echo "SSE 4.2" >&6; }
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
+$as_echo "AVX 512 with runtime check" >&6; }
+  else
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
 $as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+    else
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      else
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+        else
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+          else
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
+          fi
         fi
       fi
     fi
diff --git a/configure.ac b/configure.ac
index 2e64f53898..ce68dce9d2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2107,6 +2107,17 @@ if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
   PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2 -mavx512vl -mvpclmulqdq if that's required.
+PGAC_AVX512_CRC32_INTRINSICS([])
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2152,31 +2163,42 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -2191,29 +2213,35 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
   PG_CRC32C_OBJS="pg_crc32c_sse42.o"
   AC_MSG_RESULT(SSE 4.2)
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    AC_MSG_RESULT(AVX 512 with runtime check)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      AC_MSG_RESULT(SSE 4.2 with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        AC_MSG_RESULT(ARMv8 CRC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
         else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            AC_MSG_RESULT(LoongArch CRCC instructions)
+          else
+            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            AC_MSG_RESULT(slicing-by-8)
+          fi
         fi
       fi
     fi
diff --git a/meson.build b/meson.build
index cd711c6d01..1ddd1bed40 100644
--- a/meson.build
+++ b/meson.build
@@ -2245,6 +2245,34 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
+    avx_prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+  const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+}
+'''
 
     prog = '''
 #include <nmmintrin.h>
@@ -2259,12 +2287,20 @@ int main(void)
 }
 '''
 
-    if cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(avx_prog,
+          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
+          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
+      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
+      cdata.set('USE_AVX512_CRC32C', false)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
+    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
+    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
           args: test_c_args + ['-msse4.2'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 979925cc2e..ea797f13f3 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -739,6 +739,9 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..ade06dbcab 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -49,6 +49,14 @@ typedef uint32 pg_crc32c;
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C)
+/* Use Intel AVX-512 instructions. */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_ARMV8_CRC32C)
 /* Use ARMv8 CRC Extension instructions. */
 
@@ -67,6 +75,21 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+/*
+ * Use Intel AVX-512 instructions, but perform a runtime check first to check that
+ * they are available.
+ */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
index 58be900b54..21ee8615e1 100644
--- a/src/include/port/pg_hw_feat_check.h
+++ b/src/include/port/pg_hw_feat_check.h
@@ -30,4 +30,11 @@ extern PGDLLIMPORT bool pg_popcount_available(void);
  * available.
  */
 extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
-#endif							/* PG_HW_FEAT_CHECK_H */
+
+/*
+ * Test to see if all hardware features required by the AVX-512 SIMD
+ * algorithm are available.
+ */
+extern bool pg_crc32c_avx512_available(void);
+
+#endif						/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index b18710eeef..35445d88f1 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -89,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
+pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
+
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index f8cafc4bd4..31d50a7a3b 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,10 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..be42a34a73
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,239 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+/*
+ * Process eight bytes of data at a time.
+ *
+ * NB: We do unaligned accesses here. The Intel architecture allows that,
+ * and performance testing didn't show any performance gain from aligning
+ * the begin address.
+ */
+pg_attribute_no_sanitize_alignment()
+inline static pg_crc32c
+crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
+{
+	const unsigned char *pend = p + length;
+
+	/*
+	 * Process eight bytes of data at a time.
+	 *
+	 * NB: We do unaligned accesses here. The Intel architecture allows that,
+	 * and performance testing didn't show any performance gain from aligning
+	 * the begin address.
+	 */
+	while (p + 8 <= pend)
+	{
+		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
+		p += 8;
+	}
+
+	/* Process remaining full four bytes if any */
+	if (p + 4 <= pend)
+	{
+		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
+		p += 4;
+	}
+
+	/* Process any remaining bytes one at a time. */
+	while (p < pend)
+	{
+		crc = _mm_crc32_u8(crc, *p);
+		p++;
+	}
+
+	return crc;
+}
+
+/*******************************************************************
+ * pg_comp_crc32c_avx512(): compute the crc32c of the buffer. The AVX-512
+ * path is used only when the buffer length is at least 256 bytes; any
+ * remaining tail bytes are handled by crc32c_fallback(). Based on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009
+ *
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+pg_attribute_no_sanitize_alignment()
+inline pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+
+		/*
+		* There's at least one block of 256.
+		*/
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_loadu_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		* Parallel fold blocks of 256, if any.
+		*/
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_loadu_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes.
+	 */
+	return crc32c_fallback(crc, input, length);
+}
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
new file mode 100644
index 0000000000..4f11c278be
--- /dev/null
+++ b/src/port/pg_crc32c_avx512_choose.c
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512_choose.c
+ *	  Choose between Intel AVX-512 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel
+ * AVX-512. If it does, use the special AVX-512 instructions for CRC-32C
+ * computation. Otherwise, fall back to the pure software implementation
+ * (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include "port/pg_crc32c.h"
+#include "port/pg_hw_feat_check.h"
+
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static pg_crc32c
+pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
+{
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+	else
+		pg_comp_crc32c = pg_comp_crc32c_sb8;
+
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 455005add5..35d6f9cdb1 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -132,9 +132,60 @@ bool PGDLLIMPORT pg_popcount_available(void)
 	return is_bit_set_in_exx(exx, ECX, 23);
  }
 
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline static bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline static bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, ECX, 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline static bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 31); /* avx512-vl */
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline static bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set_in_exx(exx, ECX, 20); /* sse4.2 */
+}
+
+/****************************************************************************/
+/*                               Public API                                 */
+/****************************************************************************/
  /*
-  * Returns true if the CPU supports the instructions required for the AVX-512
-  * pg_popcount() implementation.
+  * Returns true if the CPU supports the instructions required for the
+  * AVX-512 pg_popcount() implementation.
   *
   * PA: The call to 'osxsave_available' MUST preceed the call to
   *     'zmm_regs_available' function per NB above.
@@ -151,9 +202,17 @@ bool PGDLLIMPORT pg_popcount_avx512_available(void)
  */
 bool PGDLLIMPORT pg_crc32c_sse42_available(void)
 {
-	exx_t exx[4] = {0, 0, 0, 0};
-
-	pg_getcpuid(1, exx);
+	return sse42_available();
+}
 
-	return is_bit_set_in_exx(exx, ECX, 20);
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+inline bool
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
 }
-- 
2.34.1

Attachment: 0003-v3-Feat-Targeted-use-of-legacy-crc32c.patch (application/octet-stream)
From 69aaf3604708dc9b704d58ba8ef5592c5957e25b Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Mon, 26 Aug 2024 09:18:04 -0700
Subject: [PATCH] [Feat] Targeted use of legacy crc32c.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 configure                                   |  2 +-
 configure.ac                                |  2 +-
 src/backend/access/transam/xlog.c           |  8 ++++----
 src/backend/access/transam/xlogreader.c     |  2 +-
 src/backend/replication/logical/origin.c    |  8 ++++----
 src/backend/replication/logical/snapbuild.c |  4 ++--
 src/backend/utils/cache/relmapper.c         |  4 ++--
 src/bin/pg_resetwal/pg_resetwal.c           |  2 +-
 src/bin/pg_rewind/pg_rewind.c               |  2 +-
 src/common/controldata_utils.c              |  4 ++--
 src/include/port/pg_crc32c.h                | 17 +++++++++--------
 src/port/meson.build                        |  2 ++
 12 files changed, 30 insertions(+), 27 deletions(-)

diff --git a/configure b/configure
index fca02db11d..f78b4c2890 100755
--- a/configure
+++ b/configure
@@ -18114,7 +18114,7 @@ else
 
 $as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o pg_crc32c_sse42.o pg_crc32c_sb8.o"
     { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
 $as_echo "AVX 512 with runtime check" >&6; }
   else
diff --git a/configure.ac b/configure.ac
index ce68dce9d2..13172a0295 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2215,7 +2215,7 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
 else
   if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
     AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o pg_crc32c_sse42.o pg_crc32c_sb8.o"
     AC_MSG_RESULT(AVX 512 with runtime check)
   else
     if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ee0fb0e28f..23c465d08f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -910,7 +910,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * header.
 		 */
 		rdata_crc = rechdr->xl_crc;
-		COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
+		COMP_CRC32C_SMALL(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
 		FIN_CRC32C(rdata_crc);
 		rechdr->xl_crc = rdata_crc;
 
@@ -4258,7 +4258,7 @@ WriteControlFile(void)
 
 	/* Contents are protected with a CRC */
 	INIT_CRC32C(ControlFile->crc);
-	COMP_CRC32C(ControlFile->crc,
+	COMP_CRC32C_SMALL(ControlFile->crc,
 				(char *) ControlFile,
 				offsetof(ControlFileData, crc));
 	FIN_CRC32C(ControlFile->crc);
@@ -4376,7 +4376,7 @@ ReadControlFile(void)
 
 	/* Now check the CRC. */
 	INIT_CRC32C(crc);
-	COMP_CRC32C(crc,
+	COMP_CRC32C_SMALL(crc,
 				(char *) ControlFile,
 				offsetof(ControlFileData, crc));
 	FIN_CRC32C(crc);
@@ -5101,7 +5101,7 @@ BootStrapXLOG(uint32 data_checksum_version)
 
 	INIT_CRC32C(crc);
 	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
-	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
+	COMP_CRC32C_SMALL(crc, (char *) record, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32C(crc);
 	record->xl_crc = crc;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 0c5e040a94..fdededf9a4 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1200,7 +1200,7 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 	INIT_CRC32C(crc);
 	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
 	/* include the record header last */
-	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
+	COMP_CRC32C_SMALL(crc, (char *) record, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(record->xl_crc, crc))
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 419e4814f0..3590f75c77 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -615,7 +615,7 @@ CheckPointReplicationOrigin(void)
 				 errmsg("could not write to file \"%s\": %m",
 						tmppath)));
 	}
-	COMP_CRC32C(crc, &magic, sizeof(magic));
+	COMP_CRC32C_SMALL(crc, &magic, sizeof(magic));
 
 	/* prevent concurrent creations/drops */
 	LWLockAcquire(ReplicationOriginLock, LW_SHARED);
@@ -658,7 +658,7 @@ CheckPointReplicationOrigin(void)
 							tmppath)));
 		}
 
-		COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+		COMP_CRC32C_SMALL(crc, &disk_state, sizeof(disk_state));
 	}
 
 	LWLockRelease(ReplicationOriginLock);
@@ -750,7 +750,7 @@ StartupReplicationOrigin(void)
 					 errmsg("could not read file \"%s\": read %d of %zu",
 							path, readBytes, sizeof(magic))));
 	}
-	COMP_CRC32C(crc, &magic, sizeof(magic));
+	COMP_CRC32C_SMALL(crc, &magic, sizeof(magic));
 
 	if (magic != REPLICATION_STATE_MAGIC)
 		ereport(PANIC,
@@ -790,7 +790,7 @@ StartupReplicationOrigin(void)
 							path, readBytes, sizeof(disk_state))));
 		}
 
-		COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
+		COMP_CRC32C_SMALL(crc, &disk_state, sizeof(disk_state));
 
 		if (last_state == max_replication_slots)
 			ereport(PANIC,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ae676145e6..6362dfabeb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1741,7 +1741,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	/* update catchange only on disk data */
 	ondisk->builder.catchange.xcnt = catchange_xcnt;
 
-	COMP_CRC32C(ondisk->checksum,
+	COMP_CRC32C_SMALL(ondisk->checksum,
 				&ondisk->builder,
 				sizeof(SnapBuild));
 
@@ -1917,7 +1917,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 	/* read SnapBuild */
 	SnapBuildRestoreContents(fd, (char *) &ondisk.builder, sizeof(SnapBuild), path);
-	COMP_CRC32C(checksum, &ondisk.builder, sizeof(SnapBuild));
+	COMP_CRC32C_SMALL(checksum, &ondisk.builder, sizeof(SnapBuild));
 
 	/* restore committed xacts information */
 	if (ondisk.builder.committed.xcnt > 0)
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 48d344ae3f..dbc1a4a1a6 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -854,7 +854,7 @@ read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
 
 	/* verify the CRC */
 	INIT_CRC32C(crc);
-	COMP_CRC32C(crc, (char *) map, offsetof(RelMapFile, crc));
+	COMP_CRC32C_SMALL(crc, (char *) map, offsetof(RelMapFile, crc));
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(crc, map->crc))
@@ -910,7 +910,7 @@ write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
 		elog(ERROR, "attempt to write bogus relation mapping");
 
 	INIT_CRC32C(newmap->crc);
-	COMP_CRC32C(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
+	COMP_CRC32C_SMALL(newmap->crc, (char *) newmap, offsetof(RelMapFile, crc));
 	FIN_CRC32C(newmap->crc);
 
 	/*
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..c09ae27dfe 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -593,7 +593,7 @@ read_controlfile(void)
 	{
 		/* Check the CRC. */
 		INIT_CRC32C(crc);
-		COMP_CRC32C(crc,
+		COMP_CRC32C_SMALL(crc,
 					buffer,
 					offsetof(ControlFileData, crc));
 		FIN_CRC32C(crc);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 323c35646c..ecfe340f00 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -1004,7 +1004,7 @@ checkControlFile(ControlFileData *ControlFile)
 
 	/* Calculate CRC */
 	INIT_CRC32C(crc);
-	COMP_CRC32C(crc, (char *) ControlFile, offsetof(ControlFileData, crc));
+	COMP_CRC32C_SMALL(crc, (char *) ControlFile, offsetof(ControlFileData, crc));
 	FIN_CRC32C(crc);
 
 	/* And simply compare it */
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 82309b2510..1cd9194120 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -134,7 +134,7 @@ retry:
 
 	/* Check the CRC. */
 	INIT_CRC32C(crc);
-	COMP_CRC32C(crc,
+	COMP_CRC32C_SMALL(crc,
 				(char *) ControlFile,
 				offsetof(ControlFileData, crc));
 	FIN_CRC32C(crc);
@@ -198,7 +198,7 @@ update_controlfile(const char *DataDir,
 
 	/* Recalculate CRC of control file */
 	INIT_CRC32C(ControlFile->crc);
-	COMP_CRC32C(ControlFile->crc,
+	COMP_CRC32C_SMALL(ControlFile->crc,
 				(char *) ControlFile,
 				offsetof(ControlFileData, crc));
 	FIN_CRC32C(ControlFile->crc);
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index ade06dbcab..263a9ccaaf 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -40,12 +40,12 @@ typedef uint32 pg_crc32c;
 /* The INIT and EQ macros are the same for all implementations. */
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 #if defined(USE_SSE42_CRC32C)
 /* Use Intel SSE4.2 instructions. */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
@@ -53,7 +53,6 @@ extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t le
 /* Use Intel AVX512 instructions. */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
 
@@ -62,7 +61,6 @@ extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t l
 
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 
@@ -71,7 +69,6 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
@@ -83,13 +80,16 @@ extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_
  */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
 
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
 
+extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C_SMALL(crc, data, len) \
+	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
@@ -98,7 +98,6 @@ extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t l
  */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
@@ -121,13 +120,15 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
 #ifdef WORDS_BIGENDIAN
+#undef FIN_CRC32C
 #define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
-#else
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 #endif
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
+#endif
 
+#if !defined(COMP_CRC32C_SMALL)
+#define COMP_CRC32C_SMALL(crc, data, len) COMP_CRC32C((crc), (data), (len))
 #endif
 
 #endif							/* PG_CRC32C_H */
diff --git a/src/port/meson.build b/src/port/meson.build
index 31d50a7a3b..6502687e7d 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -86,6 +86,8 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
   ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_sse42', 'USE_AVX512_CRC32C'],
+  ['pg_crc32c_sse42', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-- 
2.34.1

#24Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#23)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Mon, Aug 26, 2024 at 05:09:35PM +0000, Amonson, Paul D wrote:

Ok I added a patch that exposed a new macro COMP_CRC32C_SMALL for
targeted fixed size < 256 use cases in Postgres. As for mitigating the
regression in general, I have not been able to work up a fallback (i.e.
<256 bytes) that doesn't involve runtime checks which cause latency. I
also attempted to change the AVX512 fallback from the current algorithm
in the avx512 implementation to the SSE original implementation, but I am
not seeing any real difference for this use case in performance.

I'm curious about where exactly the regression is coming from. Is it
possible that your build for the SSE 4.2 tests was using it
unconditionally, i.e., optimizing away the function pointer?

--
nathan

#25Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#24)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

I'm curious about where exactly the regression is coming from. Is it possible
that your build for the SSE 4.2 tests was using it unconditionally, i.e.,
optimizing away the function pointer?

I am calling the SSE 4.2 implementation directly; I am not even building the pg_sse42_*_choose.c file with the AVX512 choice. As best I can tell there is one extra function call and one extra int64 conditional test when bytes are <256, and of course a JMP instruction to skip the AVX512 implementation.

Paul

#26Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#25)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Mon, Aug 26, 2024 at 06:44:55PM +0000, Amonson, Paul D wrote:

I'm curious about where exactly the regression is coming from. Is it possible
that your build for the SSE 4.2 tests was using it unconditionally, i.e.,
optimizing away the function pointer?

I am calling the SSE 4.2 implementation directly; I am not even building
the pg_sse42_*_choose.c file with the AVX512 choice. As best I can tell
there is one extra function call and one extra int64 conditional test
when bytes are <256, and of course a JMP instruction to skip the AVX512
implementation.

And this still shows the ~14% regression in your original post?

--
nathan

#27Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#26)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

And this still shows the ~14% regression in your original post?

At the small buffer sizes the margin of error or "noise" is larger, 7-11%. My average could be just bad luck. It will take me a while to re-setup for full data collection runs but I can try it again if you like.

Paul

#28Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#27)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Mon, Aug 26, 2024 at 06:54:58PM +0000, Amonson, Paul D wrote:

And this still shows the ~14% regression in your original post?

At the small buffer sizes the margin of error or "noise" is larger,
7-11%. My average could be just bad luck. It will take me a while to
re-setup for full data collection runs but I can try it again if you
like.

IMHO that would be useful to establish the current state of the patch set
from a performance standpoint, especially since you've added code intended
to mitigate the regression.

+#define COMP_CRC32C_SMALL(crc, data, len) \
+	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))

My interpretation of Andres's upthread suggestion is that we'd add the
length check within the macro instead of introducing a separate one. We'd
expect the compiler to optimize out comparisons for small lengths known at
compile time and always call the existing implementation (which may still
involve a function pointer in most cases).

--
nathan

#29Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#28)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

IMHO that would be useful to establish the current state of the patch set from
a performance standpoint, especially since you've added code intended to
mitigate the regression.

Ok.

+#define COMP_CRC32C_SMALL(crc, data, len) \
+	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))

My interpretation of Andres's upthread suggestion is that we'd add the length
check within the macro instead of introducing a separate one. We'd expect
the compiler to optimize out comparisons for small lengths known at compile
time and always call the existing implementation (which may still involve a
function pointer in most cases).

How does the m4/compiler know the difference between a const "len" and a dynamic "len"? I already went through the code and changed constant sizes (structure sizes) to the new macro. Can you give an example of how this could work?

Paul

#30Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#29)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Mon, Aug 26, 2024 at 07:15:47PM +0000, Amonson, Paul D wrote:

+#define COMP_CRC32C_SMALL(crc, data, len) \
+	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))

My interpretation of Andres's upthread suggestion is that we'd add the length
check within the macro instead of introducing a separate one. We'd expect
the compiler to optimize out comparisons for small lengths known at compile
time and always call the existing implementation (which may still involve a
function pointer in most cases).

How does the m4/compiler know the difference between a const "len" and a
dynamic "len"? I already went through the code and changed constant sizes
(structure sizes) to the new macro. Can you give an example of how this
could work?

Things like sizeof() and offsetof() are known at compile time, so the
compiler will recognize when a condition is always true or false and
optimize it out accordingly. In cases where the value cannot be known at
compile time, checking the length in the macro and dispatching to a
different implementation may still be advantageous, especially when the
different implementation doesn't involve function pointers.
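
To make the constant-folding point concrete, here is a minimal sketch. It is not code from the patch set: COMP_CRC_SKETCH, CRC_SMALL_CUTOFF, crc_small_path and crc_large_path are hypothetical names, and both paths share one bitwise CRC-32C routine as a stand-in for the real SSE 4.2 and AVX-512 implementations. When len is a sizeof() below the cutoff, an optimizing compiler would be expected to drop the comparison and the large-path branch; when len is only known at run time, the comparison stays and selects a path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff below which the wide implementation is not used. */
#define CRC_SMALL_CUTOFF 256

/* Counters so the dispatch decision can be observed from outside. */
static int small_path_calls = 0;
static int large_path_calls = 0;

/* Bitwise CRC-32C (reflected polynomial 0x82F63B78); stand-in for both paths. */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    assert(data != NULL || len == 0);
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Stand-in for the existing implementation, called directly. */
static inline uint32_t
crc_small_path(uint32_t crc, const void *data, size_t len)
{
    small_path_calls++;
    return crc32c_bitwise(crc, data, len);
}

/* Stand-in for the implementation that only pays off on larger buffers. */
static uint32_t
crc_large_path(uint32_t crc, const void *data, size_t len)
{
    large_path_calls++;
    return crc32c_bitwise(crc, data, len);
}

/*
 * The length check lives inside the macro.  If (len) is a compile-time
 * constant, the condition is constant too and one arm can be discarded.
 */
#define COMP_CRC_SKETCH(crc, data, len) \
    ((crc) = ((len) < CRC_SMALL_CUTOFF) ? \
     crc_small_path((crc), (data), (len)) : \
     crc_large_path((crc), (data), (len)))
```

With this shape, COMP_CRC_SKETCH(crc, &hdr, sizeof(hdr)) on a 16-byte struct resolves to the small path, while a 512-byte buffer whose length is held in a variable goes through the comparison and takes the large path.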

--
nathan

#31Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#30)
3 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Things like sizeof() and offsetof() are known at compile time, so the compiler
will recognize when a condition is always true or false and optimize it out
accordingly. In cases where the value cannot be known at compile time,
checking the length in the macro and dispatching to a different
implementation may still be advantageous, especially when the different
implementation doesn't involve function pointers.

Ok, multiple issues resolved and have new numbers:

1) Implemented the new COMP_CRC32 macro with the comparison and choice of AVX-512 vs. SSE 4.2 at compile time for static structures.
2) You were right about the baseline numbers; it seems the binaries were compiled with the direct-call version of the SSE 4.2 CRC implementation, thus avoiding the function pointer. I rebuilt with USE_SSE42_CRC32C_WITH_RUNTIME_CHECK for the numbers below.
3) Ran through all the tests again and ended up with no regression (meaning run sets would fall either 0.5% below or 1.5% above the baseline), and the margin of error was MUCH tighter this time at ~3%. :)
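
The direct-call versus function-pointer distinction in item 2 can be sketched as follows. This is only an illustration under stated assumptions: comp_crc, crc_choose, crc_fast, crc_portable and fast_path_available are hypothetical names, the feature test is hard-wired to true instead of querying CPUID, and the "checksum" is a toy byte sum rather than a real CRC. The point is the shape: with a runtime check every call goes through a pointer that the first call rewrites, whereas a direct-call build skips that indirection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc_fn) (uint32_t crc, const void *data, size_t len);

/* How many times the chooser ran; it should run once per process. */
static int choose_calls = 0;

/* Hypothetical stand-in for a CPUID-based hardware feature test. */
static bool
fast_path_available(void)
{
    return true;
}

/* Toy checksum (byte sum), used only so the two paths are comparable. */
static uint32_t
toy_sum(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    assert(data != NULL || len == 0);
    while (len--)
        crc += *p++;
    return crc;
}

static uint32_t
crc_portable(uint32_t crc, const void *data, size_t len)
{
    return toy_sum(crc, data, len);
}

static uint32_t
crc_fast(uint32_t crc, const void *data, size_t len)
{
    return toy_sum(crc, data, len);
}

static uint32_t crc_choose(uint32_t crc, const void *data, size_t len);

/* All callers go through this pointer; it starts out at the chooser. */
static crc_fn comp_crc = crc_choose;

/* First call: pick an implementation, repoint comp_crc, then forward. */
static uint32_t
crc_choose(uint32_t crc, const void *data, size_t len)
{
    choose_calls++;
    comp_crc = fast_path_available() ? crc_fast : crc_portable;
    return comp_crc(crc, data, len);
}
```

After the first comp_crc() call the pointer no longer refers to crc_choose, so later calls pay only for the indirect call itself. A baseline built with a direct call avoids even that, which would make the two baselines not directly comparable.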

New Table of Rates (looks correct with fixed font width) below:

+------------------+-----------------+----------------+-----------------+-------+------+
| Rate in bytes/us |    SDP (SPR)    |      m6i       |       m7i       |       |      |
+------------------+-----------------+----------------+-----------------+ Multi-|      |
| higher is better | SSE42  | AVX512 | SSE42 | AVX512 | SSE42  | AVX512 | plier |  %   |
+==================+========+========+=======+========+========+========+=======+======+
| AVG Rate 64-8192 | 10,095 | 82,101 | 8,591 | 38,652 | 11,867 | 83,194 | 6.68  | 568% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+
| AVG Rate 64-255  |  9,034 |  9,136 | 7,619 |  7,437 |  9,030 |  9,293 | 1.01  |   1% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+

* With a data profile of 99% of buffer sizes <256 bytes, the improvement is still 6% and will not regress (except within the margin of error)!
* There is no longer a regression (previously a 14% regression was observed).
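
For readers who want to re-derive the last two table columns, here is a small sketch. It rests on an assumption: the message does not say how the multiplier was computed, but the published values are consistent with dividing the summed AVX-512 rates by the summed SSE 4.2 rates across the three machines. The function names rate_multiplier and percent_improvement are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Ratio of the summed optimized rates to the summed baseline rates.
 * (Equivalent to the ratio of the mean rates, since n cancels out.)
 */
static double
rate_multiplier(const double *baseline, const double *optimized, size_t n)
{
    double base_sum = 0.0;
    double opt_sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
    {
        base_sum += baseline[i];
        opt_sum += optimized[i];
    }
    return opt_sum / base_sum;
}

/* A multiplier of 1.0 means no change, so subtract it before scaling. */
static double
percent_improvement(double multiplier)
{
    return (multiplier - 1.0) * 100.0;
}
```

Feeding in the 64-8192 row (SSE42: 10,095 / 8,591 / 11,867; AVX512: 82,101 / 38,652 / 83,194) gives 203,947 / 30,553, roughly 6.68, or about 568%. The 64-255 row gives 25,866 / 25,683, roughly 1.01, or about 1%.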

Thanks for the pointers!!!
Paul

Attachments:

0001-v4-Refactor-Move-all-HW-checks-to-common-file.patch (application/octet-stream)
From 16693caca491f9d52cff463dfc85bbbd54df9064 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH] [Refactor] Move all HW checks to common file.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 configure                            |  12 +-
 configure.ac                         |   2 +-
 src/include/port/pg_bitutils.h       |   1 -
 src/include/port/pg_hw_feat_check.h  |  33 ++++++
 src/port/Makefile                    |   9 +-
 src/port/meson.build                 |   2 +-
 src/port/pg_bitutils.c               |  22 +---
 src/port/pg_crc32c_sse42_choose.c    |  27 +----
 src/port/pg_hw_feat_check.c          | 159 +++++++++++++++++++++++++++
 src/port/pg_popcount_avx512_choose.c | 102 -----------------
 10 files changed, 208 insertions(+), 161 deletions(-)
 create mode 100644 src/include/port/pg_hw_feat_check.h
 create mode 100644 src/port/pg_hw_feat_check.c
 delete mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/configure b/configure
index 2abbeb2794..5be6fb4d5f 100755
--- a/configure
+++ b/configure
@@ -14868,7 +14868,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14914,7 +14914,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14938,7 +14938,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14983,7 +14983,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15007,7 +15007,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17674,7 +17674,7 @@ fi
 
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
 
 $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
diff --git a/configure.ac b/configure.ac
index c46ed2c591..2e64f53898 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2090,7 +2090,7 @@ if test x"$host_cpu" = x"x86_64"; then
     PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
     AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
   fi
 fi
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 4d88478c9c..263f27930d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -312,7 +312,6 @@ extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int
  * files.
  */
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
new file mode 100644
index 0000000000..58be900b54
--- /dev/null
+++ b/src/include/port/pg_hw_feat_check.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.h
+ *	  Miscellaneous functions for checking for hardware features at runtime.
+ *
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_hw_feat_check.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_HW_FEAT_CHECK_H
+#define PG_HW_FEAT_CHECK_H
+
+/*
+ * Test to see if all hardware features required by SSE 4.2 crc32c (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_sse42_available(void);
+
+/*
+ * Test to see if all hardware features required by SSE 4.1 POPCNT (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_popcount_available(void);
+
+/*
+ * Test to see if all hardware features required by AVX-512 POPCNT are
+ * available.
+ */
+extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+#endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index db7c02117b..b18710eeef 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_hw_feat_check.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
@@ -93,10 +94,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
-# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
-pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+# all versions of pg_hw_feat_check.o need CFLAGS_XSAVE
+pg_hw_feat_check.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_shlib.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_srv.o:	CFLAGS+=$(CFLAGS_XSAVE)
 
 # all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
 pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
diff --git a/src/port/meson.build b/src/port/meson.build
index ff54b7b53e..f8cafc4bd4 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -86,7 +86,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
-  ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
+  ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
 
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 87f56e82b8..b2823d5732 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -20,7 +20,7 @@
 #endif
 
 #include "port/pg_bitutils.h"
-
+#include "port/pg_hw_feat_check.h"
 
 /*
  * Array giving the position of the left-most set bit for each possible
@@ -109,7 +109,6 @@ static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
@@ -127,25 +126,6 @@ uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask)
 
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
-
 /*
  * These functions get called on the first call to pg_popcount32 etc.
  * They detect whether we can use the asm implementations, and replace
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 56d600f3a9..36e6949362 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,31 +20,8 @@
 
 #include "c.h"
 
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
 #include "port/pg_crc32c.h"
-
-static bool
-pg_crc32c_sse42_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
-}
+#include "port/pg_hw_feat_check.h"
 
 /*
  * This gets called on the first call. It replaces the function pointer
@@ -61,4 +38,4 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	return pg_comp_crc32c(crc, data, len);
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
new file mode 100644
index 0000000000..455005add5
--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,159 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;
+
+/*
+ * Helper function.
+ * Test for a bit being set in a exx_t register.
+ */
+inline static bool is_bit_set_in_exx(exx_t* regs, reg_name ex, int bit)
+{
+	return ((regs[ex] & (1 << bit)) != 0);
+}
+
+/*
+ * x86_64 Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * x86_64 Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline static bool
+osxsave_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 27); /* osxsave */
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+inline static bool
+zmm_regs_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+inline static bool
+avx512_popcnt_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 14) && is_bit_set_in_exx(exx, EBX, 30);
+}
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+bool PGDLLIMPORT pg_popcount_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 23);
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ *
+ * PA: The call to 'osxsave_available' MUST precede the call to
+ *     'zmm_regs_available' function per NB above.
+ */
+bool PGDLLIMPORT pg_popcount_avx512_available(void)
+{
+	return osxsave_available() &&
+		zmm_regs_available() &&
+		avx512_popcnt_available();
+}
+
+/*
+ * Does CPUID say there's support for SSE 4.2?
+ */
+bool PGDLLIMPORT pg_crc32c_sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 20);
+}
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
deleted file mode 100644
index b37107803a..0000000000
--- a/src/port/pg_popcount_avx512_choose.c
+++ /dev/null
@@ -1,102 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_popcount_avx512_choose.c
- *    Test whether we can use the AVX-512 pg_popcount() implementation.
- *
- * Copyright (c) 2024, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *    src/port/pg_popcount_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-#include "c.h"
-
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE_XSAVE_INTRINSICS
-#include <immintrin.h>
-#endif
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
-#include "port/pg_bitutils.h"
-
-/*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
- * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
- */
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
-#endif							/* TRY_POPCNT_FAST */
-- 
2.34.1
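
The bit numbers and the 0xe6 mask used in pg_hw_feat_check.c above can be read more easily when decoded from saved register values instead of live CPUID/XGETBV calls. The following is a sketch under that assumption: cpuid_regs, bit_set, the XCR0_* names and avx512_popcnt_usable are hypothetical, and the inputs are plain integers so the logic can be exercised on any machine. The XCR0 bit meanings (SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM) follow the Intel SDM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the four registers returned by one CPUID call. */
typedef struct
{
    uint32_t eax;
    uint32_t ebx;
    uint32_t ecx;
    uint32_t edx;
} cpuid_regs;

/* XCR0 state components that must be OS-enabled before using ZMM registers. */
#define XCR0_SSE        (1u << 1)
#define XCR0_AVX        (1u << 2)
#define XCR0_OPMASK     (1u << 5)
#define XCR0_ZMM_HI256  (1u << 6)
#define XCR0_HI16_ZMM   (1u << 7)
#define XCR0_AVX512_STATE \
    (XCR0_SSE | XCR0_AVX | XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/* Test one bit; unsigned shifts avoid signed-overflow issues at bit 31. */
static inline bool
bit_set(uint32_t reg, unsigned bit)
{
    assert(bit < 32);
    return ((reg >> bit) & 1u) != 0;
}

/*
 * Same order of checks as the patch: OSXSAVE first (leaf 1, ECX bit 27),
 * then the XCR0 mask, then the leaf 7 feature bits (ECX bit 14 for
 * AVX512-VPOPCNTDQ, EBX bit 30 for AVX512-BW).
 */
static bool
avx512_popcnt_usable(const cpuid_regs *leaf1, const cpuid_regs *leaf7,
                     uint64_t xcr0)
{
    if (!bit_set(leaf1->ecx, 27))
        return false;
    if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE)
        return false;
    return bit_set(leaf7->ecx, 14) && bit_set(leaf7->ebx, 30);
}
```

The named components OR together to 0xe6, the literal the patch compares against. An XCR0 value of 0x07 (x87, SSE and AVX only) fails the mask test even when the CPUID feature bits are set.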

0002-v4-Feat-Add-support-for-the-SIMD-AVX-512-crc32c-algorit.patch (application/octet-stream)
From 6751e8a6114ce5ca9920c4e18ec2d2a48278bdde Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Fri, 9 Aug 2024 08:00:09 -0700
Subject: [PATCH] [Feat] Add support for the SIMD AVX-512 crc32c algorithm.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4                |  48 ++++++
 configure                           | 213 ++++++++++++++++++++-----
 configure.ac                        | 106 +++++++-----
 meson.build                         |  40 ++++-
 src/include/pg_config.h.in          |   3 +
 src/include/port/pg_crc32c.h        |  23 +++
 src/include/port/pg_hw_feat_check.h |   9 +-
 src/port/Makefile                   |   5 +
 src/port/meson.build                |   4 +
 src/port/pg_crc32c_avx512.c         | 239 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_avx512_choose.c  |  42 +++++
 src/port/pg_hw_feat_check.c         |  71 ++++++++-
 12 files changed, 717 insertions(+), 86 deletions(-)
 create mode 100644 src/port/pg_crc32c_avx512.c
 create mode 100644 src/port/pg_crc32c_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..1d33932cb5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -628,6 +628,54 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the AVX-512 intrinsics used for CRC
+# calculations, such as _mm512_clmulepi64_epi128 and
+# _mm512_ternarylogic_epi64.
+#
+# (The test program also uses the 8-byte SSE 4.2 variant, _mm_crc32_u64.)
+#
+# Optional compiler flags can be passed as an argument (e.g. -msse4.2
+# -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
+# pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const unsigned long k1k2[[8]] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[[512]];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
 
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
diff --git a/configure b/configure
index 5be6fb4d5f..fca02db11d 100755
--- a/configure
+++ b/configure
@@ -17767,6 +17767,123 @@ fi
 
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2, -mavx512vl and -mvpclmulqdq if that's required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics_=yes
+else
+  pgac_cv_avx512_crc32_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
+  CFLAGS_CRC=""
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+else
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
+  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17939,31 +18056,42 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+   # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -17982,44 +18110,53 @@ $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
 $as_echo "SSE 4.2" >&6; }
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
+$as_echo "AVX 512 with runtime check" >&6; }
+  else
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
 $as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+    else
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      else
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+        else
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+          else
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
+          fi
         fi
       fi
     fi
diff --git a/configure.ac b/configure.ac
index 2e64f53898..ce68dce9d2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2107,6 +2107,17 @@ if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
   PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2, -mavx512vl and -mvpclmulqdq if that's required.
+PGAC_AVX512_CRC32_INTRINSICS([])
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2152,31 +2163,42 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -2191,29 +2213,35 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
   PG_CRC32C_OBJS="pg_crc32c_sse42.o"
   AC_MSG_RESULT(SSE 4.2)
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    AC_MSG_RESULT(AVX 512 with runtime check)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      AC_MSG_RESULT(SSE 4.2 with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        AC_MSG_RESULT(ARMv8 CRC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
         else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            AC_MSG_RESULT(LoongArch CRCC instructions)
+          else
+            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            AC_MSG_RESULT(slicing-by-8)
+          fi
         fi
       fi
     fi
diff --git a/meson.build b/meson.build
index cd711c6d01..1ddd1bed40 100644
--- a/meson.build
+++ b/meson.build
@@ -2245,6 +2245,34 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
+    avx_prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+  const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+}
+'''
 
     prog = '''
 #include <nmmintrin.h>
@@ -2259,12 +2287,20 @@ int main(void)
 }
 '''
 
-    if cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(avx_prog,
+          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
+          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
+      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
+      cdata.set('USE_AVX512_CRC32C', false)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
+    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
+    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
           args: test_c_args + ['-msse4.2'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 979925cc2e..ea797f13f3 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -739,6 +739,9 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..ade06dbcab 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -49,6 +49,14 @@ typedef uint32 pg_crc32c;
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C)
+/* Use Intel AVX512 instructions. */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_ARMV8_CRC32C)
 /* Use ARMv8 CRC Extension instructions. */
 
@@ -67,6 +75,21 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+/*
+ * Use Intel AVX-512 instructions, but perform a runtime check first to check that
+ * they are available.
+ */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
index 58be900b54..21ee8615e1 100644
--- a/src/include/port/pg_hw_feat_check.h
+++ b/src/include/port/pg_hw_feat_check.h
@@ -30,4 +30,11 @@ extern PGDLLIMPORT bool pg_popcount_available(void);
  * available.
  */
 extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
-#endif							/* PG_HW_FEAT_CHECK_H */
+
+/*
+ * Test to see if all hardware features required by the AVX-512 SIMD
+ * algorithm are available.
+ */
+extern bool pg_crc32c_avx512_available(void);
+
+#endif						/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index b18710eeef..35445d88f1 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -89,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
+pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
+
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index f8cafc4bd4..31d50a7a3b 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,10 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..be42a34a73
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,239 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+/*
+ * Process eight bytes of data at a time.
+ *
+ * NB: We do unaligned accesses here. The Intel architecture allows that,
+ * and performance testing didn't show any performance gain from aligning
+ * the begin address.
+ */
+pg_attribute_no_sanitize_alignment()
+inline static pg_crc32c
+crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
+{
+	const unsigned char *pend = p + length;
+
+	/*
+	 * Process eight bytes of data at a time.
+	 *
+	 * NB: We do unaligned accesses here. The Intel architecture allows that,
+	 * and performance testing didn't show any performance gain from aligning
+	 * the begin address.
+	 */
+	while (p + 8 <= pend)
+	{
+		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
+		p += 8;
+	}
+
+	/* Process remaining full four bytes if any */
+	if (p + 4 <= pend)
+	{
+		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
+		p += 4;
+	}
+
+	/* Process any remaining bytes one at a time. */
+	while (p < pend)
+	{
+		crc = _mm_crc32_u8(crc, *p);
+		p++;
+	}
+
+	return crc;
+}
+
+/*******************************************************************
+ * pg_comp_crc32c_avx512(): compute the crc32c of the buffer. Inputs of
+ * at least 256 bytes use the AVX-512 folding path; shorter inputs and
+ * any remaining tail go through crc32c_fallback(). Based on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009
+ *
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+pg_attribute_no_sanitize_alignment()
+inline pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+
+		/*
+		* There's at least one block of 256.
+		*/
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_loadu_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		* Parallel fold blocks of 256, if any.
+		*/
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_loadu_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes.
+	 */
+	return crc32c_fallback(crc, input, length);
+}
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
new file mode 100644
index 0000000000..4f11c278be
--- /dev/null
+++ b/src/port/pg_crc32c_avx512_choose.c
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512_choose.c
+ *	  Choose between Intel AVX-512 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel
+ * AVX-512. If it does, use the special AVX-512 instructions for CRC-32C
+ * computation. Otherwise, fall back to the pure software implementation
+ * (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include "port/pg_crc32c.h"
+#include "port/pg_hw_feat_check.h"
+
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static pg_crc32c
+pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
+{
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+	else
+		pg_comp_crc32c = pg_comp_crc32c_sb8;
+
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 455005add5..35d6f9cdb1 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -132,9 +132,60 @@ bool PGDLLIMPORT pg_popcount_available(void)
 	return is_bit_set_in_exx(exx, ECX, 23);
  }
 
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline static bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline static bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, ECX, 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline static bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 31); /* avx512-vl */
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline static bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set_in_exx(exx, ECX, 20); /* sse4.2 */
+}
+
+/****************************************************************************/
+/*                               Public API                                 */
+/****************************************************************************/
  /*
-  * Returns true if the CPU supports the instructions required for the AVX-512
-  * pg_popcount() implementation.
+  * Returns true if the CPU supports the instructions required for the
+  * AVX-512 pg_popcount() implementation.
   *
   * PA: The call to 'osxsave_available' MUST preceed the call to
   *     'zmm_regs_available' function per NB above.
@@ -151,9 +202,17 @@ bool PGDLLIMPORT pg_popcount_avx512_available(void)
  */
 bool PGDLLIMPORT pg_crc32c_sse42_available(void)
 {
-	exx_t exx[4] = {0, 0, 0, 0};
-
-	pg_getcpuid(1, exx);
+	return sse42_available();
+}
 
-	return is_bit_set_in_exx(exx, ECX, 20);
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+inline bool
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
 }
-- 
2.34.1
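
A note for readers following the dispatch logic in pg_crc32c_avx512_choose.c above: the pattern is a function pointer that initially points at a "choose" routine, which picks an implementation on first call and then replaces itself. The sketch below shows that pattern in portable C, together with a bitwise reference CRC-32C that can be used as a known-answer check for any accelerated path. This is only an illustration, not code from the patch: the names crc32c_reference, crc32c_fast, fast_path_available, crc32c_choose, crc32c_impl and crc32c_checksum are all hypothetical, and fast_path_available() is a stub standing in for a real CPUID-based check such as pg_crc32c_avx512_available().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise reference CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow, but easy to verify; useful for cross-checking a faster path.
 */
static uint32_t
crc32c_reference(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Hypothetical stub: a real build would query CPUID here. */
static int
fast_path_available(void)
{
	return 0;
}

/* Hypothetical stand-in for an accelerated implementation. */
static uint32_t
crc32c_fast(uint32_t crc, const void *data, size_t len)
{
	return crc32c_reference(crc, data, len);
}

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Starts out pointing at the chooser; overwritten on first use. */
static uint32_t (*crc32c_impl) (uint32_t, const void *, size_t) = crc32c_choose;

static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
	crc32c_impl = fast_path_available() ? crc32c_fast : crc32c_reference;
	return crc32c_impl(crc, data, len);
}

/* Init, compute, finalize: mirrors INIT_CRC32C / COMP_CRC32C / FIN_CRC32C. */
uint32_t
crc32c_checksum(const void *data, size_t len)
{
	uint32_t	crc = 0xFFFFFFFFu;

	crc = crc32c_impl(crc, data, len);
	return crc ^ 0xFFFFFFFFu;
}
```

With this sketch, crc32c_checksum("123456789", 9) yields 0xE3069283, the standard CRC-32C check value, so the same call can serve as a quick sanity test when swapping in a different implementation behind crc32c_impl.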

Attachment: 0003-v4-Feat-New-COMP_CRC32C-macro-for-AVX512-simplify-code-.patch (application/octet-stream)
From 993837dfa76beec52a41fd54cb44dd77a0f0d6b5 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 27 Aug 2024 08:26:19 -0700
Subject: [PATCH] [Feat] New COMP_CRC32C macro for AVX512, simplify code some.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
 configure                    |  2 +-
 configure.ac                 |  2 +-
 src/include/port/pg_crc32c.h | 17 ++++++-------
 src/port/meson.build         |  1 +
 src/port/pg_crc32c_avx512.c  | 46 ++----------------------------------
 5 files changed, 12 insertions(+), 56 deletions(-)

diff --git a/configure b/configure
index fca02db11d..7dcc4b9f5d 100755
--- a/configure
+++ b/configure
@@ -18114,7 +18114,7 @@ else
 
 $as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
     { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
 $as_echo "AVX 512 with runtime check" >&6; }
   else
diff --git a/configure.ac b/configure.ac
index ce68dce9d2..99ab8bd5d6 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2215,7 +2215,7 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
 else
   if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
     AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
     AC_MSG_RESULT(AVX 512 with runtime check)
   else
     if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index ade06dbcab..3f83d9f815 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -40,12 +40,12 @@ typedef uint32 pg_crc32c;
 /* The INIT and EQ macros are the same for all implementations. */
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 #if defined(USE_SSE42_CRC32C)
 /* Use Intel SSE4.2 instructions. */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
@@ -53,7 +53,6 @@ extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t le
 /* Use Intel AVX512 instructions. */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
 
@@ -62,7 +61,6 @@ extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t l
 
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 
@@ -71,7 +69,6 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
@@ -82,14 +79,17 @@ extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_
  * they are available.
  */
 #define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+	((crc) = ((len) < 256 ? \
+		pg_comp_crc32c_sse42((crc), (data), (len)) : \
+		pg_comp_crc32c((crc), (data), (len))))
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
 
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
 
+extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
@@ -98,7 +98,6 @@ extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t l
  */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
@@ -121,13 +120,11 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
 #ifdef WORDS_BIGENDIAN
+#undef FIN_CRC32C
 #define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
-#else
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 #endif
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-
 #endif
 
 #endif							/* PG_CRC32C_H */
diff --git a/src/port/meson.build b/src/port/meson.build
index 31d50a7a3b..6a796411b4 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -87,6 +87,7 @@ replace_funcs_pos = [
   ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
   ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_sse42', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
index be42a34a73..98353f7e1d 100644
--- a/src/port/pg_crc32c_avx512.c
+++ b/src/port/pg_crc32c_avx512.c
@@ -18,48 +18,6 @@
 
 #include "port/pg_crc32c.h"
 
-/*
- * Process eight bytes of data at a time.
- *
- * NB: We do unaligned accesses here. The Intel architecture allows that,
- * and performance testing didn't show any performance gain from aligning
- * the begin address.
- */
-pg_attribute_no_sanitize_alignment()
-inline static pg_crc32c
-crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
-{
-	const unsigned char *pend = p + length;
-
-	/*
-	 * Process eight bytes of data at a time.
-	 *
-	 * NB: We do unaligned accesses here. The Intel architecture allows that,
-	 * and performance testing didn't show any performance gain from aligning
-	 * the begin address.
-	 */
-	while (p + 8 <= pend)
-	{
-		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
-		p += 8;
-	}
-
-	/* Process remaining full four bytes if any */
-	if (p + 4 <= pend)
-	{
-		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
-		p += 4;
-	}
-
-	/* Process any remaining bytes one at a time. */
-	while (p < pend)
-	{
-		crc = _mm_crc32_u8(crc, *p);
-		p++;
-	}
-
-	return crc;
-}
 
 /*******************************************************************
  * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
@@ -233,7 +191,7 @@ pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
 	}
 
 	/*
-	 * Finish any remaining bytes.
+	 * Finish any remaining bytes with the SSE 4.2 implementation.
 	 */
-	return crc32c_fallback(crc, input, length);
+	return pg_comp_crc32c_sse42(crc, input, length);
 }
-- 
2.34.1
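
For readers following the COMP_CRC32C change in the patch above, here is a minimal standalone sketch of the length-based dispatch: calls shorter than 256 bytes go straight to the small-buffer kernel, and everything else goes through the runtime-selected function pointer. All names here (crc32c_small, crc32c_wide, COMP_CRC32C_SKETCH, CRC32C_WIDE_MIN_LEN) are hypothetical, and both kernels are stand-ins built on a plain bitwise CRC32C, not the SSE 4.2 or AVX-512 code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff mirroring the 256-byte threshold used in the patch. */
#define CRC32C_WIDE_MIN_LEN 256

/* Bitwise CRC32C (reflected polynomial 0x82F63B78); stand-in for both kernels. */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Call counters so the chosen path can be observed. */
static unsigned long small_calls;
static unsigned long wide_calls;

/* Stand-in for the small-buffer (SSE 4.2 style) kernel. */
static uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
	small_calls++;
	return crc32c_bitwise(crc, data, len);
}

/* Stand-in for the wide (AVX-512 style) kernel; only valid for large inputs. */
static uint32_t
crc32c_wide(uint32_t crc, const void *data, size_t len)
{
	assert(len >= CRC32C_WIDE_MIN_LEN);
	wide_calls++;
	return crc32c_bitwise(crc, data, len);
}

/* Stand-in for the function pointer that a runtime check would fill in. */
static uint32_t (*crc32c_wide_ptr) (uint32_t, const void *, size_t) = crc32c_wide;

/* Same shape as the macro in the patch: short inputs skip the pointer call. */
#define COMP_CRC32C_SKETCH(crc, data, len) \
	((crc) = ((len) < CRC32C_WIDE_MIN_LEN ? \
		crc32c_small((crc), (data), (len)) : \
		crc32c_wide_ptr((crc), (data), (len))))

/* Convenience wrapper: init, compute, finalize. */
static uint32_t
crc32c_of(const void *data, size_t len)
{
	uint32_t	crc = 0xFFFFFFFFu;

	COMP_CRC32C_SKETCH(crc, data, len);
	return crc ^ 0xFFFFFFFFu;
}
```

One thing the sketch makes visible: the macro expands (len) more than once, so an argument with side effects would be evaluated twice. An inline function would avoid that at the cost of not being a drop-in macro replacement.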

#32Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#31)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi all,

I will be retiring from Intel at the end of this week. I wanted to introduce the engineer who will be taking over the CRC32c proposal and commit fest entry.

Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>

I have brought him up to speed and he will be the go-to for technical review comments and questions. Please welcome him into the community.

Thanks,
Paul

#33Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Amonson, Paul D (#32)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Thank you for the introduction, Paul.

Hi all, I'm currently in the process of reviewing and analyzing Paul's patch. In the meantime, I'm open to addressing any questions or feedback you may have.


#34Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#33)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Tue, Oct 08, 2024 at 08:19:27PM +0000, Devulapalli, Raghuveer wrote:

Hi all, I'm currently in the process of reviewing and analyzing Paul's
patch. In the meantime, I'm open to addressing any questions or feedback
you may have.

I've proposed a patch to move the existing AVX-512 code in Postgres to use
__attribute__((target("..."))) instead of per-translation-unit compiler
flags [0]. We should likely do something similar for this one.

[0]: /messages/by-id/ZxAqRG1-8fJLMRUY@nathan

--
nathan
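
As an illustration of the per-function approach mentioned above, here is a minimal sketch assuming GCC or Clang on x86-64. The function names are hypothetical and this is not the code from the referenced patch. The SSE 4.2 kernel carries __attribute__((target("sse4.2"))), so the file builds without -msse4.2, and __builtin_cpu_supports() guards the call at run time. A portable bitwise loop is used on other platforms or older CPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>
#define SKETCH_HAVE_TARGET_SSE42 1
#endif

/* Portable bitwise CRC32C, used when the SSE 4.2 path is unavailable. */
static uint32_t
crc32c_portable(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

#ifdef SKETCH_HAVE_TARGET_SSE42
/*
 * Only this function is compiled with SSE 4.2 enabled; the rest of the
 * translation unit keeps the default instruction set.
 */
__attribute__((target("sse4.2")))
static uint32_t
crc32c_sse42(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
		crc = _mm_crc32_u8(crc, *p++);
	return crc;
}
#endif

/* Compute a finalized CRC32C, picking the kernel at run time. */
uint32_t
crc32c(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFFu;

	assert(p != NULL || len == 0);
#ifdef SKETCH_HAVE_TARGET_SSE42
	if (__builtin_cpu_supports("sse4.2"))
		return crc32c_sse42(crc, p, len) ^ 0xFFFFFFFFu;
#endif
	return crc32c_portable(crc, p, len) ^ 0xFFFFFFFFu;
}
```

The practical difference from per-file flags is that the compiler cannot accidentally emit SSE 4.2 (or AVX-512) instructions in code outside the attributed function, so no separate translation unit is needed just to isolate the flags.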

#35Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#34)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

I've proposed a patch to move the existing AVX-512 code in Postgres to use
__attribute__((target("..."))) instead of per-translation-unit compiler flags [0]. We
should likely do something similar for this one.

[0] /messages/by-id/ZxAqRG1-8fJLMRUY@nathan

I assume this will be committed separately and then I can rebase?


#36Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#34)
6 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Here is the latest set of patches, built on top of your patch to use __attribute__(target) for AVX-512 popcount. A couple of changes were made:

(1) The SSE4.2 and AVX-512 CRC32C implementations now also use function attributes to build with ISA-specific flags.
(2) Fixed a bug in the compile-time and runtime checks for the AVX-512 CRC32C code, which caused performance regressions on SKX in the earlier version of the patch.

Raghuveer

-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Friday, October 18, 2024 9:32 AM
To: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Cc: Bruce Momjian <bruce@momjian.us>; Alvaro Herrera <alvherre@alvh.no-ip.org>;
Andres Freund <andres@anarazel.de>; pgsql-hackers@lists.postgresql.org;
Shankaran, Akash <akash.shankaran@intel.com>
Subject: Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Tue, Oct 08, 2024 at 08:19:27PM +0000, Devulapalli, Raghuveer wrote:

Hi all, I'm currently in the process of reviewing and analyzing Paul's
patch. In the meantime, I'm open to addressing any questions or
feedback you may have.

I've proposed a patch to move the existing AVX-512 code in Postgres to use
__attribute__((target("..."))) instead of per-translation-unit compiler flags [0]. We
should likely do something similar for this one.

[0] /messages/by-id/ZxAqRG1-8fJLMRUY@nathan

--
nathan

Attachments:

v5-0001-Add-a-Postgres-SQL-function-for-crc32c-testing.patch (application/octet-stream)
From b601e7b4ee9f25fd32e9d8d056bb20a03d755a8a Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v5 1/6] Add a Postgres SQL function for crc32c testing.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/test/modules/test_crc32c/Makefile         | 20 +++++++++
 .../modules/test_crc32c/test_crc32c--1.0.sql  |  1 +
 src/test/modules/test_crc32c/test_crc32c.c    | 41 +++++++++++++++++++
 .../modules/test_crc32c/test_crc32c.control   |  4 ++
 4 files changed, 66 insertions(+)
 create mode 100644 src/test/modules/test_crc32c/Makefile
 create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

diff --git a/src/test/modules/test_crc32c/Makefile b/src/test/modules/test_crc32c/Makefile
new file mode 100644
index 0000000000..5b747c6184
--- /dev/null
+++ b/src/test/modules/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o:	CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_crc32c/test_crc32c--1.0.sql b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 0000000000..32f8f0fb2e
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c  (count int, num int) RETURNS bigint AS 'test_crc32c.so' LANGUAGE C;
diff --git a/src/test/modules/test_crc32c/test_crc32c.c b/src/test/modules/test_crc32c/test_crc32c.c
new file mode 100644
index 0000000000..5273158faf
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.c
@@ -0,0 +1,41 @@
+/* select drive_crc32c(1000000, 1024); */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+
+#include "port/pg_crc32c.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64			count	= PG_GETARG_INT32(0);
+	int64			num		= PG_GETARG_INT32(1);
+	pg_crc32c		crc		= 0xFFFFFFFF;
+	const char*		data	= malloc((size_t)num);
+
+	INIT_CRC32C(crc);
+
+	while(count--)
+	{
+		memset((void*)data, count, (size_t)Min(16,num));
+		COMP_CRC32C(crc, data, num);
+	}
+
+	FIN_CRC32C(crc);
+
+	free((void *)data);
+
+	PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/src/test/modules/test_crc32c/test_crc32c.control b/src/test/modules/test_crc32c/test_crc32c.control
new file mode 100644
index 0000000000..878a077ee1
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_crc32c'
+relocatable = true
-- 
2.43.0
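
The driver in the patch above follows the usual INIT / COMP / FIN pattern. Here is a minimal standalone sketch of the same loop outside of Postgres, assuming a plain bitwise CRC32C in place of the real kernels. The names (crc32c_update, INIT_CRC, COMP_CRC, FIN_CRC, drive_crc32c_sketch) are hypothetical stand-ins for the Postgres macros and function. One deliberate difference: the buffer is allocated with calloc() and the allocation is checked, so the bytes past the first 16 have defined contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t crc32c_t;

/* Bitwise CRC32C update step; stand-in for the optimized implementations. */
static crc32c_t
crc32c_update(crc32c_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Hypothetical stand-ins for INIT_CRC32C / COMP_CRC32C / FIN_CRC32C. */
#define INIT_CRC(crc)			((crc) = 0xFFFFFFFFu)
#define COMP_CRC(crc, data, len)	((crc) = crc32c_update((crc), (data), (len)))
#define FIN_CRC(crc)			((crc) ^= 0xFFFFFFFFu)

/*
 * Run `count` CRC passes over a `num`-byte buffer, perturbing the first
 * bytes on each pass so the work cannot be hoisted out of the loop.
 * Returns 0 on success, -1 if the buffer cannot be allocated.
 */
static int
drive_crc32c_sketch(int count, size_t num, crc32c_t *result)
{
	unsigned char *data;
	crc32c_t	crc;

	assert(result != NULL);
	data = calloc(num ? num : 1, 1);
	if (data == NULL)
		return -1;

	INIT_CRC(crc);
	while (count-- > 0)
	{
		memset(data, count & 0xFF, num < 16 ? num : 16);
		COMP_CRC(crc, data, num);
	}
	FIN_CRC(crc);

	free(data);
	*result = crc;
	return 0;
}
```

Because COMP_CRC assigns into its first argument, the running value can be fed across any number of calls; splitting an input into chunks gives the same result as one call over the whole buffer.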

v5-0002-Move-all-HW-checks-to-common-file.patch (application/octet-stream)
From da26645ec8515e0e6d91e2311a83c3bb6649017e Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH v5 2/6] Move all HW checks to common file.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 configure                            |  12 +-
 configure.ac                         |   2 +-
 src/include/port/pg_bitutils.h       |   1 -
 src/include/port/pg_hw_feat_check.h  |  33 ++++++
 src/port/Makefile                    |   9 +-
 src/port/meson.build                 |   2 +-
 src/port/pg_bitutils.c               |  22 +---
 src/port/pg_crc32c_sse42_choose.c    |  27 +----
 src/port/pg_hw_feat_check.c          | 159 +++++++++++++++++++++++++++
 src/port/pg_popcount_avx512_choose.c | 102 -----------------
 10 files changed, 208 insertions(+), 161 deletions(-)
 create mode 100644 src/include/port/pg_hw_feat_check.h
 create mode 100644 src/port/pg_hw_feat_check.c
 delete mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/configure b/configure
index 3a577e463b..cd43a63892 100755
--- a/configure
+++ b/configure
@@ -14731,7 +14731,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14777,7 +14777,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14801,7 +14801,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14846,7 +14846,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14870,7 +14870,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17440,7 +17440,7 @@ fi
 
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
 
 $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
diff --git a/configure.ac b/configure.ac
index 55f6c46d33..5a7cc3f6f2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2072,7 +2072,7 @@ if test x"$host_cpu" = x"x86_64"; then
     PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
     AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
   fi
 fi
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 4d88478c9c..263f27930d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -312,7 +312,6 @@ extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int
  * files.
  */
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
new file mode 100644
index 0000000000..58be900b54
--- /dev/null
+++ b/src/include/port/pg_hw_feat_check.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.h
+ *	  Miscellaneous functions for checking for hardware features at runtime.
+ *
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_hw_feat_check.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_HW_FEAT_CHECK_H
+#define PG_HW_FEAT_CHECK_H
+
+/*
+ * Test to see if all hardware features required by SSE 4.2 crc32c (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_sse42_available(void);
+
+/*
+ * Test to see if all hardware features required by SSE 4.1 POPCNT (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_popcount_available(void);
+
+/*
+ * Test to see if all hardware features required by AVX-512 POPCNT are
+ * available.
+ */
+extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+#endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 9324ec2d9f..aecfe5f62b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_hw_feat_check.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
@@ -92,10 +93,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
-# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
-pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+# all versions of pg_hw_feat_check.o need CFLAGS_XSAVE
+pg_hw_feat_check.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_shlib.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_srv.o:	CFLAGS+=$(CFLAGS_XSAVE)
 
 # all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
 pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
diff --git a/src/port/meson.build b/src/port/meson.build
index 1150966ab7..907ddce33f 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -85,7 +85,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
-  ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
+  ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
 
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 87f56e82b8..b2823d5732 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -20,7 +20,7 @@
 #endif
 
 #include "port/pg_bitutils.h"
-
+#include "port/pg_hw_feat_check.h"
 
 /*
  * Array giving the position of the left-most set bit for each possible
@@ -109,7 +109,6 @@ static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
@@ -127,25 +126,6 @@ uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask)
 
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
-
 /*
  * These functions get called on the first call to pg_popcount32 etc.
  * They detect whether we can use the asm implementations, and replace
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 56d600f3a9..36e6949362 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,31 +20,8 @@
 
 #include "c.h"
 
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
 #include "port/pg_crc32c.h"
-
-static bool
-pg_crc32c_sse42_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
-}
+#include "port/pg_hw_feat_check.h"
 
 /*
  * This gets called on the first call. It replaces the function pointer
@@ -61,4 +38,4 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	return pg_comp_crc32c(crc, data, len);
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
new file mode 100644
index 0000000000..455005add5
--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,159 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;
+
+/*
+ * Helper function.
+ * Test for a bit being set in a exx_t register.
+ */
+inline static bool is_bit_set_in_exx(exx_t* regs, reg_name ex, int bit)
+{
+	return ((regs[ex] & (1 << bit)) != 0);
+}
+
+/*
+ * x86_64 Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * x86_64 Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline static bool
+osxsave_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 27); /* osxsave */
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+inline static bool
+zmm_regs_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+inline static bool
+avx512_popcnt_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 14) && is_bit_set_in_exx(exx, EBX, 30);
+}
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+bool PGDLLIMPORT pg_popcount_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 23);
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ *
+ * PA: The call to 'osxsave_available' MUST precede the call to
+ *     'zmm_regs_available' function per NB above.
+ */
+bool PGDLLIMPORT pg_popcount_avx512_available(void)
+{
+	 return osxsave_available() &&
+			zmm_regs_available() &&
+			avx512_popcnt_available();
+}
+
+/*
+ * Does CPUID say there's support for SSE 4.2?
+ */
+bool PGDLLIMPORT pg_crc32c_sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 20);
+}
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
deleted file mode 100644
index b37107803a..0000000000
--- a/src/port/pg_popcount_avx512_choose.c
+++ /dev/null
@@ -1,102 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_popcount_avx512_choose.c
- *    Test whether we can use the AVX-512 pg_popcount() implementation.
- *
- * Copyright (c) 2024, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *    src/port/pg_popcount_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-#include "c.h"
-
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE_XSAVE_INTRINSICS
-#include <immintrin.h>
-#endif
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
-#include "port/pg_bitutils.h"
-
-/*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
- * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
- */
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
-#endif							/* TRY_POPCNT_FAST */
-- 
2.43.0
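
The feature checks consolidated by the patch above all reduce to "read a CPUID leaf, test one bit", plus an XGETBV check for the ZMM state. Here is a minimal standalone sketch, assuming GCC or Clang with <cpuid.h> (including __get_cpuid_count) on x86. The helper names are hypothetical. XGETBV is issued through inline assembly here so that no -mxsave flag is needed, which differs from the _xgetbv() intrinsic used in the patch. On other platforms every check simply reports false.

```c
#include <assert.h>
#include <stdbool.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
	(defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

/* Named register indexes, to avoid magic array subscripts. */
enum { REG_EAX = 0, REG_EBX = 1, REG_ECX = 2, REG_EDX = 3 };

/* Return true if `bit` is set in register `reg` of CPUID(leaf, subleaf). */
static bool
cpuid_bit(unsigned int leaf, unsigned int subleaf, int reg, int bit)
{
	assert(reg >= REG_EAX && reg <= REG_EDX);
	assert(bit >= 0 && bit < 32);
#ifdef SKETCH_HAVE_CPUID
	unsigned int r[4] = {0, 0, 0, 0};

	/* Returns 0 when the requested leaf is not supported. */
	if (!__get_cpuid_count(leaf, subleaf, &r[0], &r[1], &r[2], &r[3]))
		return false;
	return (r[reg] & (1u << bit)) != 0;
#else
	(void) leaf;
	(void) subleaf;
	return false;
#endif
}

/* SSE 4.2: leaf 1, ECX bit 20. */
static bool
has_sse42(void)
{
	return cpuid_bit(1, 0, REG_ECX, 20);
}

/* POPCNT: leaf 1, ECX bit 23. */
static bool
has_popcnt(void)
{
	return cpuid_bit(1, 0, REG_ECX, 23);
}

/* Has the OS enabled the opmask and ZMM register state (XCR0 & 0xe6)? */
static bool
zmm_state_enabled(void)
{
#ifdef SKETCH_HAVE_CPUID
	unsigned int lo;
	unsigned int hi;

	/* XGETBV faults unless OSXSAVE (leaf 1, ECX bit 27) is set. */
	if (!cpuid_bit(1, 0, REG_ECX, 27))
		return false;
	__asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));
	(void) hi;
	return (lo & 0xe6u) == 0xe6u;
#else
	return false;
#endif
}

/* AVX-512 VPOPCNTDQ (leaf 7, ECX bit 14) and AVX-512 BW (leaf 7, EBX bit 30). */
static bool
has_avx512_popcnt(void)
{
	return zmm_state_enabled() &&
		cpuid_bit(7, 0, REG_ECX, 14) &&
		cpuid_bit(7, 0, REG_EBX, 30);
}
```

Passing the leaf and subleaf through to a single helper keeps each feature test to one line and makes it harder for a hard-coded leaf number to slip into a generic wrapper.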

v5-0003-Add-support-for-the-SIMD-AVX-512-crc32c-algorithm.patch (application/octet-stream)
From 99a17e7097625f7029695d2e41f7d414fbd020d8 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Fri, 9 Aug 2024 08:00:09 -0700
Subject: [PATCH v5 3/6] Add support for the SIMD AVX-512 crc32c algorithm.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 config/c-compiler.m4                |  48 ++++++
 configure                           | 213 ++++++++++++++++++++-----
 configure.ac                        | 106 +++++++-----
 meson.build                         |  40 ++++-
 src/include/pg_config.h.in          |   3 +
 src/include/port/pg_crc32c.h        |  23 +++
 src/include/port/pg_hw_feat_check.h |   9 +-
 src/port/Makefile                   |   5 +
 src/port/meson.build                |   4 +
 src/port/pg_crc32c_avx512.c         | 239 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_avx512_choose.c  |  42 +++++
 src/port/pg_hw_feat_check.c         |  71 ++++++++-
 12 files changed, 717 insertions(+), 86 deletions(-)
 create mode 100644 src/port/pg_crc32c_avx512.c
 create mode 100644 src/port/pg_crc32c_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..1d33932cb5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -628,6 +628,54 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86 CRC instructions added in AVX-512,
+# using the intrinsic functions:
+
+# (We don't test the 8-byte variant, _mm_crc32_u64, but it is assumed to
+# be present if the other ones are, on x86-64 platforms)
+#
+# Optional compiler flags can be passed as an argument (e.g. -msse4.2
+# -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
+# pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const unsigned long k1k2[[8]] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[[512]];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
 
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
diff --git a/configure b/configure
index cd43a63892..474282e8ba 100755
--- a/configure
+++ b/configure
@@ -17533,6 +17533,123 @@ fi
 
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2, -mavx512vl and -mvpclmulqdq if that's required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics_=yes
+else
+  pgac_cv_avx512_crc32_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
+  CFLAGS_CRC=""
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+else
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
+  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17705,31 +17822,42 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -17748,44 +17876,53 @@ $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
 $as_echo "SSE 4.2" >&6; }
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
+$as_echo "AVX 512 with runtime check" >&6; }
+  else
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
 $as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+    else
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      else
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+        else
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+          else
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
+          fi
         fi
       fi
     fi
diff --git a/configure.ac b/configure.ac
index 5a7cc3f6f2..5d7ececfbc 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2089,6 +2089,17 @@ if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
   PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2, -mavx512vl and -mvpclmulqdq if that's required.
+PGAC_AVX512_CRC32_INTRINSICS([])
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2134,31 +2145,42 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -2173,29 +2195,35 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
   PG_CRC32C_OBJS="pg_crc32c_sse42.o"
   AC_MSG_RESULT(SSE 4.2)
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    AC_MSG_RESULT(AVX 512 with runtime check)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      AC_MSG_RESULT(SSE 4.2 with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        AC_MSG_RESULT(ARMv8 CRC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
         else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            AC_MSG_RESULT(LoongArch CRCC instructions)
+          else
+            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            AC_MSG_RESULT(slicing-by-8)
+          fi
         fi
       fi
     fi
diff --git a/meson.build b/meson.build
index 58e67975e8..fab6373fef 100644
--- a/meson.build
+++ b/meson.build
@@ -2242,6 +2242,34 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
+    avx_prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+  const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+}
+'''
 
     prog = '''
 #include <nmmintrin.h>
@@ -2256,12 +2284,20 @@ int main(void)
 }
 '''
 
-    if cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(avx_prog,
+          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
+          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
+      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
+      cdata.set('USE_AVX512_CRC32C', false)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
+    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
+    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
           args: test_c_args + ['-msse4.2'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 427030f31a..65623df7f9 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -712,6 +712,9 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..ade06dbcab 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -49,6 +49,14 @@ typedef uint32 pg_crc32c;
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C)
+/* Use Intel AVX512 instructions. */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_ARMV8_CRC32C)
 /* Use ARMv8 CRC Extension instructions. */
 
@@ -67,6 +75,21 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+/*
+ * Use Intel AVX-512 instructions, but perform a runtime check first to check that
+ * they are available.
+ */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
index 58be900b54..21ee8615e1 100644
--- a/src/include/port/pg_hw_feat_check.h
+++ b/src/include/port/pg_hw_feat_check.h
@@ -30,4 +30,11 @@ extern PGDLLIMPORT bool pg_popcount_available(void);
  * available.
  */
 extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
-#endif							/* PG_HW_FEAT_CHECK_H */
+
+/*
+ * Test to see if all hardware features required by the AVX-512 SIMD
+ * algorithm are available.
+ */
+extern bool pg_crc32c_avx512_available(void);
+
+#endif						/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index aecfe5f62b..b72deed50e 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -88,6 +88,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
+pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
+
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 907ddce33f..e3b05622d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -83,6 +83,10 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..be42a34a73
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,239 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+/*
+ * crc32c_fallback(): compute CRC-32C with the SSE 4.2 crc32 instructions,
+ * for data that the AVX-512 code below does not process.
+ *
+ * Accepts any length, including zero: eight bytes per loop iteration, then
+ * at most one four-byte word, then single bytes.
+ */
+pg_attribute_no_sanitize_alignment()
+inline static pg_crc32c
+crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
+{
+	const unsigned char *pend = p + length;
+
+	/*
+	 * Process eight bytes of data at a time.
+	 *
+	 * NB: We do unaligned accesses here. The Intel architecture allows that,
+	 * and performance testing didn't show any performance gain from aligning
+	 * the begin address.
+	 */
+	while (p + 8 <= pend)
+	{
+		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
+		p += 8;
+	}
+
+	/* Process remaining full four bytes if any */
+	if (p + 4 <= pend)
+	{
+		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
+		p += 4;
+	}
+
+	/* Process any remaining bytes one at a time. */
+	while (p < pend)
+	{
+		crc = _mm_crc32_u8(crc, *p);
+		p++;
+	}
+
+	return crc;
+}
+
+/*******************************************************************
+ * pg_comp_crc32c_avx512(): compute the crc32c of the buffer. The
+ * AVX-512 code needs at least 256 bytes of input and is skipped for
+ * shorter buffers (see the length check below). Based on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009
+ *
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+pg_attribute_no_sanitize_alignment()
+inline pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+
+		/*
+		 * There's at least one block of 256.
+		 */
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_loadu_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		 * Parallel fold blocks of 256, if any.
+		 */
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_loadu_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes.
+	 */
+	return crc32c_fallback(crc, input, length);
+}
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
new file mode 100644
index 0000000000..4f11c278be
--- /dev/null
+++ b/src/port/pg_crc32c_avx512_choose.c
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512_choose.c
+ *	  Choose between Intel AVX-512 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel
+ * AVX-512. If it does, use the special AVX-512 instructions for CRC-32C
+ * computation. Otherwise, fall back to the pure software implementation
+ * (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include "port/pg_crc32c.h"
+#include "port/pg_hw_feat_check.h"
+
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static pg_crc32c
+pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
+{
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+	else
+		pg_comp_crc32c = pg_comp_crc32c_sb8;
+
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 455005add5..35d6f9cdb1 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -132,9 +132,60 @@ bool PGDLLIMPORT pg_popcount_available(void)
 	return is_bit_set_in_exx(exx, ECX, 23);
  }
 
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline static bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline static bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, ECX, 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline static bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 31); /* avx512-vl */
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline static bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set_in_exx(exx, ECX, 20); /* sse4.2 */
+}
+
+/****************************************************************************/
+/*                               Public API                                 */
+/****************************************************************************/
  /*
-  * Returns true if the CPU supports the instructions required for the AVX-512
-  * pg_popcount() implementation.
+  * Returns true if the CPU supports the instructions required for the
+  * AVX-512 pg_popcount() implementation.
   *
   * PA: The call to 'osxsave_available' MUST preceed the call to
   *     'zmm_regs_available' function per NB above.
@@ -151,9 +202,17 @@ bool PGDLLIMPORT pg_popcount_avx512_available(void)
  */
 bool PGDLLIMPORT pg_crc32c_sse42_available(void)
 {
-	exx_t exx[4] = {0, 0, 0, 0};
-
-	pg_getcpuid(1, exx);
+	return sse42_available();
+}
 
-	return is_bit_set_in_exx(exx, ECX, 20);
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+inline bool
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
 }
-- 
2.43.0
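
For readers following the feature check in pg_hw_feat_check.c: the test in pg_crc32c_avx512_available() comes down to a handful of CPUID bits plus the OS-enabled register state. Below is a minimal sketch of that logic in plain C. It takes the register values as arguments instead of executing CPUID, so it can be exercised on any machine. The names cpuid_regs, bit_set and avx512_crc_cpuid_ok are hypothetical and not part of the patch. The bit positions for SSE4.2, AVX-512F, AVX-512VL and VPCLMULQDQ are the ones used in the patch; the OSXSAVE bit (leaf 1, ECX bit 27) and the XCR0 mask 0xe6 are assumptions about what osxsave_available() and zmm_regs_available() test, since those functions are not shown in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the four CPUID output registers. */
typedef struct cpuid_regs
{
	uint32_t	eax;
	uint32_t	ebx;
	uint32_t	ecx;
	uint32_t	edx;
} cpuid_regs;

/* Return true if bit number 'bit' (0..31) is set in 'reg'. */
static bool
bit_set(uint32_t reg, int bit)
{
	return ((reg >> bit) & 1u) != 0;
}

/*
 * leaf1: output of CPUID(EAX=1)
 * leaf7: output of CPUID(EAX=7, ECX=0)
 * xcr0:  output of XGETBV(0); only meaningful when OSXSAVE is set
 */
static bool
avx512_crc_cpuid_ok(cpuid_regs leaf1, cpuid_regs leaf7, uint64_t xcr0)
{
	bool		sse42 = bit_set(leaf1.ecx, 20);
	bool		osxsave = bit_set(leaf1.ecx, 27);	/* assumed, see above */
	bool		avx512f = bit_set(leaf7.ebx, 16);
	bool		avx512vl = bit_set(leaf7.ebx, 31);
	bool		vpclmulqdq = bit_set(leaf7.ecx, 10);

	/*
	 * Assumed mask: XCR0 bits 1 and 2 (SSE and AVX state) plus bits 5, 6
	 * and 7 (opmask and ZMM state) must all be enabled by the OS.
	 */
	bool		zmm_state = osxsave && (xcr0 & 0xe6) == 0xe6;

	return sse42 && osxsave && avx512f && vpclmulqdq &&
		avx512vl && zmm_state;
}
```

Checking osxsave before looking at xcr0 mirrors the ordering requirement noted in the file: XGETBV must not be executed unless OSXSAVE is reported.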

v5-0004-New-COMP_CRC32C-macro-for-AVX512-simplify-code-so.patch (application/octet-stream)
From 558da005e91b71517d788498c36d66c900366bfe Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 27 Aug 2024 08:26:19 -0700
Subject: [PATCH v5 4/6] New COMP_CRC32C macro for AVX512, simplify code some.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 configure                    |  2 +-
 configure.ac                 |  2 +-
 src/include/port/pg_crc32c.h | 17 ++++++-------
 src/port/meson.build         |  1 +
 src/port/pg_crc32c_avx512.c  | 46 ++----------------------------------
 5 files changed, 12 insertions(+), 56 deletions(-)

diff --git a/configure b/configure
index 474282e8ba..8af995e48f 100755
--- a/configure
+++ b/configure
@@ -17880,7 +17880,7 @@ else
 
 $as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
     { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
 $as_echo "AVX 512 with runtime check" >&6; }
   else
diff --git a/configure.ac b/configure.ac
index 5d7ececfbc..a8c7911754 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2197,7 +2197,7 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
 else
   if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
     AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
     AC_MSG_RESULT(AVX 512 with runtime check)
   else
     if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index ade06dbcab..3f83d9f815 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -40,12 +40,12 @@ typedef uint32 pg_crc32c;
 /* The INIT and EQ macros are the same for all implementations. */
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 #if defined(USE_SSE42_CRC32C)
 /* Use Intel SSE4.2 instructions. */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
@@ -53,7 +53,6 @@ extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t le
 /* Use Intel AVX512 instructions. */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
 
@@ -62,7 +61,6 @@ extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t l
 
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 
@@ -71,7 +69,6 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
@@ -82,14 +79,17 @@ extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_
  * they are available.
  */
 #define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+	((crc) = ((len) < 256 ? \
+		pg_comp_crc32c_sse42((crc), (data), (len)) : \
+		pg_comp_crc32c((crc), (data), (len))))
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
 
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
 
+extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
@@ -98,7 +98,6 @@ extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t l
  */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
@@ -121,13 +120,11 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
 #ifdef WORDS_BIGENDIAN
+#undef FIN_CRC32C
 #define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
-#else
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 #endif
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-
 #endif
 
 #endif							/* PG_CRC32C_H */
diff --git a/src/port/meson.build b/src/port/meson.build
index e3b05622d1..b53c33c8eb 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -86,6 +86,7 @@ replace_funcs_pos = [
   ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
   ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_sse42', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
index be42a34a73..98353f7e1d 100644
--- a/src/port/pg_crc32c_avx512.c
+++ b/src/port/pg_crc32c_avx512.c
@@ -18,48 +18,6 @@
 
 #include "port/pg_crc32c.h"
 
-/*
- * Process eight bytes of data at a time.
- *
- * NB: We do unaligned accesses here. The Intel architecture allows that,
- * and performance testing didn't show any performance gain from aligning
- * the begin address.
- */
-pg_attribute_no_sanitize_alignment()
-inline static pg_crc32c
-crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
-{
-	const unsigned char *pend = p + length;
-
-	/*
-	 * Process eight bytes of data at a time.
-	 *
-	 * NB: We do unaligned accesses here. The Intel architecture allows that,
-	 * and performance testing didn't show any performance gain from aligning
-	 * the begin address.
-	 */
-	while (p + 8 <= pend)
-	{
-		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
-		p += 8;
-	}
-
-	/* Process remaining full four bytes if any */
-	if (p + 4 <= pend)
-	{
-		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
-		p += 4;
-	}
-
-	/* Process any remaining bytes one at a time. */
-	while (p < pend)
-	{
-		crc = _mm_crc32_u8(crc, *p);
-		p++;
-	}
-
-	return crc;
-}
 
 /*******************************************************************
  * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
@@ -233,7 +191,7 @@ pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
 	}
 
 	/*
-	 * Finish any remaining bytes.
+	 * Finish any remaining bytes with the SSE4.2 implementation.
 	 */
-	return crc32c_fallback(crc, input, length);
+	return pg_comp_crc32c_sse42(crc, input, length);
 }
-- 
2.43.0
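
The patch above relies on two properties: buffers under 256 bytes go to the SSE4.2 routine, and the AVX-512 routine can hand its leftover tail to that same routine because a CRC is a running state that can be continued from any point. Here is a minimal sketch of both, assuming a slow bitwise CRC-32C as a stand-in for the real SSE4.2 and AVX-512 code. The names crc32c_ref, crc32c_small, crc32c_large and crc32c_dispatch are hypothetical and do not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
 * Slow reference only; it stands in for the hardware routines.
 */
static uint32_t
crc32c_ref(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Hypothetical stand-in for the small-buffer (SSE4.2) path. */
static uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
	return crc32c_ref(crc, data, len);
}

/* Hypothetical stand-in for the large-buffer (AVX-512) path. */
static uint32_t
crc32c_large(uint32_t crc, const void *data, size_t len)
{
	/* Bulk part: whole 64-byte blocks, as the folding loops consume. */
	size_t		bulk = len & ~(size_t) 63;

	crc = crc32c_ref(crc, data, bulk);

	/* Tail: continue from the same CRC state in the small-buffer path. */
	return crc32c_small(crc, (const uint8_t *) data + bulk, len - bulk);
}

/* Same 256-byte threshold as COMP_CRC32C, with len evaluated once. */
static inline uint32_t
crc32c_dispatch(uint32_t crc, const void *data, size_t len)
{
	return len < 256 ? crc32c_small(crc, data, len)
		: crc32c_large(crc, data, len);
}
```

Starting from 0xFFFFFFFF (as INIT_CRC32C does) and applying the final XOR (as FIN_CRC32C does), the input "123456789" gives the standard CRC-32C check value 0xE3069283. An inline function is used here rather than a macro because the conditional form of COMP_CRC32C mentions (len) twice, so an argument with side effects would be evaluated twice.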

v5-0005-use-__attribute__-target-.-for-AVX-512-stuff.patch (application/octet-stream)
From a495124ee42cb8f9f206f719b9f2235aff715963 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v5 5/6] use __attribute__((target(...))) for AVX-512 stuff

Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 config/c-compiler.m4          |  60 ++++++-------
 configure                     | 163 ++++++++--------------------------
 configure.ac                  |  17 +---
 meson.build                   |  17 +---
 src/Makefile.global.in        |   5 --
 src/include/c.h               |  10 +++
 src/makefiles/meson.build     |   4 +-
 src/port/Makefile             |   7 +-
 src/port/meson.build          |   6 +-
 src/port/pg_popcount_avx512.c |  86 +++++++++++++++++-
 10 files changed, 171 insertions(+), 204 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 1d33932cb5..33df694ae7 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -748,20 +748,20 @@ undefine([Ac_cachevar])dnl
 # Check if the compiler supports the XSAVE instructions using the _xgetbv
 # intrinsic function.
 #
-# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
-# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+# If the intrinsics are supported, sets pgac_xsave_intrinsics.
 AC_DEFUN([PGAC_XSAVE_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
-  [return _xgetbv(0) & 0xe0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl
+AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    __attribute__((target("xsave")))
+    static int xsave_test(void)
+    {
+      return _xgetbv(0) & 0xe0;
+    }],
+  [return xsave_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_XSAVE="$1"
   pgac_xsave_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
@@ -773,29 +773,27 @@ undefine([Ac_cachevar])dnl
 # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
 # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
 #
-# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
-# -mavx512bw).  If the intrinsics are supported, sets
-# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics.
 AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
-  [const char buf@<:@sizeof(__m512i)@:>@;
-   PG_INT64_TYPE popcnt = 0;
-   __m512i accum = _mm512_setzero_si512();
-   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
-   const __m512i cnt = _mm512_popcnt_epi64(val);
-   accum = _mm512_add_epi64(accum, cnt);
-   popcnt = _mm512_reduce_add_epi64(accum);
-   /* return computed value, to prevent the above being optimized away */
-   return popcnt == 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    __attribute__((target("avx512vpopcntdq","avx512bw")))
+    static int popcount_test(void)
+    {
+      const char buf@<:@sizeof(__m512i)@:>@;
+      PG_INT64_TYPE popcnt = 0;
+      __m512i accum = _mm512_setzero_si512();
+      const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+      const __m512i cnt = _mm512_popcnt_epi64(val);
+      accum = _mm512_add_epi64(accum, cnt);
+      popcnt = _mm512_reduce_add_epi64(accum);
+      return (int) popcnt;
+    }],
+  [return popcount_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_POPCNT="$1"
   pgac_avx512_popcnt_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
diff --git a/configure b/configure
index 8af995e48f..38e7b1889b 100755
--- a/configure
+++ b/configure
@@ -647,9 +647,6 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
-PG_POPCNT_OBJS
-CFLAGS_POPCNT
-CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17270,185 +17267,99 @@ fi
 
 # Check for XSAVE intrinsics
 #
-CFLAGS_XSAVE=""
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
-if ${pgac_cv_xsave_intrinsics_+:} false; then :
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv" >&5
+$as_echo_n "checking for _xgetbv... " >&6; }
+if ${pgac_cv_xsave_intrinsics+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <immintrin.h>
-int
-main ()
-{
-return _xgetbv(0) & 0xe0;
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_xsave_intrinsics_=yes
-else
-  pgac_cv_xsave_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
-$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
-if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
-  CFLAGS_XSAVE=""
-  pgac_xsave_intrinsics=yes
-fi
-
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
-if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mxsave"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 #include <immintrin.h>
+    __attribute__((target("xsave")))
+    static int xsave_test(void)
+    {
+      return _xgetbv(0) & 0xe0;
+    }
 int
 main ()
 {
-return _xgetbv(0) & 0xe0;
+return xsave_test();
   ;
   return 0;
 }
 _ACEOF
 if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_xsave_intrinsics__mxsave=yes
+  pgac_cv_xsave_intrinsics=yes
 else
-  pgac_cv_xsave_intrinsics__mxsave=no
+  pgac_cv_xsave_intrinsics=no
 fi
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
-$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
-if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
-  CFLAGS_XSAVE="-mxsave"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics" >&5
+$as_echo "$pgac_cv_xsave_intrinsics" >&6; }
+if test x"$pgac_cv_xsave_intrinsics" = x"yes"; then
   pgac_xsave_intrinsics=yes
 fi
 
-fi
 if test x"$pgac_xsave_intrinsics" = x"yes"; then
 
 $as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
 
 fi
 
-
 # Check for AVX-512 popcount intrinsics
 #
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
 if test x"$host_cpu" = x"x86_64"; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <immintrin.h>
-int
-main ()
-{
-const char buf[sizeof(__m512i)];
-   PG_INT64_TYPE popcnt = 0;
-   __m512i accum = _mm512_setzero_si512();
-   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
-   const __m512i cnt = _mm512_popcnt_epi64(val);
-   accum = _mm512_add_epi64(accum, cnt);
-   popcnt = _mm512_reduce_add_epi64(accum);
-   /* return computed value, to prevent the above being optimized away */
-   return popcnt == 0;
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_avx512_popcnt_intrinsics_=yes
-else
-  pgac_cv_avx512_popcnt_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
-  CFLAGS_POPCNT=""
-  pgac_avx512_popcnt_intrinsics=yes
-fi
-
-  if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
-    { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 #include <immintrin.h>
+    __attribute__((target("avx512vpopcntdq","avx512bw")))
+    static int popcount_test(void)
+    {
+      const char buf[sizeof(__m512i)];
+      PG_INT64_TYPE popcnt = 0;
+      __m512i accum = _mm512_setzero_si512();
+      const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+      const __m512i cnt = _mm512_popcnt_epi64(val);
+      accum = _mm512_add_epi64(accum, cnt);
+      popcnt = _mm512_reduce_add_epi64(accum);
+      return (int) popcnt;
+    }
 int
 main ()
 {
-const char buf[sizeof(__m512i)];
-   PG_INT64_TYPE popcnt = 0;
-   __m512i accum = _mm512_setzero_si512();
-   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
-   const __m512i cnt = _mm512_popcnt_epi64(val);
-   accum = _mm512_add_epi64(accum, cnt);
-   popcnt = _mm512_reduce_add_epi64(accum);
-   /* return computed value, to prevent the above being optimized away */
-   return popcnt == 0;
+return popcount_test();
   ;
   return 0;
 }
 _ACEOF
 if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes
+  pgac_cv_avx512_popcnt_intrinsics=yes
 else
-  pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no
+  pgac_cv_avx512_popcnt_intrinsics=no
 fi
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" = x"yes"; then
-  CFLAGS_POPCNT="-mavx512vpopcntdq -mavx512bw"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics" = x"yes"; then
   pgac_avx512_popcnt_intrinsics=yes
 fi
 
-  fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o"
 
 $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
   fi
 fi
 
-
-
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 # First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index a8c7911754..70c78d11fa 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,32 +2052,19 @@ fi
 
 # Check for XSAVE intrinsics
 #
-CFLAGS_XSAVE=""
-PGAC_XSAVE_INTRINSICS([])
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
-  PGAC_XSAVE_INTRINSICS([-mxsave])
-fi
+PGAC_XSAVE_INTRINSICS()
 if test x"$pgac_xsave_intrinsics" = x"yes"; then
   AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
 fi
-AC_SUBST(CFLAGS_XSAVE)
 
 # Check for AVX-512 popcount intrinsics
 #
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
 if test x"$host_cpu" = x"x86_64"; then
-  PGAC_AVX512_POPCNT_INTRINSICS([])
-  if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
-    PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
-  fi
+  PGAC_AVX512_POPCNT_INTRINSICS()
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o"
     AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
   fi
 fi
-AC_SUBST(CFLAGS_POPCNT)
-AC_SUBST(PG_POPCNT_OBJS)
 
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
diff --git a/meson.build b/meson.build
index fab6373fef..aefb64c094 100644
--- a/meson.build
+++ b/meson.build
@@ -2157,25 +2157,20 @@ endforeach
 # Check for the availability of XSAVE intrinsics.
 ###############################################################
 
-cflags_xsave = []
 if host_cpu == 'x86' or host_cpu == 'x86_64'
 
   prog = '''
 #include <immintrin.h>
 
+__attribute__((target("xsave")))
 int main(void)
 {
     return _xgetbv(0) & 0xe0;
 }
 '''
 
-  if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
-        args: test_c_args)
+  if cc.links(prog, name: 'XSAVE intrinsics', args: test_c_args)
     cdata.set('HAVE_XSAVE_INTRINSICS', 1)
-  elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
-        args: test_c_args + ['-mxsave'])
-    cdata.set('HAVE_XSAVE_INTRINSICS', 1)
-    cflags_xsave += '-mxsave'
   endif
 
 endif
@@ -2185,12 +2180,12 @@ endif
 # Check for the availability of AVX-512 popcount intrinsics.
 ###############################################################
 
-cflags_popcnt = []
 if host_cpu == 'x86_64'
 
   prog = '''
 #include <immintrin.h>
 
+__attribute__((target("avx512vpopcntdq","avx512bw")))
 int main(void)
 {
     const char buf[sizeof(__m512i)];
@@ -2205,13 +2200,9 @@ int main(void)
 }
 '''
 
-  if cc.links(prog, name: 'AVX-512 popcount without -mavx512vpopcntdq -mavx512bw',
+  if cc.links(prog, name: 'AVX-512 popcount',
         args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
     cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
-  elif cc.links(prog, name: 'AVX-512 popcount with -mavx512vpopcntdq -mavx512bw',
-        args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'] + ['-mavx512bw'])
-    cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
-    cflags_popcnt += ['-mavx512vpopcntdq'] + ['-mavx512bw']
   endif
 
 endif
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 42f50b4976..45696247e9 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,9 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
 CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
 CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
 CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
-CFLAGS_POPCNT = @CFLAGS_POPCNT@
 CFLAGS_CRC = @CFLAGS_CRC@
-CFLAGS_XSAVE = @CFLAGS_XSAVE@
 PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
 PERMIT_MISSING_VARIABLE_DECLARATIONS = @PERMIT_MISSING_VARIABLE_DECLARATIONS@
 CXXFLAGS = @CXXFLAGS@
@@ -762,9 +760,6 @@ LIBOBJS = @LIBOBJS@
 # files needed for the chosen CRC-32C implementation
 PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
 
-# files needed for the chosen popcount implementation
-PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
-
 LIBS := -lpgcommon -lpgport $(LIBS)
 
 # to make ws2_32.lib the last library
diff --git a/src/include/c.h b/src/include/c.h
index 55dec71a6d..6f5ca25542 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -174,6 +174,16 @@
 #define pg_attribute_nonnull(...)
 #endif
 
+/*
+ * pg_attribute_target allows specifying different target options that the
+ * function should be compiled with (e.g., for using special CPU instructions).
+ */
+#if __has_attribute (target)
+#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
+#else
+#define pg_attribute_target(...)
+#endif
+
 /*
  * Append PG_USED_FOR_ASSERTS_ONLY to definitions of variables that are only
  * used in assert-enabled builds, to avoid compiler warnings about unused
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 850e927584..479aa08420 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -102,10 +102,8 @@ pgxs_kv = {
     ' '.join(cflags_no_missing_var_decls),
 
   'CFLAGS_CRC': ' '.join(cflags_crc),
-  'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
   'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
   'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
-  'CFLAGS_XSAVE': ' '.join(cflags_xsave),
 
   'LDFLAGS': var_ldflags,
   'LDFLAGS_EX': var_ldflags_ex,
@@ -181,7 +179,7 @@ pgxs_empty = [
   'WANTED_LANGUAGES',
 
   # Not needed because we don't build the server / PLs with the generated makefile
-  'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
+  'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
   'DTRACEFLAGS', # only server has dtrace probes
 
   'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index b72deed50e..42c02f1b3d 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,7 +38,6 @@ LIBS += $(PTHREAD_LIBS)
 OBJS = \
 	$(LIBOBJS) \
 	$(PG_CRC32C_OBJS) \
-	$(PG_POPCNT_OBJS) \
 	bsearch_arg.o \
 	chklocale.o \
 	inet_net_ntop.o \
@@ -46,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_hw_feat_check.o \
+	pg_popcount_avx512.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
@@ -103,11 +103,6 @@ pg_hw_feat_check.o:	CFLAGS+=$(CFLAGS_XSAVE)
 pg_hw_feat_check_shlib.o:	CFLAGS+=$(CFLAGS_XSAVE)
 pg_hw_feat_check_srv.o:	CFLAGS+=$(CFLAGS_XSAVE)
 
-# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
-pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
-
 #
 # Shared library versions of object files
 #
diff --git a/src/port/meson.build b/src/port/meson.build
index b53c33c8eb..3f17cd2f8d 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
@@ -89,7 +90,6 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
 
   # arm / aarch64
@@ -105,8 +105,8 @@ replace_funcs_pos = [
   ['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
 ]
 
-pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
-pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
+pgport_cflags = {'crc': cflags_crc}
+pgport_sources_cflags = {'crc': []}
 
 foreach f : replace_funcs_neg
   func = f.get(0)
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 9d3149e2d0..b598e86554 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -12,7 +12,17 @@
  */
 #include "c.h"
 
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 #include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
 
 #include "port/pg_bitutils.h"
 
@@ -21,12 +31,82 @@
  * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
  * the function pointers that are only used when TRY_POPCNT_FAST is set.
  */
-#ifdef TRY_POPCNT_FAST
+#if defined(TRY_POPCNT_FAST) && defined(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK)
+
+/*
+ * Does CPUID say there's support for XSAVE instructions?
+ */
+static inline bool
+xsave_available(void)
+{
+	unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that xsave_available() returns true
+ * before calling this.
+ */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+static inline bool
+avx512_popcnt_available(void)
+{
+	unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
+		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+	return xsave_available() &&
+		zmm_regs_available() &&
+		avx512_popcnt_available();
+}
 
 /*
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
  */
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
@@ -82,6 +162,7 @@ pg_popcount_avx512(const char *buf, int bytes)
  * pg_popcount_masked_avx512
  *		Returns the number of 1-bits in buf after applying the mask to each byte
  */
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
 uint64
 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
 {
@@ -138,4 +219,5 @@ pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
 	return _mm512_reduce_add_epi64(accum);
 }
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_FAST &&
+								 * USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.43.0
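
For readers following along, the pattern the patch above relies on (a per-function target attribute plus a CPUID check at run time, instead of per-file -m flags) can be sketched in isolation. This is only a minimal illustration, not code from the patch: it swaps in the scalar POPCNT instruction for the AVX-512 one so that it builds with older compilers, and every sketch_* name is hypothetical. It assumes GCC or clang on x86-64 and falls back to portable code elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#include <immintrin.h>
#define SKETCH_HAVE_X86_POPCNT 1	/* hypothetical macro, not from the patch */
#endif

/* Portable fallback: clear the lowest set bit until the byte is zero. */
static uint64_t
sketch_popcount_portable(const unsigned char *buf, size_t bytes)
{
	uint64_t	count = 0;

	for (size_t i = 0; i < bytes; i++)
	{
		unsigned char b = buf[i];

		while (b)
		{
			b &= (unsigned char) (b - 1);
			count++;
		}
	}
	return count;
}

#ifdef SKETCH_HAVE_X86_POPCNT
/* Does CPUID leaf 1 report the POPCNT instruction (ECX bit 23)? */
static bool
sketch_popcnt_available(void)
{
	unsigned int exx[4] = {0, 0, 0, 0};

	if (!__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]))
		return false;
	return (exx[2] & (1u << 23)) != 0;
}

/*
 * Only this function is compiled with POPCNT enabled; the rest of the file
 * keeps the default target, so no extra -m flag is needed for the file.
 */
__attribute__((target("popcnt")))
static uint64_t
sketch_popcount_hw(const unsigned char *buf, size_t bytes)
{
	uint64_t	count = 0;

	while (bytes >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, 8);	/* avoids unaligned access */
		count += (uint64_t) _mm_popcnt_u64(word);
		buf += 8;
		bytes -= 8;
	}
	/* Count the remaining 0..7 tail bytes with the portable loop. */
	return count + sketch_popcount_portable(buf, bytes);
}
#endif

/* Entry point: use the hardware path only if the CPU reports support. */
uint64_t
sketch_popcount(const void *buf, size_t bytes)
{
	assert(buf != NULL || bytes == 0);
#ifdef SKETCH_HAVE_X86_POPCNT
	if (sketch_popcnt_available())
		return sketch_popcount_hw(buf, bytes);
#endif
	return sketch_popcount_portable(buf, bytes);
}
```

The sketch repeats the CPUID query on every call to stay short; a real implementation would presumably cache the answer, for example behind a function pointer.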

Attachment: v5-0006-Use-__attribute__-target-.-for-SSE42-and-AVX512-C.patch (application/octet-stream)
From 8603b8c005e61857530ec72f78aa7ede36d3b981 Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Date: Mon, 21 Oct 2024 14:26:22 -0700
Subject: [PATCH v5 6/6]  Use __attribute__(target(...)) for SSE42 and AVX512
 CRC32C

---
 config/c-compiler.m4                          |  88 ++---
 configure                                     | 350 ++++++------------
 configure.ac                                  | 130 +++----
 meson.build                                   |  30 +-
 src/include/pg_config.h.in                    |   6 +-
 src/include/pg_cpu.h                          |  23 ++
 src/include/port/pg_crc32c.h                  |  71 +---
 src/port/Makefile                             |  10 -
 src/port/meson.build                          |  22 +-
 src/port/pg_crc32c_avx512.c                   |   5 +
 src/port/pg_crc32c_avx512_choose.c            |  42 ---
 src/port/pg_crc32c_sse42.c                    |   4 +
 ..._sse42_choose.c => pg_crc32c_x86_choose.c} |  27 +-
 src/port/pg_hw_feat_check.c                   |   3 +
 src/port/pg_popcount_avx512.c                 |  78 +---
 15 files changed, 297 insertions(+), 592 deletions(-)
 create mode 100644 src/include/pg_cpu.h
 delete mode 100644 src/port/pg_crc32c_avx512_choose.c
 rename src/port/{pg_crc32c_sse42_choose.c => pg_crc32c_x86_choose.c} (58%)
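
As a rough illustration of what a single x86 "choose" step with target attributes could look like, here is a minimal sketch. It assumes GCC or clang on x86-64, all sketch_* names are hypothetical, and it covers only the SSE 4.2 and software paths, with a simple bitwise loop standing in for slicing-by-8. It is not the code in pg_crc32c_x86_choose.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#include <nmmintrin.h>
#define SKETCH_HAVE_SSE42 1		/* hypothetical macro, not from the patch */
#endif

typedef uint32_t (*sketch_crc_fn) (uint32_t crc, const void *data, size_t len);

/* Software fallback: bit-at-a-time CRC-32C (reflected polynomial). */
static uint32_t
sketch_crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

#ifdef SKETCH_HAVE_SSE42
/* Only this function is compiled with SSE 4.2 enabled. */
__attribute__((target("sse4.2")))
static uint32_t
sketch_crc32c_sse42(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
		crc = _mm_crc32_u8(crc, *p++);
	return crc;
}
#endif

static uint32_t sketch_crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Starts at the chooser, which replaces itself on the first call. */
static sketch_crc_fn sketch_crc32c_impl = sketch_crc32c_choose;

static uint32_t
sketch_crc32c_choose(uint32_t crc, const void *data, size_t len)
{
	sketch_crc32c_impl = sketch_crc32c_bitwise;
#ifdef SKETCH_HAVE_SSE42
	{
		unsigned int exx[4] = {0, 0, 0, 0};

		/* CPUID leaf 1, ECX bit 20 reports SSE 4.2. */
		if (__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]) &&
			(exx[2] & (1u << 20)) != 0)
			sketch_crc32c_impl = sketch_crc32c_sse42;
	}
#endif
	return sketch_crc32c_impl(crc, data, len);
}

/* Full CRC-32C of a buffer: initial value and final XOR are both all-ones. */
uint32_t
sketch_crc32c(const void *data, size_t len)
{
	assert(data != NULL || len == 0);
	return sketch_crc32c_impl(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

An AVX-512 path would slot in as one more branch in the chooser, guarded by its own CPUID and XGETBV checks and, per the numbers at the top of the thread, by a minimum buffer length of 256 bytes.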

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 33df694ae7..d7b3ceeb60 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -608,21 +608,22 @@ fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
 # An optional compiler flag can be passed as argument (e.g. -msse4.2). If the
 # intrinsics are supported, sets pgac_sse42_crc32_intrinsics, and CFLAGS_CRC.
 AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>],
-  [unsigned int crc = 0;
-   crc = _mm_crc32_u8(crc, 0);
-   crc = _mm_crc32_u32(crc, 0);
-   /* return computed value, to prevent the above being optimized away */
-   return crc == 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32 with function attribute], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>
+    __attribute__((target("sse4.2")))
+    static int crc32_sse42_test(void)
+    {
+      unsigned int crc = 0;
+      crc = _mm_crc32_u8(crc, 0);
+      crc = _mm_crc32_u32(crc, 0);
+      /* return computed value, to prevent the above being optimized away */
+      return crc == 0;
+    }],
+  [return crc32_sse42_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_CRC="$1"
   pgac_sse42_crc32_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
@@ -639,44 +640,45 @@ undefine([Ac_cachevar])dnl
 # An optional compiler flag can be passed as arguments (e.g. -msse4.2
 # -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
 # pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+
 AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
-  [const unsigned long k1k2[[8]] = {
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
-  unsigned char buffer[[512]];
-  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
-  unsigned long val;
-  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
-  __m128i a1, a2;
-  unsigned int crc = 0xffffffff;
-  y8 = _mm512_load_si512((__m512i *)aligned);
-  x0 = _mm512_loadu_si512((__m512i *)k1k2);
-  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
-  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
-  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
-  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
-  a1 = _mm512_extracti32x4_epi32(x1, 3);
-  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
-  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
-  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
-  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
-  return crc != 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128 with function attribute], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    __attribute__((target("avx512f","avx512vl","vpclmulqdq")))
+    static int crc32_avx512_test(void)
+    {
+      const unsigned long k1k2[[8]] = {
+      0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+      0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+      unsigned char buffer[[512]];
+      unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+      unsigned long val;
+      __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+      __m128i a1, a2;
+      unsigned int crc = 0xffffffff;
+      y8 = _mm512_load_si512((__m512i *)aligned);
+      x0 = _mm512_loadu_si512((__m512i *)k1k2);
+      x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+      x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+      x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+      x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+      a1 = _mm512_extracti32x4_epi32(x1, 3);
+      a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+      x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+      val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+      crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+      return crc != 0;
+    }],
+  [return crc32_avx512_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_CRC="$1"
   pgac_avx512_crc32_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_CRC32_INTRINSICS
 
-
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
 # Check if the compiler supports the CRC32C instructions using the __crc32cb,
diff --git a/configure b/configure
index 38e7b1889b..99bbeaf5c5 100755
--- a/configure
+++ b/configure
@@ -14728,7 +14728,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14774,7 +14774,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14798,7 +14798,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14843,7 +14843,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14867,7 +14867,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17360,206 +17360,111 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
-#
-# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
-# with the default compiler flags. If not, check if adding the -msse4.2
-# flag helps. CFLAGS_CRC is set to -msse4.2 if that's required.
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=" >&5
-$as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=... " >&6; }
-if ${pgac_cv_sse42_crc32_intrinsics_+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <nmmintrin.h>
-int
-main ()
-{
-unsigned int crc = 0;
-   crc = _mm_crc32_u8(crc, 0);
-   crc = _mm_crc32_u32(crc, 0);
-   /* return computed value, to prevent the above being optimized away */
-   return crc == 0;
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_sse42_crc32_intrinsics_=yes
-else
-  pgac_cv_sse42_crc32_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_crc32_intrinsics_" >&5
-$as_echo "$pgac_cv_sse42_crc32_intrinsics_" >&6; }
-if test x"$pgac_cv_sse42_crc32_intrinsics_" = x"yes"; then
-  CFLAGS_CRC=""
-  pgac_sse42_crc32_intrinsics=yes
-fi
-
-if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=-msse4.2" >&5
-$as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=-msse4.2... " >&6; }
-if ${pgac_cv_sse42_crc32_intrinsics__msse4_2+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -msse4.2"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <nmmintrin.h>
-int
-main ()
-{
-unsigned int crc = 0;
-   crc = _mm_crc32_u8(crc, 0);
-   crc = _mm_crc32_u32(crc, 0);
-   /* return computed value, to prevent the above being optimized away */
-   return crc == 0;
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_sse42_crc32_intrinsics__msse4_2=yes
-else
-  pgac_cv_sse42_crc32_intrinsics__msse4_2=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_crc32_intrinsics__msse4_2" >&5
-$as_echo "$pgac_cv_sse42_crc32_intrinsics__msse4_2" >&6; }
-if test x"$pgac_cv_sse42_crc32_intrinsics__msse4_2" = x"yes"; then
-  CFLAGS_CRC="-msse4.2"
-  pgac_sse42_crc32_intrinsics=yes
-fi
-
-fi
-
 # Check for Intel AVX-512 intrinsics to do CRC calculations.
 #
 # First check if the _mm512_clmulepi64_epi128 and more intrinsics can
 # be used with the default compiler flags. If not, check if adding
-# the -msse4.2, -mavx512vl and -mvpclmulqdqif flag helps. CFLAGS_CRC
-# is set to -msse4.2, -mavx512vl and -mvpclmulqdqif that's required.
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
-$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
-if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128 with function attribute" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128 with function attribute... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 #include <immintrin.h>
+    __attribute__((target("avx512f","avx512vl","vpclmulqdq")))
+    static int crc32_avx512_test(void)
+    {
+      const unsigned long k1k2[8] = {
+      0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+      0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+      unsigned char buffer[512];
+      unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+      unsigned long val;
+      __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+      __m128i a1, a2;
+      unsigned int crc = 0xffffffff;
+      y8 = _mm512_load_si512((__m512i *)aligned);
+      x0 = _mm512_loadu_si512((__m512i *)k1k2);
+      x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+      x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+      x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+      x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+      a1 = _mm512_extracti32x4_epi32(x1, 3);
+      a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+      x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+      val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+      crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+      return crc != 0;
+    }
 int
 main ()
 {
-const unsigned long k1k2[8] = {
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
-  unsigned char buffer[512];
-  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
-  unsigned long val;
-  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
-  __m128i a1, a2;
-  unsigned int crc = 0xffffffff;
-  y8 = _mm512_load_si512((__m512i *)aligned);
-  x0 = _mm512_loadu_si512((__m512i *)k1k2);
-  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
-  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
-  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
-  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
-  a1 = _mm512_extracti32x4_epi32(x1, 3);
-  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
-  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
-  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
-  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
-  return crc != 0;
+return crc32_avx512_test();
   ;
   return 0;
 }
 _ACEOF
 if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_avx512_crc32_intrinsics_=yes
+  pgac_cv_avx512_crc32_intrinsics=yes
 else
-  pgac_cv_avx512_crc32_intrinsics_=no
+  pgac_cv_avx512_crc32_intrinsics=no
 fi
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
-$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
-if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
-  CFLAGS_CRC=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics" = x"yes"; then
   pgac_avx512_crc32_intrinsics=yes
 fi
 
-if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
-$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
-if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+
+# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+#
+# Check if the _mm_crc32_u8 and _mm_crc32_u32 intrinsics can be used in a
+# function marked with __attribute__((target("sse4.2"))), without adding
+# any compiler flags.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32 with function attribute" >&5
+$as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32 with function attribute... " >&6; }
+if ${pgac_cv_sse42_crc32_intrinsics+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
-#include <immintrin.h>
+#include <nmmintrin.h>
+    __attribute__((target("sse4.2")))
+    static int crc32_sse42_test(void)
+    {
+      unsigned int crc = 0;
+      crc = _mm_crc32_u8(crc, 0);
+      crc = _mm_crc32_u32(crc, 0);
+      /* return computed value, to prevent the above being optimized away */
+      return crc == 0;
+    }
 int
 main ()
 {
-const unsigned long k1k2[8] = {
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
-  unsigned char buffer[512];
-  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
-  unsigned long val;
-  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
-  __m128i a1, a2;
-  unsigned int crc = 0xffffffff;
-  y8 = _mm512_load_si512((__m512i *)aligned);
-  x0 = _mm512_loadu_si512((__m512i *)k1k2);
-  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
-  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
-  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
-  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
-  a1 = _mm512_extracti32x4_epi32(x1, 3);
-  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
-  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
-  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
-  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
-  return crc != 0;
+return crc32_sse42_test();
   ;
   return 0;
 }
 _ACEOF
 if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+  pgac_cv_sse42_crc32_intrinsics=yes
 else
-  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+  pgac_cv_sse42_crc32_intrinsics=no
 fi
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
-$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
-if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
-  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
-  pgac_avx512_crc32_intrinsics=yes
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_crc32_intrinsics" >&5
+$as_echo "$pgac_cv_sse42_crc32_intrinsics" >&6; }
+if test x"$pgac_cv_sse42_crc32_intrinsics" = x"yes"; then
+  pgac_sse42_crc32_intrinsics=yes
 fi
 
-fi
 
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
@@ -17714,6 +17619,7 @@ fi
 
 
 
+
 # Select CRC-32C implementation.
 #
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
@@ -17733,108 +17639,72 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel AVX 512 if available.
-  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
-    USE_AVX512_CRC32C=1
-  else
-   # Use Intel SSE 4.2 if available.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-      USE_SSE42_CRC32C=1
-    else
-      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
-      # the runtime check.
-      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
-      else
-        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-        # the runtime check.
-        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # Use ARM CRC Extension if available.
-          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-            USE_ARMV8_CRC32C=1
-          else
-            # ARM CRC Extension, with runtime check?
-            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-            else
-              # LoongArch CRCC instructions.
-              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-                USE_LOONGARCH_CRC32C=1
-              else
-                # fall back to slicing-by-8 algorithm, which doesn't require any
-                # special CPU support.
-                USE_SLICING_BY_8_CRC32C=1
-              fi
-            fi
-          fi
-        fi
-      fi
-    fi
-  fi
-fi
-
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking which CRC-32C implementation to use" >&5
 $as_echo_n "checking which CRC-32C implementation to use... " >&6; }
-if test x"$USE_SSE42_CRC32C" = x"1"; then
+if test x"$host_cpu" = x"x86_64"; then
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
 
 $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
 
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
-$as_echo "SSE 4.2" >&6; }
-else
-  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C baseline feature SSE 4.2" >&5
+$as_echo "CRC32C baseline feature SSE 4.2" >&6; }
+    else
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
 
-$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+$as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
-$as_echo "AVX 512 with runtime check" >&6; }
-  else
-    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C SSE42 with runtime check" >&5
+$as_echo "CRC32C SSE42 with runtime check" >&6; }
+        fi
+    fi
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
 
-$as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
-$as_echo "SSE 4.2 with runtime check" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+      PG_CRC32C_OBJS+=" pg_crc32c_avx512.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C AVX-512 with runtime check" >&5
+$as_echo "CRC32C AVX-512 with runtime check" >&6; }
+    fi
+else
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-      else
-        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  else
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-        else
-          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+    else
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-          else
+      else
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
-          fi
-        fi
       fi
     fi
   fi
diff --git a/configure.ac b/configure.ac
index 70c78d11fa..c2d516adae 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2066,26 +2066,19 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps.
+PGAC_AVX512_CRC32_INTRINSICS()
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 # First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
 # with the default compiler flags. If not, check if adding the -msse4.2
 # flag helps. CFLAGS_CRC is set to -msse4.2 if that's required.
-PGAC_SSE42_CRC32_INTRINSICS([])
-if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
-  PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
-fi
-
-# Check for Intel AVX-512 intrinsics to do CRC calculations.
-#
-# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
-# be used with the default compiler flags. If not, check if adding
-# the -msse4.2, -mavx512vl and -mvpclmulqdqif flag helps. CFLAGS_CRC
-# is set to -msse4.2, -mavx512vl and -mvpclmulqdqif that's required.
-PGAC_AVX512_CRC32_INTRINSICS([])
-if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
-  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
-fi
+PGAC_SSE42_CRC32_INTRINSICS()
 
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
@@ -2113,6 +2106,7 @@ PGAC_LOONGARCH_CRC32C_INTRINSICS()
 
 AC_SUBST(CFLAGS_CRC)
 
+
 # Select CRC-32C implementation.
 #
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
@@ -2132,86 +2126,50 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel AVX 512 if available.
-  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
-    USE_AVX512_CRC32C=1
-  else
-   # Use Intel SSE 4.2 if available.
+AC_MSG_CHECKING([which CRC-32C implementation to use])
+if test x"$host_cpu" = x"x86_64"; then
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
     if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-      USE_SSE42_CRC32C=1
+      AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
+      PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+      AC_MSG_RESULT(CRC32C baseline feature SSE 4.2)
     else
-      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
-      # the runtime check.
-      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
-      else
-        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-        # the runtime check.
         if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # Use ARM CRC Extension if available.
-          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-            USE_ARMV8_CRC32C=1
-          else
-            # ARM CRC Extension, with runtime check?
-            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-            else
-              # LoongArch CRCC instructions.
-              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-                USE_LOONGARCH_CRC32C=1
-              else
-                # fall back to slicing-by-8 algorithm, which doesn't require any
-                # special CPU support.
-                USE_SLICING_BY_8_CRC32C=1
-              fi
-            fi
-          fi
+          AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+          PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+          AC_MSG_RESULT(CRC32C SSE42 with runtime check)
         fi
-      fi
     fi
-  fi
-fi
-
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
-AC_MSG_CHECKING([which CRC-32C implementation to use])
-if test x"$USE_SSE42_CRC32C" = x"1"; then
-  AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  AC_MSG_RESULT(SSE 4.2)
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+      AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS+=" pg_crc32c_avx512.o"
+      AC_MSG_RESULT(CRC32C AVX-512 with runtime check)
+    fi
 else
-  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
-    AC_MSG_RESULT(AVX 512 with runtime check)
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+    AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    AC_MSG_RESULT(ARMv8 CRC instructions)
   else
-    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-      AC_MSG_RESULT(SSE 4.2 with runtime check)
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+      AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions)
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+        AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        AC_MSG_RESULT(LoongArch CRCC instructions)
       else
-        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
-        else
-          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-            AC_MSG_RESULT(LoongArch CRCC instructions)
-          else
-            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-            AC_MSG_RESULT(slicing-by-8)
-          fi
-        fi
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
+        AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        AC_MSG_RESULT(slicing-by-8)
       fi
     fi
   fi
diff --git a/meson.build b/meson.build
index aefb64c094..5ec7975108 100644
--- a/meson.build
+++ b/meson.build
@@ -2233,9 +2233,10 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
-    avx_prog = '''
+    avx512_crc_prog = '''
 #include <immintrin.h>
 
+__attribute__((target("avx512vl","vpclmulqdq")))
 int main(void)
 {
   const unsigned long k1k2[8] = {
@@ -2262,9 +2263,12 @@ int main(void)
 }
 '''
 
-    prog = '''
+    sse42_crc_prog = '''
 #include <nmmintrin.h>
 
+#ifdef TEST_SSE42_WITH_ATTRIBUTE
+__attribute__((target("sse4.2")))
+#endif
 int main(void)
 {
     unsigned int crc = 0;
@@ -2274,29 +2278,25 @@ int main(void)
     return crc == 0;
 }
 '''
-
-    if cc.links(avx_prog,
-          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
-          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
-      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
-      cdata.set('USE_AVX512_CRC32C', false)
-      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
-      have_optimized_crc = true
-    endif
-    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(sse42_crc_prog, name: 'CRC32C baseline feature SSE4.2 ',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
-          args: test_c_args + ['-msse4.2'])
+    elif cc.links(sse42_crc_prog, name: 'SSE4.2 CRC32C with function attributes',
+          args: test_c_args + ['-DTEST_SSE42_WITH_ATTRIBUTE'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
-      cflags_crc += '-msse4.2'
       cdata.set('USE_SSE42_CRC32C', false)
       cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
       have_optimized_crc = true
     endif
+    if cc.links(avx512_crc_prog,
+          name: 'AVX512 CRC32C with function attributes',
+          args: test_c_args)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
 
   endif
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 65623df7f9..2c9278329b 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -663,6 +663,9 @@
 /* Define to 1 to build with assertion checks. (--enable-cassert) */
 #undef USE_ASSERT_CHECKING
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
@@ -712,9 +715,6 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
-/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
-#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
-
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/pg_cpu.h b/src/include/pg_cpu.h
new file mode 100644
index 0000000000..223994cb0d
--- /dev/null
+++ b/src/include/pg_cpu.h
@@ -0,0 +1,23 @@
+/*
+ * pg_cpu.h
+ *      Useful macros to determine CPU types
+ */
+
+#ifndef PG_CPU_H_
+#define PG_CPU_H_
+#if defined( __i386__ ) || defined(i386) || defined(_M_IX86)
+    /*
+     * __i386__ is defined by gcc and Intel compiler on Linux,
+     * _M_IX86 by VS compiler,
+     * i386 by Sun compilers on opensolaris at least
+     */
+    #define PG_CPU_X86
+#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64)
+    /*
+     * both __x86_64__ and __amd64__ are defined by gcc
+     * __x86_64 defined by sun compiler on opensolaris at least
+     * _M_AMD64 defined by MS compiler
+     */
+    #define PG_CPU_x86_64
+#endif
+#endif /* PG_CPU_H_ */
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 3f83d9f815..935c089eb6 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -33,6 +33,7 @@
 #ifndef PG_CRC32C_H
 #define PG_CRC32C_H
 
+#include "pg_cpu.h"
 #include "port/pg_bswap.h"
 
 typedef uint32 pg_crc32c;
@@ -42,73 +43,35 @@ typedef uint32 pg_crc32c;
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
 #define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
-#if defined(USE_SSE42_CRC32C)
-/* Use Intel SSE4.2 instructions. */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
-
+/* x86 */
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined (USE_AVX512_CRC32)
-/* Use Intel AVX512 instructions. */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
-
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* ARMV8 */
 #elif defined(USE_ARMV8_CRC32C)
-/* Use ARMv8 CRC Extension instructions. */
-
+extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
 
+/* ARMV8 with runtime check */
+#elif defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* LoongArch */
 #elif defined(USE_LOONGARCH_CRC32C)
-/* Use LoongArch CRCC instructions. */
-
+extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
 
-extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
-
-/*
- * Use Intel AVX-512 instructions, but perform a runtime check first to check that
- * they are available.
- */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = ((len) < 256 ? \
-		pg_comp_crc32c_sse42((crc), (data), (len)) : \
-		pg_comp_crc32c((crc), (data), (len))))
-
-extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
-
-extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
-
-extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
-
-/*
- * Use Intel SSE 4.2 or ARMv8 instructions, but perform a runtime check first
- * to check that they are available.
- */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c((crc), (data), (len)))
-
-extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
-
-#ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-#endif
-#ifdef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
-#endif
-
 #else
 /*
  * Use slicing-by-8 algorithm.
diff --git a/src/port/Makefile b/src/port/Makefile
index 42c02f1b3d..805509b830 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -83,16 +83,6 @@ libpgport.a: $(OBJS)
 	rm -f $@
 	$(AR) $(AROPT) $@ $^
 
-# all versions of pg_crc32c_sse42.o need CFLAGS_CRC
-pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
-
-# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
-pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
-
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 3f17cd2f8d..1144a967e6 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,7 +7,6 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
-  'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
@@ -23,6 +22,16 @@ pgport_sources = [
   'tar.c',
 ]
 
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+  pgport_sources += files(
+  'pg_popcount_avx512.c',
+  'pg_crc32c_x86_choose.c',
+  'pg_crc32c_avx512.c',
+  'pg_crc32c_sse42.c',
+  'pg_crc32c_sb8.c',
+    )
+endif
+
 if host_system == 'windows'
   pgport_sources += files(
     'dirmod.c',
@@ -81,16 +90,7 @@ endif
 # is true
 replace_funcs_pos = [
   # x86/x64
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
-  ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
-  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
-  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sse42', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
-  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
+  ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS'],
 
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
index 98353f7e1d..3687f69da2 100644
--- a/src/port/pg_crc32c_avx512.c
+++ b/src/port/pg_crc32c_avx512.c
@@ -57,7 +57,11 @@
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
+
+#if defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
 pg_attribute_no_sanitize_alignment()
+pg_attribute_target("avx512vl", "vpclmulqdq")
 inline pg_crc32c
 pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
 {
@@ -195,3 +199,4 @@ pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
 	 */
 	return pg_comp_crc32c_sse42(crc, input, length);
 }
+#endif /* USE_AVX512_CRC32C_WITH_RUNTIME_CHECK */
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
deleted file mode 100644
index 4f11c278be..0000000000
--- a/src/port/pg_crc32c_avx512_choose.c
+++ /dev/null
@@ -1,42 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_crc32c_avx512_choose.c
- *	  Choose between Intel AVX-512 and software CRC-32C implementation.
- *
- * On first call, checks if the CPU we're running on supports Intel AVX-
- * 512. If it does, use the special AVX-512 instructions for CRC-32C
- * computation. Otherwise, fall back to the pure software implementation
- * (slicing-by-8).
- *
- * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/port/pg_crc32c_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "c.h"
-
-#include "port/pg_crc32c.h"
-#include "port/pg_hw_feat_check.h"
-
-
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static pg_crc32c
-pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
-{
-	if (pg_crc32c_avx512_available())
-		pg_comp_crc32c = pg_comp_crc32c_avx512;
-	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
-	return pg_comp_crc32c(crc, data, len);
-}
-
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index 7f88c11480..0d6829af5c 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -18,7 +18,10 @@
 
 #include "port/pg_crc32c.h"
 
+#if defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
+
 pg_attribute_no_sanitize_alignment()
+pg_attribute_target("sse4.2")
 pg_crc32c
 pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
 {
@@ -67,3 +70,4 @@ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
 
 	return crc;
 }
+#endif /* USE_SSE42_CRC32C || USE_SSE42_CRC32C_WITH_RUNTIME_CHECK */
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_x86_choose.c
similarity index 58%
rename from src/port/pg_crc32c_sse42_choose.c
rename to src/port/pg_crc32c_x86_choose.c
index 36e6949362..fa028327fb 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_x86_choose.c
@@ -1,19 +1,18 @@
 /*-------------------------------------------------------------------------
  *
- * pg_crc32c_sse42_choose.c
- *	  Choose between Intel SSE 4.2 and software CRC-32C implementation.
+ * pg_crc32c_x86_choose.c
+ *	  Choose between Intel AVX-512, SSE 4.2 and software CRC-32C implementation.
  *
- * On first call, checks if the CPU we're running on supports Intel SSE
- * 4.2. If it does, use the special SSE instructions for CRC-32C
- * computation. Otherwise, fall back to the pure software implementation
- * (slicing-by-8).
+ * On first call, checks if the CPU we're running on supports Intel AVX-512 or
+ * SSE 4.2, and uses the fastest supported instructions for CRC-32C computation.
+ * Otherwise, fall back to the pure software implementation (slicing-by-8).
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
- *	  src/port/pg_crc32c_sse42_choose.c
+ *	  src/port/pg_crc32c_x86_choose.c
  *
  *-------------------------------------------------------------------------
  */
@@ -30,11 +29,17 @@
 static pg_crc32c
 pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 {
-	if (pg_crc32c_sse42_available())
+	pg_comp_crc32c = pg_comp_crc32c_sb8;
+#ifdef USE_SSE42_CRC32C
+	pg_comp_crc32c = pg_comp_crc32c_sse42;
+#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
+	if (pg_crc32c_sse42_available())
 		pg_comp_crc32c = pg_comp_crc32c_sse42;
-	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
+#endif
+#ifdef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+#endif
 	return pg_comp_crc32c(crc, data, len);
 }
 
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 35d6f9cdb1..c697d25b76 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -96,6 +96,9 @@ osxsave_available(void)
  * NB: Caller is responsible for verifying that osxsave_available() returns true
  * before calling this.
  */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
 inline static bool
 zmm_regs_available(void)
 {
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index b598e86554..6f18561cfb 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -12,19 +12,12 @@
  */
 #include "c.h"
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 #include <immintrin.h>
 #endif
 
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
 #include "port/pg_bitutils.h"
+#include "port/pg_hw_feat_check.h"
 
 /*
  * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
@@ -33,75 +26,6 @@
  */
 #if defined(TRY_POPCNT_FAST) && defined(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK)
 
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-#ifdef HAVE_XSAVE_INTRINSICS
-pg_attribute_target("xsave")
-#endif
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
 /*
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
-- 
2.43.0
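
The selection logic in pg_crc32c_x86_choose.c can be looked at in isolation. What follows is a minimal sketch of the same self-replacing function pointer pattern, not the patch's code: the names crc_impl, crc_choose, crc_reset_dispatch, have_sse42, have_avx512 and last_impl are all hypothetical, the three "implementations" are stubs that only record which path ran, and the feature flags stand in for the CPUID/XGETBV probes in pg_hw_feat_check.c.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t crc_t;
typedef crc_t (*crc_fn) (crc_t crc, const void *data, size_t len);

/* Hypothetical marker so a caller can observe which path was taken. */
static int last_impl = 0;		/* 1 = sb8, 2 = sse42, 3 = avx512 */

/* Stub implementations: they do no CRC work, they only tag the path. */
static crc_t
crc_sb8(crc_t crc, const void *data, size_t len)
{
	(void) data; (void) len;
	last_impl = 1;
	return crc;
}

static crc_t
crc_sse42(crc_t crc, const void *data, size_t len)
{
	(void) data; (void) len;
	last_impl = 2;
	return crc;
}

static crc_t
crc_avx512(crc_t crc, const void *data, size_t len)
{
	(void) data; (void) len;
	last_impl = 3;
	return crc;
}

/* Hypothetical feature flags; real code would query the CPU instead. */
static int have_sse42 = 0;
static int have_avx512 = 0;

static crc_t crc_choose(crc_t crc, const void *data, size_t len);

/* Every call goes through this pointer; it starts out at the chooser. */
static crc_fn crc_impl = crc_choose;

/*
 * Runs on the first call only: start from the software fallback, upgrade
 * to the best available implementation, then forward the original call.
 */
static crc_t
crc_choose(crc_t crc, const void *data, size_t len)
{
	crc_impl = crc_sb8;
	if (have_sse42)
		crc_impl = crc_sse42;
	if (have_avx512)
		crc_impl = crc_avx512;
	return crc_impl(crc, data, len);
}

/* Test helper (hypothetical): force the next call to choose again. */
static void
crc_reset_dispatch(void)
{
	crc_impl = crc_choose;
}
```

Because the chooser overwrites the pointer before forwarding, only the first call pays for the feature probes; later calls cost one indirect call. The order of the checks matters: the last successful check wins, so the widest instruction set is tested last.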

#37 Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Devulapalli, Raghuveer (#36)
6 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

v6: Fixing build failure on Windows/MSVC.

Raghuveer

Attachments:

v6-0001-Add-a-Postgres-SQL-function-for-crc32c-testing.patch (application/octet-stream)
From b601e7b4ee9f25fd32e9d8d056bb20a03d755a8a Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v6 1/6] Add a Postgres SQL function for crc32c testing.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/test/modules/test_crc32c/Makefile         | 20 +++++++++
 .../modules/test_crc32c/test_crc32c--1.0.sql  |  1 +
 src/test/modules/test_crc32c/test_crc32c.c    | 41 +++++++++++++++++++
 .../modules/test_crc32c/test_crc32c.control   |  4 ++
 4 files changed, 66 insertions(+)
 create mode 100644 src/test/modules/test_crc32c/Makefile
 create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

diff --git a/src/test/modules/test_crc32c/Makefile b/src/test/modules/test_crc32c/Makefile
new file mode 100644
index 0000000000..5b747c6184
--- /dev/null
+++ b/src/test/modules/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o:	CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_crc32c/test_crc32c--1.0.sql b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 0000000000..32f8f0fb2e
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c  (count int, num int) RETURNS bigint AS 'test_crc32c.so' LANGUAGE C;
diff --git a/src/test/modules/test_crc32c/test_crc32c.c b/src/test/modules/test_crc32c/test_crc32c.c
new file mode 100644
index 0000000000..5273158faf
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.c
@@ -0,0 +1,41 @@
+/* select drive_crc32c(1000000, 1024); */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+
+#include "port/pg_crc32c.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int32			count	= PG_GETARG_INT32(0);
+	int32			num		= PG_GETARG_INT32(1);
+	pg_crc32c		crc		= 0xFFFFFFFF;
+	const char*		data	= malloc((size_t)num);
+
+	INIT_CRC32C(crc);
+
+	while(count--)
+	{
+		memset((void*)data, count, (size_t)Min(16,num));
+		COMP_CRC32C(crc, data, num);
+	}
+
+	FIN_CRC32C(crc);
+
+	free((void *)data);
+
+	PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/src/test/modules/test_crc32c/test_crc32c.control b/src/test/modules/test_crc32c/test_crc32c.control
new file mode 100644
index 0000000000..878a077ee1
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_crc32c'
+relocatable = true
-- 
2.43.0
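
The test module above only drives COMP_CRC32C in a loop and returns whatever value comes out, so it cannot tell a fast but wrong implementation from a correct one. A minimal sketch of a bit-at-a-time CRC-32C reference that could be used to cross-check results follows. It is not part of the patch set; the name crc32c_bitwise is hypothetical, and the only facts assumed are the reflected Castagnoli polynomial 0x82F63B78 and the standard check value 0xE3069283 for the ASCII string "123456789".

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reference helper, not taken from the patch.
 *
 * Updates a running CRC-32C value one bit at a time. The caller starts
 * with 0xFFFFFFFF and XORs the final value with 0xFFFFFFFF, which mirrors
 * what INIT_CRC32C and FIN_CRC32C do around COMP_CRC32C.
 */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
		{
			/* mask is all ones when the low bit is set, else zero */
			uint32_t	mask = 0u - (crc & 1u);

			crc = (crc >> 1) ^ (0x82F63B78u & mask);
		}
	}
	return crc;
}
```

Because the function takes and returns the running value, a buffer can be fed in several pieces and must give the same result as a single call. That property is handy for checking an implementation that switches algorithms at a length threshold such as the 256 bytes discussed earlier in the thread.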

v6-0002-Move-all-HW-checks-to-common-file.patch (application/octet-stream)
From da26645ec8515e0e6d91e2311a83c3bb6649017e Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH v6 2/6] Move all HW checks to common file.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 configure                            |  12 +-
 configure.ac                         |   2 +-
 src/include/port/pg_bitutils.h       |   1 -
 src/include/port/pg_hw_feat_check.h  |  33 ++++++
 src/port/Makefile                    |   9 +-
 src/port/meson.build                 |   2 +-
 src/port/pg_bitutils.c               |  22 +---
 src/port/pg_crc32c_sse42_choose.c    |  27 +----
 src/port/pg_hw_feat_check.c          | 159 +++++++++++++++++++++++++++
 src/port/pg_popcount_avx512_choose.c | 102 -----------------
 10 files changed, 208 insertions(+), 161 deletions(-)
 create mode 100644 src/include/port/pg_hw_feat_check.h
 create mode 100644 src/port/pg_hw_feat_check.c
 delete mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/configure b/configure
index 3a577e463b..cd43a63892 100755
--- a/configure
+++ b/configure
@@ -14731,7 +14731,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14777,7 +14777,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14801,7 +14801,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14846,7 +14846,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14870,7 +14870,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17440,7 +17440,7 @@ fi
 
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
 
 $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
diff --git a/configure.ac b/configure.ac
index 55f6c46d33..5a7cc3f6f2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2072,7 +2072,7 @@ if test x"$host_cpu" = x"x86_64"; then
     PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
   fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+    PG_POPCNT_OBJS="pg_popcount_avx512.o"
     AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
   fi
 fi
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 4d88478c9c..263f27930d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -312,7 +312,6 @@ extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int
  * files.
  */
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
new file mode 100644
index 0000000000..58be900b54
--- /dev/null
+++ b/src/include/port/pg_hw_feat_check.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.h
+ *	  Miscellaneous functions for checking for hardware features at runtime.
+ *
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_hw_feat_check.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_HW_FEAT_CHECK_H
+#define PG_HW_FEAT_CHECK_H
+
+/*
+ * Test to see if all hardware features required by SSE 4.2 crc32c (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_sse42_available(void);
+
+/*
+ * Test to see if the POPCNT instruction (64 bit) is available, as reported
+ * by CPUID.
+ */
+extern PGDLLIMPORT bool pg_popcount_available(void);
+
+/*
+ * Test to see if all hardware features required by AVX-512 POPCNT are
+ * available.
+ */
+extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+#endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 9324ec2d9f..aecfe5f62b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_hw_feat_check.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
@@ -92,10 +93,10 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
-# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
-pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+# all versions of pg_hw_feat_check.o need CFLAGS_XSAVE
+pg_hw_feat_check.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_shlib.o:	CFLAGS+=$(CFLAGS_XSAVE)
+pg_hw_feat_check_srv.o:	CFLAGS+=$(CFLAGS_XSAVE)
 
 # all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
 pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
diff --git a/src/port/meson.build b/src/port/meson.build
index 1150966ab7..907ddce33f 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -85,7 +85,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
-  ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
+  ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
 
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 87f56e82b8..b2823d5732 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -20,7 +20,7 @@
 #endif
 
 #include "port/pg_bitutils.h"
-
+#include "port/pg_hw_feat_check.h"
 
 /*
  * Array giving the position of the left-most set bit for each possible
@@ -109,7 +109,6 @@ static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
@@ -127,25 +126,6 @@ uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask)
 
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
-
 /*
  * These functions get called on the first call to pg_popcount32 etc.
  * They detect whether we can use the asm implementations, and replace
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 56d600f3a9..36e6949362 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,31 +20,8 @@
 
 #include "c.h"
 
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
 #include "port/pg_crc32c.h"
-
-static bool
-pg_crc32c_sse42_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
-}
+#include "port/pg_hw_feat_check.h"
 
 /*
  * This gets called on the first call. It replaces the function pointer
@@ -61,4 +38,4 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	return pg_comp_crc32c(crc, data, len);
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
new file mode 100644
index 0000000000..455005add5
--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,159 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;
+
+/*
+ * Helper function.
+ * Test for a bit being set in an exx_t register.
+ */
+inline static bool is_bit_set_in_exx(exx_t* regs, reg_name ex, int bit)
+{
+	return ((regs[ex] & (1U << bit)) != 0);
+}
+
+/*
+ * x86_64 Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * x86_64 Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline static bool
+osxsave_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 27); /* osxsave */
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+inline static bool
+zmm_regs_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+inline static bool
+avx512_popcnt_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 14) && is_bit_set_in_exx(exx, EBX, 30);
+}
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+bool PGDLLIMPORT pg_popcount_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 23);
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ *
+ * PA: The call to 'osxsave_available' MUST precede the call to
+ *     'zmm_regs_available' function per NB above.
+ */
+bool PGDLLIMPORT pg_popcount_avx512_available(void)
+{
+	return osxsave_available() &&
+		zmm_regs_available() &&
+		avx512_popcnt_available();
+}
+
+/*
+ * Does CPUID say there's support for SSE 4.2?
+ */
+bool PGDLLIMPORT pg_crc32c_sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 20);
+}
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
deleted file mode 100644
index b37107803a..0000000000
--- a/src/port/pg_popcount_avx512_choose.c
+++ /dev/null
@@ -1,102 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_popcount_avx512_choose.c
- *    Test whether we can use the AVX-512 pg_popcount() implementation.
- *
- * Copyright (c) 2024, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *    src/port/pg_popcount_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-#include "c.h"
-
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE_XSAVE_INTRINSICS
-#include <immintrin.h>
-#endif
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
-#include "port/pg_bitutils.h"
-
-/*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
- * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
- */
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
-#endif							/* TRY_POPCNT_FAST */
-- 
2.43.0
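
The ordering rule in pg_popcount_avx512_available() (OSXSAVE first, then XGETBV, then the leaf 7 feature bits) can be sketched as a standalone program. This is only an illustration under assumptions: it targets GCC or Clang on x86, uses __get_cpuid and __get_cpuid_count from <cpuid.h>, and issues XGETBV through inline assembly instead of _xgetbv so that no extra compiler flag is needed. The names bit_is_set, zmm_state_enabled and avx512_popcnt_usable are hypothetical and do not appear in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Register indexes into a CPUID result array, as in the patch. */
enum { EAX = 0, EBX = 1, ECX = 2, EDX = 3 };

/* Pure helper: is 'bit' set in the chosen register of a CPUID result? */
static bool
bit_is_set(const unsigned int regs[4], int reg, int bit)
{
	return (regs[reg] & (1U << bit)) != 0;
}

/*
 * Pure helper: does this XCR0 value enable SSE, AVX, opmask and both
 * ZMM state components (bits 1, 2, 5, 6 and 7, i.e. mask 0xe6)?
 */
static bool
zmm_state_enabled(uint64_t xcr0)
{
	return (xcr0 & 0xe6) == 0xe6;
}

/* Hardware query (hypothetical name); returns false on non-x86 builds. */
static bool
avx512_popcnt_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int r[4] = {0, 0, 0, 0};
	unsigned int lo, hi;

	/* Step 1: XGETBV is only safe to execute if OSXSAVE is set. */
	if (!__get_cpuid(1, &r[EAX], &r[EBX], &r[ECX], &r[EDX]) ||
		!bit_is_set(r, ECX, 27))
		return false;

	/* Step 2: has the OS enabled the ZMM register state? */
	__asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));
	if (!zmm_state_enabled(((uint64_t) hi << 32) | lo))
		return false;

	/* Step 3: leaf 7 feature bits, avx512-vpopcntdq and avx512-bw. */
	if (!__get_cpuid_count(7, 0, &r[EAX], &r[EBX], &r[ECX], &r[EDX]))
		return false;
	return bit_is_set(r, ECX, 14) && bit_is_set(r, EBX, 30);
#else
	return false;
#endif
}
```

Keeping the bit tests in pure helpers means they can be exercised with synthetic register values on any machine, while only avx512_popcnt_usable() depends on the hardware it runs on.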

v6-0003-Add-support-for-the-SIMD-AVX-512-crc32c-algorithm.patch (application/octet-stream)
From 99a17e7097625f7029695d2e41f7d414fbd020d8 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Fri, 9 Aug 2024 08:00:09 -0700
Subject: [PATCH v6 3/6] Add support for the SIMD AVX-512 crc32c algorithm.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 config/c-compiler.m4                |  48 ++++++
 configure                           | 213 ++++++++++++++++++++-----
 configure.ac                        | 106 +++++++-----
 meson.build                         |  40 ++++-
 src/include/pg_config.h.in          |   3 +
 src/include/port/pg_crc32c.h        |  23 +++
 src/include/port/pg_hw_feat_check.h |   9 +-
 src/port/Makefile                   |   5 +
 src/port/meson.build                |   4 +
 src/port/pg_crc32c_avx512.c         | 239 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_avx512_choose.c  |  42 +++++
 src/port/pg_hw_feat_check.c         |  71 ++++++++-
 12 files changed, 717 insertions(+), 86 deletions(-)
 create mode 100644 src/port/pg_crc32c_avx512.c
 create mode 100644 src/port/pg_crc32c_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..1d33932cb5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -628,6 +628,54 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the intrinsics used by the AVX-512 CRC32C
+# implementation (_mm512_clmulepi64_epi128, _mm512_ternarylogic_epi64, ...).
+#
+# The 8-byte SSE 4.2 variant, _mm_crc32_u64, is exercised by the test
+# program as well.
+#
+# Optional compiler flags can be passed as an argument (e.g. -msse4.2
+# -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
+# pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const unsigned long k1k2[[8]] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[[512]];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
 
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
diff --git a/configure b/configure
index cd43a63892..474282e8ba 100755
--- a/configure
+++ b/configure
@@ -17533,6 +17533,123 @@ fi
 
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if _mm512_clmulepi64_epi128 and the related intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC
+# is set to -msse4.2, -mavx512vl and -mvpclmulqdq if that's required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics_=yes
+else
+  pgac_cv_avx512_crc32_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
+  CFLAGS_CRC=""
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+else
+  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
+  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17705,31 +17822,42 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+   # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -17748,44 +17876,53 @@ $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
 $as_echo "SSE 4.2" >&6; }
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
+$as_echo "AVX 512 with runtime check" >&6; }
+  else
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
 $as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+    else
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      else
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+        else
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+          else
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
+          fi
         fi
       fi
     fi
diff --git a/configure.ac b/configure.ac
index 5a7cc3f6f2..5d7ececfbc 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2089,6 +2089,17 @@ if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
   PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if _mm512_clmulepi64_epi128 and the related intrinsics can
+# be used with the default compiler flags. If not, check if adding the
+# -msse4.2, -mavx512vl and -mvpclmulqdq flags helps. CFLAGS_CRC is set
+# to -msse4.2, -mavx512vl and -mvpclmulqdq if those flags are required.
+PGAC_AVX512_CRC32_INTRINSICS([])
+if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
+  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
+fi
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2134,31 +2145,42 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+  # Use Intel AVX 512 if available.
+  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
+    USE_AVX512_CRC32C=1
   else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+    # Use Intel SSE 4.2 if available.
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      USE_SSE42_CRC32C=1
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
+      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
+      # the runtime check.
+      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+        USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
       else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
+        # the runtime check.
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
         else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
+          # Use ARM CRC Extension if available.
+          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+            USE_ARMV8_CRC32C=1
           else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
+            # ARM CRC Extension, with runtime check?
+            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
+            else
+              # LoongArch CRCC instructions.
+              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+                USE_LOONGARCH_CRC32C=1
+              else
+                # fall back to slicing-by-8 algorithm, which doesn't require any
+                # special CPU support.
+                USE_SLICING_BY_8_CRC32C=1
+              fi
+            fi
           fi
         fi
       fi
@@ -2173,29 +2195,35 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
   PG_CRC32C_OBJS="pg_crc32c_sse42.o"
   AC_MSG_RESULT(SSE 4.2)
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    AC_MSG_RESULT(AVX 512 with runtime check)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+      AC_MSG_RESULT(SSE 4.2 with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+        AC_MSG_RESULT(ARMv8 CRC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
+        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
         else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
+          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+            AC_MSG_RESULT(LoongArch CRCC instructions)
+          else
+            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+            AC_MSG_RESULT(slicing-by-8)
+          fi
         fi
       fi
     fi
diff --git a/meson.build b/meson.build
index 58e67975e8..fab6373fef 100644
--- a/meson.build
+++ b/meson.build
@@ -2242,6 +2242,34 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
+    avx_prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+  const unsigned long k1k2[8] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[512];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;
+}
+'''
 
     prog = '''
 #include <nmmintrin.h>
@@ -2256,12 +2284,20 @@ int main(void)
 }
 '''
 
-    if cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(avx_prog,
+          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
+          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
+      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
+      cdata.set('USE_AVX512_CRC32C', false)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
+    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
+    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
           args: test_c_args + ['-msse4.2'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 427030f31a..65623df7f9 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -712,6 +712,9 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..ade06dbcab 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -49,6 +49,14 @@ typedef uint32 pg_crc32c;
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C)
+/* Use Intel AVX512 instructions. */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_ARMV8_CRC32C)
 /* Use ARMv8 CRC Extension instructions. */
 
@@ -67,6 +75,21 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
+#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+/*
+ * Use Intel AVX-512 instructions, but perform a runtime check first to check that
+ * they are available.
+ */
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
+
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
index 58be900b54..21ee8615e1 100644
--- a/src/include/port/pg_hw_feat_check.h
+++ b/src/include/port/pg_hw_feat_check.h
@@ -30,4 +30,11 @@ extern PGDLLIMPORT bool pg_popcount_available(void);
  * available.
  */
 extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
-#endif							/* PG_HW_FEAT_CHECK_H */
+
+/*
+ * Test to see if all hardware features required by the AVX-512 SIMD
+ * algorithm are available.
+ */
+extern bool pg_crc32c_avx512_available(void);
+
+#endif						/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index aecfe5f62b..b72deed50e 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -88,6 +88,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
+pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
+
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 907ddce33f..e3b05622d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -83,6 +83,10 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
+  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..be42a34a73
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,239 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+/*
+ * Process eight bytes of data at a time.
+ *
+ * NB: We do unaligned accesses here. The Intel architecture allows that,
+ * and performance testing didn't show any performance gain from aligning
+ * the begin address.
+ */
+pg_attribute_no_sanitize_alignment()
+inline static pg_crc32c
+crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
+{
+	const unsigned char *pend = p + length;
+
+	/*
+	 * Process eight bytes of data at a time.
+	 *
+	 * NB: We do unaligned accesses here. The Intel architecture allows that,
+	 * and performance testing didn't show any performance gain from aligning
+	 * the begin address.
+	 */
+	while (p + 8 <= pend)
+	{
+		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
+		p += 8;
+	}
+
+	/* Process remaining full four bytes if any */
+	if (p + 4 <= pend)
+	{
+		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
+		p += 4;
+	}
+
+	/* Process any remaining bytes one at a time. */
+	while (p < pend)
+	{
+		crc = _mm_crc32_u8(crc, *p);
+		p++;
+	}
+
+	return crc;
+}
+
+/*******************************************************************
+ * pg_comp_crc32c_avx512(): compute the crc32c of the buffer. The
+ * AVX-512 path needs at least 256 bytes and consumes the input in
+ * 64-byte blocks; the non-AVX-512 tail code handles the rest. Based on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009
+ *
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+pg_attribute_no_sanitize_alignment()
+inline pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+
+		/*
+		 * There's at least one block of 256.
+		 */
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_loadu_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		 * Parallel fold blocks of 256, if any.
+		 */
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_loadu_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes.
+	 */
+	return crc32c_fallback(crc, input, length);
+}
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
new file mode 100644
index 0000000000..4f11c278be
--- /dev/null
+++ b/src/port/pg_crc32c_avx512_choose.c
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512_choose.c
+ *	  Choose between Intel AVX-512 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel
+ * AVX-512. If it does, use the special AVX-512 instructions for CRC-32C
+ * computation. Otherwise, fall back to the pure software implementation
+ * (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include "port/pg_crc32c.h"
+#include "port/pg_hw_feat_check.h"
+
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ */
+static pg_crc32c
+pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
+{
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+	else
+		pg_comp_crc32c = pg_comp_crc32c_sb8;
+
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 455005add5..35d6f9cdb1 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -132,9 +132,60 @@ bool PGDLLIMPORT pg_popcount_available(void)
 	return is_bit_set_in_exx(exx, ECX, 23);
  }
 
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline static bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline static bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, ECX, 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline static bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 31); /* avx512-vl */
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline static bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set_in_exx(exx, ECX, 20); /* sse4.2 */
+}
+
+/****************************************************************************/
+/*                               Public API                                 */
+/****************************************************************************/
  /*
-  * Returns true if the CPU supports the instructions required for the AVX-512
-  * pg_popcount() implementation.
+  * Returns true if the CPU supports the instructions required for the
+  * AVX-512 pg_popcount() implementation.
   *
   * PA: The call to 'osxsave_available' MUST preceed the call to
   *     'zmm_regs_available' function per NB above.
@@ -151,9 +202,17 @@ bool PGDLLIMPORT pg_popcount_avx512_available(void)
  */
 bool PGDLLIMPORT pg_crc32c_sse42_available(void)
 {
-	exx_t exx[4] = {0, 0, 0, 0};
-
-	pg_getcpuid(1, exx);
+	return sse42_available();
+}
 
-	return is_bit_set_in_exx(exx, ECX, 20);
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+inline bool
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
 }
-- 
2.43.0
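
For readers following the "choose" file in the patch above, here is a minimal, portable sketch of the same first-call dispatch pattern. It is not the patch's code: crc32c_portable, crc32c_fast, hw_crc_available, crc32c_impl and crc32c_compute are hypothetical names, the "fast" routine is only a placeholder that forwards to a bitwise reference CRC-32C, and the feature probe is hard-wired so the sketch compiles on any CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t crc32c_t;
typedef crc32c_t (*crc32c_fn) (crc32c_t crc, const void *data, size_t len);

/* Bitwise reference CRC-32C (reflected polynomial 0x82F63B78); slow but portable. */
static crc32c_t
crc32c_portable(crc32c_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	assert(data != NULL || len == 0);
	while (len-- > 0)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Hypothetical placeholder for an accelerated routine; it just forwards here. */
static crc32c_t
crc32c_fast(crc32c_t crc, const void *data, size_t len)
{
	return crc32c_portable(crc, data, len);
}

/* Hypothetical stand-in for a CPUID-based probe; always reports "not available". */
static int
hw_crc_available(void)
{
	return 0;
}

static crc32c_t crc32c_choose(crc32c_t crc, const void *data, size_t len);

/* Starts out pointing at the chooser, so the first call triggers the probe. */
static crc32c_fn crc32c_impl = crc32c_choose;

/* Runs once: rebinds the pointer, then forwards the first request. */
static crc32c_t
crc32c_choose(crc32c_t crc, const void *data, size_t len)
{
	crc32c_impl = hw_crc_available() ? crc32c_fast : crc32c_portable;
	return crc32c_impl(crc, data, len);
}

/* Convenience wrapper: init, compute, finalize (the INIT/COMP/FIN steps). */
static crc32c_t
crc32c_compute(const void *data, size_t len)
{
	crc32c_t	crc = 0xFFFFFFFFu;

	crc = crc32c_impl(crc, data, len);
	return crc ^ 0xFFFFFFFFu;
}
```

After the first call the pointer no longer refers to the chooser, so later calls pay only for one indirect call. The standard CRC-32C check string "123456789" is a convenient sanity test for any replacement routine plugged in behind the pointer.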

Attachment: v6-0004-New-COMP_CRC32C-macro-for-AVX512-simplify-code-so.patch (application/octet-stream)
From 558da005e91b71517d788498c36d66c900366bfe Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 27 Aug 2024 08:26:19 -0700
Subject: [PATCH v6 4/6] New COMP_CRC32C macro for AVX512, simplify code some.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
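
Note: the COMP_CRC32C change in this patch routes calls shorter than 256 bytes to pg_comp_crc32c_sse42 and everything else through the pg_comp_crc32c function pointer. The sketch below shows that length-threshold idea in isolation, as an assumption-laden illustration rather than the patch's code: crc32c_small, crc32c_wide, crc32c_dispatch, WIDE_PATH_MIN_LEN and the call counters are hypothetical, and both paths share one bitwise routine so the block builds on any platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cutoff, taken from the 256-byte minimum discussed in this thread. */
#define WIDE_PATH_MIN_LEN 256

/* Hypothetical counters, present only to make the routing observable. */
static unsigned small_calls;
static unsigned wide_calls;

/* Bitwise reference CRC-32C (reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	assert(data != NULL || len == 0);
	while (len-- > 0)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Stand-in for the short-input routine (SSE 4.2 in the patch). */
static uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
	small_calls++;
	return crc32c_bitwise(crc, data, len);
}

/* Stand-in for the wide routine (the AVX-512 path in the patch). */
static uint32_t
crc32c_wide(uint32_t crc, const void *data, size_t len)
{
	wide_calls++;
	return crc32c_bitwise(crc, data, len);
}

/* Route by length; as a function, each argument is evaluated exactly once. */
static inline uint32_t
crc32c_dispatch(uint32_t crc, const void *data, size_t len)
{
	return (len < WIDE_PATH_MIN_LEN)
		? crc32c_small(crc, data, len)
		: crc32c_wide(crc, data, len);
}
```

Writing the dispatch as an inline function instead of a macro is one possible variation: the macro form names (len) twice, which matters only if a caller passes an expression with side effects.
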
 configure                    |  2 +-
 configure.ac                 |  2 +-
 src/include/port/pg_crc32c.h | 17 ++++++-------
 src/port/meson.build         |  1 +
 src/port/pg_crc32c_avx512.c  | 46 ++----------------------------------
 5 files changed, 12 insertions(+), 56 deletions(-)

diff --git a/configure b/configure
index 474282e8ba..8af995e48f 100755
--- a/configure
+++ b/configure
@@ -17880,7 +17880,7 @@ else
 
 $as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
     { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
 $as_echo "AVX 512 with runtime check" >&6; }
   else
diff --git a/configure.ac b/configure.ac
index 5d7ececfbc..a8c7911754 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2197,7 +2197,7 @@ if test x"$USE_SSE42_CRC32C" = x"1"; then
 else
   if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
     AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_avx512_choose.o"
+    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
     AC_MSG_RESULT(AVX 512 with runtime check)
   else
     if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index ade06dbcab..3f83d9f815 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -40,12 +40,12 @@ typedef uint32 pg_crc32c;
 /* The INIT and EQ macros are the same for all implementations. */
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
+#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 #if defined(USE_SSE42_CRC32C)
 /* Use Intel SSE4.2 instructions. */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 
@@ -53,7 +53,6 @@ extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t le
 /* Use Intel AVX512 instructions. */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
 
@@ -62,7 +61,6 @@ extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t l
 
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 
@@ -71,7 +69,6 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 
@@ -82,14 +79,17 @@ extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_
  * they are available.
  */
 #define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
+	((crc) = ((len) < 256 ? \
+		pg_comp_crc32c_sse42((crc), (data), (len)) : \
+		pg_comp_crc32c((crc), (data), (len))))
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
 
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
 
+extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
+
 #elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
 
 /*
@@ -98,7 +98,6 @@ extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t l
  */
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
@@ -121,13 +120,11 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
 #ifdef WORDS_BIGENDIAN
+#undef FIN_CRC32C
 #define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
-#else
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 #endif
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-
 #endif
 
 #endif							/* PG_CRC32C_H */
diff --git a/src/port/meson.build b/src/port/meson.build
index e3b05622d1..b53c33c8eb 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -86,6 +86,7 @@ replace_funcs_pos = [
   ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
   ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_crc32c_sse42', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
index be42a34a73..98353f7e1d 100644
--- a/src/port/pg_crc32c_avx512.c
+++ b/src/port/pg_crc32c_avx512.c
@@ -18,48 +18,6 @@
 
 #include "port/pg_crc32c.h"
 
-/*
- * Process eight bytes of data at a time.
- *
- * NB: We do unaligned accesses here. The Intel architecture allows that,
- * and performance testing didn't show any performance gain from aligning
- * the begin address.
- */
-pg_attribute_no_sanitize_alignment()
-inline static pg_crc32c
-crc32c_fallback(pg_crc32c crc, const uint8 *p, size_t length)
-{
-	const unsigned char *pend = p + length;
-
-	/*
-	 * Process eight bytes of data at a time.
-	 *
-	 * NB: We do unaligned accesses here. The Intel architecture allows that,
-	 * and performance testing didn't show any performance gain from aligning
-	 * the begin address.
-	 */
-	while (p + 8 <= pend)
-	{
-		crc = (uint32)_mm_crc32_u64(crc, *((const uint64 *)p));
-		p += 8;
-	}
-
-	/* Process remaining full four bytes if any */
-	if (p + 4 <= pend)
-	{
-		crc = _mm_crc32_u32(crc, *((const unsigned int *)p));
-		p += 4;
-	}
-
-	/* Process any remaining bytes one at a time. */
-	while (p < pend)
-	{
-		crc = _mm_crc32_u8(crc, *p);
-		p++;
-	}
-
-	return crc;
-}
 
 /*******************************************************************
  * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
@@ -233,7 +191,7 @@ pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
 	}
 
 	/*
-	 * Finish any remaining bytes.
+	 * Finish any remaining bytes with the SSE 4.2 algorithm.
 	 */
-	return crc32c_fallback(crc, input, length);
+	return pg_comp_crc32c_sse42(crc, input, length);
 }
-- 
2.43.0
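
Note on the patch above: the new COMP_CRC32C macro sends calls shorter than 256 bytes straight to the SSE 4.2 routine and sends everything else through the pg_comp_crc32c function pointer. The following is a minimal, portable sketch of that dispatch shape only. All names in it (crc32c_bitwise, small_path, large_path, large_impl, DEMO_COMP_CRC32C, demo_crc32c) are hypothetical stand-ins, and both paths use the same bitwise software CRC-32C rather than any SSE 4.2 or AVX-512 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t crc32c_t;

/*
 * Bitwise CRC-32C (reflected Castagnoli polynomial 0x82F63B78).
 * Stand-in for a real implementation; slow but portable.
 */
static crc32c_t
crc32c_bitwise(crc32c_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	assert(data != NULL || len == 0);
	while (len-- > 0)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Counters so a caller can observe which path was taken (demo only). */
static unsigned small_calls;
static unsigned large_calls;

/* Hypothetical stand-in for the short-buffer routine. */
static crc32c_t
small_path(crc32c_t crc, const void *data, size_t len)
{
	small_calls++;
	return crc32c_bitwise(crc, data, len);
}

/* Hypothetical stand-in for the routine chosen at runtime. */
static crc32c_t
large_path(crc32c_t crc, const void *data, size_t len)
{
	large_calls++;
	return crc32c_bitwise(crc, data, len);
}

/* Function pointer, analogous in shape to pg_comp_crc32c. */
static crc32c_t (*large_impl) (crc32c_t, const void *, size_t) = large_path;

/*
 * Length-based dispatch.  Note that (len) is evaluated more than once,
 * so callers should not pass an expression with side effects.
 */
#define DEMO_COMP_CRC32C(crc, data, len) \
	((crc) = ((len) < 256 ? \
		small_path((crc), (data), (len)) : \
		large_impl((crc), (data), (len))))

/* Convenience wrapper: init, compute, finalize. */
static crc32c_t
demo_crc32c(const void *data, size_t len)
{
	crc32c_t	crc = 0xFFFFFFFF;

	DEMO_COMP_CRC32C(crc, data, len);
	return crc ^ 0xFFFFFFFF;
}
```

Calling demo_crc32c() on a 9-byte buffer takes the small path, while a 256-byte buffer goes through the function pointer. The length test happens at each call site, so short calls skip the indirect call.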

Attachment: v6-0005-use-__attribute__-target-.-for-AVX-512-stuff.patch (application/octet-stream)
From a495124ee42cb8f9f206f719b9f2235aff715963 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v6 5/6] use __attribute__((target(...))) for AVX-512 stuff

Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 config/c-compiler.m4          |  60 ++++++-------
 configure                     | 163 ++++++++--------------------------
 configure.ac                  |  17 +---
 meson.build                   |  17 +---
 src/Makefile.global.in        |   5 --
 src/include/c.h               |  10 +++
 src/makefiles/meson.build     |   4 +-
 src/port/Makefile             |   7 +-
 src/port/meson.build          |   6 +-
 src/port/pg_popcount_avx512.c |  86 +++++++++++++++++-
 10 files changed, 171 insertions(+), 204 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 1d33932cb5..33df694ae7 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -748,20 +748,20 @@ undefine([Ac_cachevar])dnl
 # Check if the compiler supports the XSAVE instructions using the _xgetbv
 # intrinsic function.
 #
-# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
-# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+# If the intrinsics are supported, sets pgac_xsave_intrinsics.
 AC_DEFUN([PGAC_XSAVE_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
-  [return _xgetbv(0) & 0xe0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl
+AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    __attribute__((target("xsave")))
+    static int xsave_test(void)
+    {
+      return _xgetbv(0) & 0xe0;
+    }],
+  [return xsave_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_XSAVE="$1"
   pgac_xsave_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
@@ -773,29 +773,27 @@ undefine([Ac_cachevar])dnl
 # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
 # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
 #
-# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
-# -mavx512bw).  If the intrinsics are supported, sets
-# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics.
 AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
-  [const char buf@<:@sizeof(__m512i)@:>@;
-   PG_INT64_TYPE popcnt = 0;
-   __m512i accum = _mm512_setzero_si512();
-   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
-   const __m512i cnt = _mm512_popcnt_epi64(val);
-   accum = _mm512_add_epi64(accum, cnt);
-   popcnt = _mm512_reduce_add_epi64(accum);
-   /* return computed value, to prevent the above being optimized away */
-   return popcnt == 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    __attribute__((target("avx512vpopcntdq","avx512bw")))
+    static int popcount_test(void)
+    {
+      const char buf@<:@sizeof(__m512i)@:>@;
+      PG_INT64_TYPE popcnt = 0;
+      __m512i accum = _mm512_setzero_si512();
+      const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+      const __m512i cnt = _mm512_popcnt_epi64(val);
+      accum = _mm512_add_epi64(accum, cnt);
+      popcnt = _mm512_reduce_add_epi64(accum);
+      return (int) popcnt;
+    }],
+  [return popcount_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_POPCNT="$1"
   pgac_avx512_popcnt_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
diff --git a/configure b/configure
index 8af995e48f..38e7b1889b 100755
--- a/configure
+++ b/configure
@@ -647,9 +647,6 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
-PG_POPCNT_OBJS
-CFLAGS_POPCNT
-CFLAGS_XSAVE
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17270,185 +17267,99 @@ fi
 
 # Check for XSAVE intrinsics
 #
-CFLAGS_XSAVE=""
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
-if ${pgac_cv_xsave_intrinsics_+:} false; then :
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv" >&5
+$as_echo_n "checking for _xgetbv... " >&6; }
+if ${pgac_cv_xsave_intrinsics+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <immintrin.h>
-int
-main ()
-{
-return _xgetbv(0) & 0xe0;
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_xsave_intrinsics_=yes
-else
-  pgac_cv_xsave_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
-$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
-if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
-  CFLAGS_XSAVE=""
-  pgac_xsave_intrinsics=yes
-fi
-
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
-if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mxsave"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 #include <immintrin.h>
+    __attribute__((target("xsave")))
+    static int xsave_test(void)
+    {
+      return _xgetbv(0) & 0xe0;
+    }
 int
 main ()
 {
-return _xgetbv(0) & 0xe0;
+return xsave_test();
   ;
   return 0;
 }
 _ACEOF
 if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_xsave_intrinsics__mxsave=yes
+  pgac_cv_xsave_intrinsics=yes
 else
-  pgac_cv_xsave_intrinsics__mxsave=no
+  pgac_cv_xsave_intrinsics=no
 fi
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
-$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
-if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
-  CFLAGS_XSAVE="-mxsave"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics" >&5
+$as_echo "$pgac_cv_xsave_intrinsics" >&6; }
+if test x"$pgac_cv_xsave_intrinsics" = x"yes"; then
   pgac_xsave_intrinsics=yes
 fi
 
-fi
 if test x"$pgac_xsave_intrinsics" = x"yes"; then
 
 $as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
 
 fi
 
-
 # Check for AVX-512 popcount intrinsics
 #
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
 if test x"$host_cpu" = x"x86_64"; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <immintrin.h>
-int
-main ()
-{
-const char buf[sizeof(__m512i)];
-   PG_INT64_TYPE popcnt = 0;
-   __m512i accum = _mm512_setzero_si512();
-   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
-   const __m512i cnt = _mm512_popcnt_epi64(val);
-   accum = _mm512_add_epi64(accum, cnt);
-   popcnt = _mm512_reduce_add_epi64(accum);
-   /* return computed value, to prevent the above being optimized away */
-   return popcnt == 0;
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_avx512_popcnt_intrinsics_=yes
-else
-  pgac_cv_avx512_popcnt_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
-  CFLAGS_POPCNT=""
-  pgac_avx512_popcnt_intrinsics=yes
-fi
-
-  if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
-    { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 #include <immintrin.h>
+    __attribute__((target("avx512vpopcntdq","avx512bw")))
+    static int popcount_test(void)
+    {
+      const char buf[sizeof(__m512i)];
+      PG_INT64_TYPE popcnt = 0;
+      __m512i accum = _mm512_setzero_si512();
+      const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+      const __m512i cnt = _mm512_popcnt_epi64(val);
+      accum = _mm512_add_epi64(accum, cnt);
+      popcnt = _mm512_reduce_add_epi64(accum);
+      return (int) popcnt;
+    }
 int
 main ()
 {
-const char buf[sizeof(__m512i)];
-   PG_INT64_TYPE popcnt = 0;
-   __m512i accum = _mm512_setzero_si512();
-   const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
-   const __m512i cnt = _mm512_popcnt_epi64(val);
-   accum = _mm512_add_epi64(accum, cnt);
-   popcnt = _mm512_reduce_add_epi64(accum);
-   /* return computed value, to prevent the above being optimized away */
-   return popcnt == 0;
+return popcount_test();
   ;
   return 0;
 }
 _ACEOF
 if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes
+  pgac_cv_avx512_popcnt_intrinsics=yes
 else
-  pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no
+  pgac_cv_avx512_popcnt_intrinsics=no
 fi
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" = x"yes"; then
-  CFLAGS_POPCNT="-mavx512vpopcntdq -mavx512bw"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics" = x"yes"; then
   pgac_avx512_popcnt_intrinsics=yes
 fi
 
-  fi
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o"
 
 $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
   fi
 fi
 
-
-
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 # First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index a8c7911754..70c78d11fa 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,32 +2052,19 @@ fi
 
 # Check for XSAVE intrinsics
 #
-CFLAGS_XSAVE=""
-PGAC_XSAVE_INTRINSICS([])
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
-  PGAC_XSAVE_INTRINSICS([-mxsave])
-fi
+PGAC_XSAVE_INTRINSICS()
 if test x"$pgac_xsave_intrinsics" = x"yes"; then
   AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
 fi
-AC_SUBST(CFLAGS_XSAVE)
 
 # Check for AVX-512 popcount intrinsics
 #
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
 if test x"$host_cpu" = x"x86_64"; then
-  PGAC_AVX512_POPCNT_INTRINSICS([])
-  if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
-    PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
-  fi
+  PGAC_AVX512_POPCNT_INTRINSICS()
   if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
-    PG_POPCNT_OBJS="pg_popcount_avx512.o"
     AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
   fi
 fi
-AC_SUBST(CFLAGS_POPCNT)
-AC_SUBST(PG_POPCNT_OBJS)
 
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
diff --git a/meson.build b/meson.build
index fab6373fef..aefb64c094 100644
--- a/meson.build
+++ b/meson.build
@@ -2157,25 +2157,20 @@ endforeach
 # Check for the availability of XSAVE intrinsics.
 ###############################################################
 
-cflags_xsave = []
 if host_cpu == 'x86' or host_cpu == 'x86_64'
 
   prog = '''
 #include <immintrin.h>
 
+__attribute__((target("xsave")))
 int main(void)
 {
     return _xgetbv(0) & 0xe0;
 }
 '''
 
-  if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
-        args: test_c_args)
+  if cc.links(prog, name: 'XSAVE intrinsics', args: test_c_args)
     cdata.set('HAVE_XSAVE_INTRINSICS', 1)
-  elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
-        args: test_c_args + ['-mxsave'])
-    cdata.set('HAVE_XSAVE_INTRINSICS', 1)
-    cflags_xsave += '-mxsave'
   endif
 
 endif
@@ -2185,12 +2180,12 @@ endif
 # Check for the availability of AVX-512 popcount intrinsics.
 ###############################################################
 
-cflags_popcnt = []
 if host_cpu == 'x86_64'
 
   prog = '''
 #include <immintrin.h>
 
+__attribute__((target("avx512vpopcntdq","avx512bw")))
 int main(void)
 {
     const char buf[sizeof(__m512i)];
@@ -2205,13 +2200,9 @@ int main(void)
 }
 '''
 
-  if cc.links(prog, name: 'AVX-512 popcount without -mavx512vpopcntdq -mavx512bw',
+  if cc.links(prog, name: 'AVX-512 popcount',
         args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
     cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
-  elif cc.links(prog, name: 'AVX-512 popcount with -mavx512vpopcntdq -mavx512bw',
-        args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'] + ['-mavx512bw'])
-    cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
-    cflags_popcnt += ['-mavx512vpopcntdq'] + ['-mavx512bw']
   endif
 
 endif
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 42f50b4976..45696247e9 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,9 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
 CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
 CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
 CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
-CFLAGS_POPCNT = @CFLAGS_POPCNT@
 CFLAGS_CRC = @CFLAGS_CRC@
-CFLAGS_XSAVE = @CFLAGS_XSAVE@
 PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
 PERMIT_MISSING_VARIABLE_DECLARATIONS = @PERMIT_MISSING_VARIABLE_DECLARATIONS@
 CXXFLAGS = @CXXFLAGS@
@@ -762,9 +760,6 @@ LIBOBJS = @LIBOBJS@
 # files needed for the chosen CRC-32C implementation
 PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
 
-# files needed for the chosen popcount implementation
-PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
-
 LIBS := -lpgcommon -lpgport $(LIBS)
 
 # to make ws2_32.lib the last library
diff --git a/src/include/c.h b/src/include/c.h
index 55dec71a6d..6f5ca25542 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -174,6 +174,16 @@
 #define pg_attribute_nonnull(...)
 #endif
 
+/*
+ * pg_attribute_target allows specifying different target options that the
+ * function should be compiled with (e.g., for using special CPU instructions).
+ */
+#if __has_attribute (target)
+#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
+#else
+#define pg_attribute_target(...)
+#endif
+
 /*
  * Append PG_USED_FOR_ASSERTS_ONLY to definitions of variables that are only
  * used in assert-enabled builds, to avoid compiler warnings about unused
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 850e927584..479aa08420 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -102,10 +102,8 @@ pgxs_kv = {
     ' '.join(cflags_no_missing_var_decls),
 
   'CFLAGS_CRC': ' '.join(cflags_crc),
-  'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
   'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
   'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
-  'CFLAGS_XSAVE': ' '.join(cflags_xsave),
 
   'LDFLAGS': var_ldflags,
   'LDFLAGS_EX': var_ldflags_ex,
@@ -181,7 +179,7 @@ pgxs_empty = [
   'WANTED_LANGUAGES',
 
   # Not needed because we don't build the server / PLs with the generated makefile
-  'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
+  'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
   'DTRACEFLAGS', # only server has dtrace probes
 
   'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index b72deed50e..42c02f1b3d 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,7 +38,6 @@ LIBS += $(PTHREAD_LIBS)
 OBJS = \
 	$(LIBOBJS) \
 	$(PG_CRC32C_OBJS) \
-	$(PG_POPCNT_OBJS) \
 	bsearch_arg.o \
 	chklocale.o \
 	inet_net_ntop.o \
@@ -46,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_hw_feat_check.o \
+	pg_popcount_avx512.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
@@ -103,11 +103,6 @@ pg_hw_feat_check.o:	CFLAGS+=$(CFLAGS_XSAVE)
 pg_hw_feat_check_shlib.o:	CFLAGS+=$(CFLAGS_XSAVE)
 pg_hw_feat_check_srv.o:	CFLAGS+=$(CFLAGS_XSAVE)
 
-# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
-pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
-
 #
 # Shared library versions of object files
 #
diff --git a/src/port/meson.build b/src/port/meson.build
index b53c33c8eb..3f17cd2f8d 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
@@ -89,7 +90,6 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
   ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
 
   # arm / aarch64
@@ -105,8 +105,8 @@ replace_funcs_pos = [
   ['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
 ]
 
-pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
-pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
+pgport_cflags = {'crc': cflags_crc}
+pgport_sources_cflags = {'crc': []}
 
 foreach f : replace_funcs_neg
   func = f.get(0)
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 9d3149e2d0..b598e86554 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -12,7 +12,17 @@
  */
 #include "c.h"
 
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 #include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
 
 #include "port/pg_bitutils.h"
 
@@ -21,12 +31,82 @@
  * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
  * the function pointers that are only used when TRY_POPCNT_FAST is set.
  */
-#ifdef TRY_POPCNT_FAST
+#if defined(TRY_POPCNT_FAST) && defined(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK)
+
+/*
+ * Does CPUID say there's support for XSAVE instructions?
+ */
+static inline bool
+xsave_available(void)
+{
+	unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that xsave_available() returns true
+ * before calling this.
+ */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+static inline bool
+avx512_popcnt_available(void)
+{
+	unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
+		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+	return xsave_available() &&
+		zmm_regs_available() &&
+		avx512_popcnt_available();
+}
 
 /*
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
  */
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
@@ -82,6 +162,7 @@ pg_popcount_avx512(const char *buf, int bytes)
  * pg_popcount_masked_avx512
  *		Returns the number of 1-bits in buf after applying the mask to each byte
  */
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
 uint64
 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
 {
@@ -138,4 +219,5 @@ pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
 	return _mm512_reduce_add_epi64(accum);
 }
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_FAST &&
+								 * USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.43.0
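
Note on the patch above: it drops the per-file CFLAGS in favor of a per-function target attribute, and still guards every call with a runtime CPU check. Below is a minimal sketch of that pattern, assuming GCC or Clang on x86-64, using the simpler SSE 4.2 CRC instruction rather than AVX-512. The names crc32c_portable, crc32c_hw, demo_crc32c and DEMO_HAVE_SSE42_TARGET are hypothetical, and __builtin_cpu_supports() is a compiler builtin used here in place of the patch's own CPUID/XGETBV checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only attempt the hardware path where the attribute and header exist. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>
#define DEMO_HAVE_SSE42_TARGET 1
#endif

/* Portable bitwise CRC-32C (reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_portable(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len-- > 0)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

#ifdef DEMO_HAVE_SSE42_TARGET
/*
 * Only this function is compiled with SSE 4.2 enabled; the rest of the
 * file keeps the baseline instruction set, so no -msse4.2 flag is needed.
 */
__attribute__((target("sse4.2")))
static uint32_t
crc32c_hw(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len-- > 0)
		crc = _mm_crc32_u8(crc, *p++);
	return crc;
}
#endif

/* Init, pick an implementation at runtime, finalize. */
static uint32_t
demo_crc32c(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFF;

	assert(data != NULL || len == 0);

#ifdef DEMO_HAVE_SSE42_TARGET
	/* The attribute does not check the CPU; the caller must. */
	if (__builtin_cpu_supports("sse4.2"))
		return crc32c_hw(crc, p, len) ^ 0xFFFFFFFF;
#endif
	return crc32c_portable(crc, p, len) ^ 0xFFFFFFFF;
}
```

Both paths return the same value for the same input, so the choice only affects speed. The runtime check is still required because the attribute only tells the compiler it may emit those instructions inside that one function.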

Attachment: v6-0006-Use-__attribute__-target-.-for-SSE42-and-AVX512-C.patch (application/octet-stream)
From 7ac29b452f024e43cb34e77600f3d232d4a80874 Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Date: Mon, 21 Oct 2024 14:26:22 -0700
Subject: [PATCH v6 6/6] Use __attribute__((target(...))) for SSE42 and AVX512
 CRC32C

---
 config/c-compiler.m4                          |  88 ++---
 configure                                     | 350 ++++++------------
 configure.ac                                  | 130 +++----
 meson.build                                   |  30 +-
 src/include/pg_config.h.in                    |   6 +-
 src/include/pg_cpu.h                          |  23 ++
 src/include/port/pg_crc32c.h                  |  71 +---
 src/port/Makefile                             |  10 -
 src/port/meson.build                          |  24 +-
 src/port/pg_crc32c_avx512.c                   |   5 +
 src/port/pg_crc32c_avx512_choose.c            |  42 ---
 src/port/pg_crc32c_sse42.c                    |   4 +
 ..._sse42_choose.c => pg_crc32c_x86_choose.c} |  27 +-
 src/port/pg_hw_feat_check.c                   |   3 +
 src/port/pg_popcount_avx512.c                 |  78 +---
 15 files changed, 297 insertions(+), 594 deletions(-)
 create mode 100644 src/include/pg_cpu.h
 delete mode 100644 src/port/pg_crc32c_avx512_choose.c
 rename src/port/{pg_crc32c_sse42_choose.c => pg_crc32c_x86_choose.c} (58%)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 33df694ae7..d7b3ceeb60 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -608,21 +608,22 @@ fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
 # An optional compiler flag can be passed as argument (e.g. -msse4.2). If the
 # intrinsics are supported, sets pgac_sse42_crc32_intrinsics, and CFLAGS_CRC.
 AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>],
-  [unsigned int crc = 0;
-   crc = _mm_crc32_u8(crc, 0);
-   crc = _mm_crc32_u32(crc, 0);
-   /* return computed value, to prevent the above being optimized away */
-   return crc == 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32 with function attribute], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>
+    __attribute__((target("sse4.2")))
+    static int crc32_sse42_test(void)
+    {
+      unsigned int crc = 0;
+      crc = _mm_crc32_u8(crc, 0);
+      crc = _mm_crc32_u32(crc, 0);
+      /* return computed value, to prevent the above being optimized away */
+      return crc == 0;
+    }],
+  [return crc32_sse42_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_CRC="$1"
   pgac_sse42_crc32_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
@@ -639,44 +640,45 @@ undefine([Ac_cachevar])dnl
 # An optional compiler flag can be passed as arguments (e.g. -msse4.2
 # -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
 # pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+
 AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
-  [const unsigned long k1k2[[8]] = {
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
-  unsigned char buffer[[512]];
-  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
-  unsigned long val;
-  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
-  __m128i a1, a2;
-  unsigned int crc = 0xffffffff;
-  y8 = _mm512_load_si512((__m512i *)aligned);
-  x0 = _mm512_loadu_si512((__m512i *)k1k2);
-  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
-  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
-  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
-  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
-  a1 = _mm512_extracti32x4_epi32(x1, 3);
-  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
-  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
-  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
-  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
-  return crc != 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128 with function attribute], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    __attribute__((target("avx512f","avx512vl","vpclmulqdq")))
+    static int crc32_avx512_test(void)
+    {
+      const unsigned long k1k2[[8]] = {
+      0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+      0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+      unsigned char buffer[[512]];
+      unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+      unsigned long val;
+      __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+      __m128i a1, a2;
+      unsigned int crc = 0xffffffff;
+      y8 = _mm512_load_si512((__m512i *)aligned);
+      x0 = _mm512_loadu_si512((__m512i *)k1k2);
+      x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+      x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+      x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+      x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+      a1 = _mm512_extracti32x4_epi32(x1, 3);
+      a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+      x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+      val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+      crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+      return crc != 0;
+    }],
+  [return crc32_avx512_test();])],
   [Ac_cachevar=yes],
-  [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+  [Ac_cachevar=no])])
 if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_CRC="$1"
   pgac_avx512_crc32_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_CRC32_INTRINSICS
 
-
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
 # Check if the compiler supports the CRC32C instructions using the __crc32cb,
diff --git a/configure b/configure
index 38e7b1889b..99bbeaf5c5 100755
--- a/configure
+++ b/configure
@@ -14728,7 +14728,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14774,7 +14774,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14798,7 +14798,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14843,7 +14843,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -14867,7 +14867,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
+#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -17360,206 +17360,111 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
-#
-# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
-# with the default compiler flags. If not, check if adding the -msse4.2
-# flag helps. CFLAGS_CRC is set to -msse4.2 if that's required.
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=" >&5
-$as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=... " >&6; }
-if ${pgac_cv_sse42_crc32_intrinsics_+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <nmmintrin.h>
-int
-main ()
-{
-unsigned int crc = 0;
-   crc = _mm_crc32_u8(crc, 0);
-   crc = _mm_crc32_u32(crc, 0);
-   /* return computed value, to prevent the above being optimized away */
-   return crc == 0;
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_sse42_crc32_intrinsics_=yes
-else
-  pgac_cv_sse42_crc32_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_crc32_intrinsics_" >&5
-$as_echo "$pgac_cv_sse42_crc32_intrinsics_" >&6; }
-if test x"$pgac_cv_sse42_crc32_intrinsics_" = x"yes"; then
-  CFLAGS_CRC=""
-  pgac_sse42_crc32_intrinsics=yes
-fi
-
-if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=-msse4.2" >&5
-$as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=-msse4.2... " >&6; }
-if ${pgac_cv_sse42_crc32_intrinsics__msse4_2+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -msse4.2"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <nmmintrin.h>
-int
-main ()
-{
-unsigned int crc = 0;
-   crc = _mm_crc32_u8(crc, 0);
-   crc = _mm_crc32_u32(crc, 0);
-   /* return computed value, to prevent the above being optimized away */
-   return crc == 0;
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_sse42_crc32_intrinsics__msse4_2=yes
-else
-  pgac_cv_sse42_crc32_intrinsics__msse4_2=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_crc32_intrinsics__msse4_2" >&5
-$as_echo "$pgac_cv_sse42_crc32_intrinsics__msse4_2" >&6; }
-if test x"$pgac_cv_sse42_crc32_intrinsics__msse4_2" = x"yes"; then
-  CFLAGS_CRC="-msse4.2"
-  pgac_sse42_crc32_intrinsics=yes
-fi
-
-fi
-
 # Check for Intel AVX-512 intrinsics to do CRC calculations.
 #
 # First check if the _mm512_clmulepi64_epi128 and more intrinsics can
 # be used with the default compiler flags. If not, check if adding
-# the -msse4.2, -mavx512vl and -mvpclmulqdqif flag helps. CFLAGS_CRC
-# is set to -msse4.2, -mavx512vl and -mvpclmulqdqif that's required.
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=" >&5
-$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=... " >&6; }
-if ${pgac_cv_avx512_crc32_intrinsics_+:} false; then :
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128 with function attribute" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128 with function attribute... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 #include <immintrin.h>
+    __attribute__((target("avx512f","avx512vl","vpclmulqdq")))
+    static int crc32_avx512_test(void)
+    {
+      const unsigned long k1k2[8] = {
+      0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+      0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+      unsigned char buffer[512];
+      unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+      unsigned long val;
+      __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+      __m128i a1, a2;
+      unsigned int crc = 0xffffffff;
+      y8 = _mm512_load_si512((__m512i *)aligned);
+      x0 = _mm512_loadu_si512((__m512i *)k1k2);
+      x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+      x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+      x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+      x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+      a1 = _mm512_extracti32x4_epi32(x1, 3);
+      a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+      x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+      val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+      crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+      return crc != 0;
+    }
 int
 main ()
 {
-const unsigned long k1k2[8] = {
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
-  unsigned char buffer[512];
-  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
-  unsigned long val;
-  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
-  __m128i a1, a2;
-  unsigned int crc = 0xffffffff;
-  y8 = _mm512_load_si512((__m512i *)aligned);
-  x0 = _mm512_loadu_si512((__m512i *)k1k2);
-  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
-  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
-  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
-  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
-  a1 = _mm512_extracti32x4_epi32(x1, 3);
-  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
-  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
-  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
-  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
-  return crc != 0;
+return crc32_avx512_test();
   ;
   return 0;
 }
 _ACEOF
 if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_avx512_crc32_intrinsics_=yes
+  pgac_cv_avx512_crc32_intrinsics=yes
 else
-  pgac_cv_avx512_crc32_intrinsics_=no
+  pgac_cv_avx512_crc32_intrinsics=no
 fi
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics_" >&5
-$as_echo "$pgac_cv_avx512_crc32_intrinsics_" >&6; }
-if test x"$pgac_cv_avx512_crc32_intrinsics_" = x"yes"; then
-  CFLAGS_CRC=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics" = x"yes"; then
   pgac_avx512_crc32_intrinsics=yes
 fi
 
-if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq" >&5
-$as_echo_n "checking for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=-msse4.2 -mavx512vl -mvpclmulqdq... " >&6; }
-if ${pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq+:} false; then :
+
+# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+#
+# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
+# with the default compiler flags. If not, check if adding the -msse4.2
+# flag helps. CFLAGS_CRC is set to -msse4.2 if that's required.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32 with function attribute" >&5
+$as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32 with function attribute... " >&6; }
+if ${pgac_cv_sse42_crc32_intrinsics+:} false; then :
   $as_echo_n "(cached) " >&6
 else
-  pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -msse4.2 -mavx512vl -mvpclmulqdq"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
-#include <immintrin.h>
+#include <nmmintrin.h>
+    __attribute__((target("sse4.2")))
+    static int crc32_sse42_test(void)
+    {
+      unsigned int crc = 0;
+      crc = _mm_crc32_u8(crc, 0);
+      crc = _mm_crc32_u32(crc, 0);
+      /* return computed value, to prevent the above being optimized away */
+      return crc == 0;
+    }
 int
 main ()
 {
-const unsigned long k1k2[8] = {
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
-  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
-  unsigned char buffer[512];
-  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
-  unsigned long val;
-  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
-  __m128i a1, a2;
-  unsigned int crc = 0xffffffff;
-  y8 = _mm512_load_si512((__m512i *)aligned);
-  x0 = _mm512_loadu_si512((__m512i *)k1k2);
-  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
-  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
-  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
-  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
-  a1 = _mm512_extracti32x4_epi32(x1, 3);
-  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
-  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
-  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
-  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
-  return crc != 0;
+return crc32_sse42_test();
   ;
   return 0;
 }
 _ACEOF
 if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=yes
+  pgac_cv_sse42_crc32_intrinsics=yes
 else
-  pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq=no
+  pgac_cv_sse42_crc32_intrinsics=no
 fi
 rm -f core conftest.err conftest.$ac_objext \
     conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
 fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&5
-$as_echo "$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" >&6; }
-if test x"$pgac_cv_avx512_crc32_intrinsics__msse4_2__mavx512vl__mvpclmulqdq" = x"yes"; then
-  CFLAGS_CRC="-msse4.2 -mavx512vl -mvpclmulqdq"
-  pgac_avx512_crc32_intrinsics=yes
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_crc32_intrinsics" >&5
+$as_echo "$pgac_cv_sse42_crc32_intrinsics" >&6; }
+if test x"$pgac_cv_sse42_crc32_intrinsics" = x"yes"; then
+  pgac_sse42_crc32_intrinsics=yes
 fi
 
-fi
 
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
@@ -17714,6 +17619,7 @@ fi
 
 
 
+
 # Select CRC-32C implementation.
 #
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
@@ -17733,108 +17639,72 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel AVX 512 if available.
-  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
-    USE_AVX512_CRC32C=1
-  else
-   # Use Intel SSE 4.2 if available.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-      USE_SSE42_CRC32C=1
-    else
-      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
-      # the runtime check.
-      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
-      else
-        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-        # the runtime check.
-        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # Use ARM CRC Extension if available.
-          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-            USE_ARMV8_CRC32C=1
-          else
-            # ARM CRC Extension, with runtime check?
-            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-            else
-              # LoongArch CRCC instructions.
-              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-                USE_LOONGARCH_CRC32C=1
-              else
-                # fall back to slicing-by-8 algorithm, which doesn't require any
-                # special CPU support.
-                USE_SLICING_BY_8_CRC32C=1
-              fi
-            fi
-          fi
-        fi
-      fi
-    fi
-  fi
-fi
-
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking which CRC-32C implementation to use" >&5
 $as_echo_n "checking which CRC-32C implementation to use... " >&6; }
-if test x"$USE_SSE42_CRC32C" = x"1"; then
+if test x"$host_cpu" = x"x86_64"; then
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
 
 $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
 
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
-$as_echo "SSE 4.2" >&6; }
-else
-  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C baseline feature SSE 4.2" >&5
+$as_echo "CRC32C baseline feature SSE 4.2" >&6; }
+    else
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
 
-$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+$as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: AVX 512 with runtime check" >&5
-$as_echo "AVX 512 with runtime check" >&6; }
-  else
-    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+          PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C SSE42 with runtime check" >&5
+$as_echo "CRC32C SSE42 with runtime check" >&6; }
+        fi
+    fi
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
 
-$as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
-$as_echo "SSE 4.2 with runtime check" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C" = x"1"; then
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_avx512.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C AVX-512 with runtime check" >&5
+$as_echo "CRC32C AVX-512 with runtime check" >&6; }
+    fi
+else
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-      else
-        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  else
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-        else
-          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+    else
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-            { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-          else
+      else
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-            { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
-          fi
-        fi
       fi
     fi
   fi
diff --git a/configure.ac b/configure.ac
index 70c78d11fa..c2d516adae 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2066,26 +2066,19 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for Intel AVX-512 intrinsics to do CRC calculations.
+#
+# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
+# be used with the default compiler flags. If not, check if adding
+# the -msse4.2, -mavx512vl and -mvpclmulqdq flags helps.
+PGAC_AVX512_CRC32_INTRINSICS()
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 # First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
 # with the default compiler flags. If not, check if adding the -msse4.2
 # flag helps. CFLAGS_CRC is set to -msse4.2 if that's required.
-PGAC_SSE42_CRC32_INTRINSICS([])
-if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
-  PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
-fi
-
-# Check for Intel AVX-512 intrinsics to do CRC calculations.
-#
-# First check if the _mm512_clmulepi64_epi128 and more intrinsics can
-# be used with the default compiler flags. If not, check if adding
-# the -msse4.2, -mavx512vl and -mvpclmulqdqif flag helps. CFLAGS_CRC
-# is set to -msse4.2, -mavx512vl and -mvpclmulqdqif that's required.
-PGAC_AVX512_CRC32_INTRINSICS([])
-if test x"$pgac_avx512_crc32_intrinsics" != x"yes"; then
-  PGAC_AVX512_CRC32_INTRINSICS([-msse4.2 -mavx512vl -mvpclmulqdq])
-fi
+PGAC_SSE42_CRC32_INTRINSICS()
 
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
@@ -2113,6 +2106,7 @@ PGAC_LOONGARCH_CRC32C_INTRINSICS()
 
 AC_SUBST(CFLAGS_CRC)
 
+
 # Select CRC-32C implementation.
 #
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
@@ -2132,86 +2126,50 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel AVX 512 if available.
-  if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && test x"$AVX512_TARGETED" = x"1" ; then
-    USE_AVX512_CRC32C=1
-  else
-   # Use Intel SSE 4.2 if available.
+AC_MSG_CHECKING([which CRC-32C implementation to use])
+if test x"$host_cpu" = x"x86_64"; then
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
     if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-      USE_SSE42_CRC32C=1
+      AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+      AC_MSG_RESULT(CRC32C baseline feature SSE 4.2)
     else
-      # Intel AVX 512, with runtime check? The CPUID instruction is needed for
-      # the runtime check.
-      if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-          USE_AVX512_CRC32C_WITH_RUNTIME_CHECK=1
-      else
-        # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-        # the runtime check.
         if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-          USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # Use ARM CRC Extension if available.
-          if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-            USE_ARMV8_CRC32C=1
-          else
-            # ARM CRC Extension, with runtime check?
-            if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-              USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-            else
-              # LoongArch CRCC instructions.
-              if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-                USE_LOONGARCH_CRC32C=1
-              else
-                # fall back to slicing-by-8 algorithm, which doesn't require any
-                # special CPU support.
-                USE_SLICING_BY_8_CRC32C=1
-              fi
-            fi
-          fi
+          AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+          PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+          AC_MSG_RESULT(CRC32C SSE42 with runtime check)
         fi
-      fi
     fi
-  fi
-fi
-
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
-AC_MSG_CHECKING([which CRC-32C implementation to use])
-if test x"$USE_SSE42_CRC32C" = x"1"; then
-  AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  AC_MSG_RESULT(SSE 4.2)
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+      AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_avx512.o"
+      AC_MSG_RESULT(CRC32C AVX-512 with runtime check)
+    fi
 else
-  if test x"$USE_AVX512_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX 512 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_avx512.o pg_crc32c_sb8.o pg_crc32c_sse42.o pg_crc32c_avx512_choose.o"
-    AC_MSG_RESULT(AVX 512 with runtime check)
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+    AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    AC_MSG_RESULT(ARMv8 CRC instructions)
   else
-    if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-      AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-      PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-      AC_MSG_RESULT(SSE 4.2 with runtime check)
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+      AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions)
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+        AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        AC_MSG_RESULT(LoongArch CRCC instructions)
       else
-        if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-          AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-          PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-          AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
-        else
-          if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-            AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-            PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-            AC_MSG_RESULT(LoongArch CRCC instructions)
-          else
-            AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-            PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-            AC_MSG_RESULT(slicing-by-8)
-          fi
-        fi
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
+        AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        AC_MSG_RESULT(slicing-by-8)
       fi
     fi
   fi
diff --git a/meson.build b/meson.build
index aefb64c094..5ec7975108 100644
--- a/meson.build
+++ b/meson.build
@@ -2233,9 +2233,10 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
     have_optimized_crc = true
   else
-    avx_prog = '''
+    avx512_crc_prog = '''
 #include <immintrin.h>
 
+__attribute__((target("avx512vl","vpclmulqdq")))
 int main(void)
 {
   const unsigned long k1k2[8] = {
@@ -2262,9 +2263,12 @@ int main(void)
 }
 '''
 
-    prog = '''
+    sse42_crc_prog = '''
 #include <nmmintrin.h>
 
+#ifdef TEST_SSE42_WITH_ATTRIBUTE
+__attribute__((target("sse4.2")))
+#endif
 int main(void)
 {
     unsigned int crc = 0;
@@ -2274,29 +2278,25 @@ int main(void)
     return crc == 0;
 }
 '''
-
-    if cc.links(avx_prog,
-          name: '_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl -mvpclmulqdq',
-          args: test_c_args + ['-msse4.2', '-mavx512vl', '-mvpclmulqdq'])
-      cflags_crc += ['-msse4.2','-mavx512vl','-mvpclmulqdq']
-      cdata.set('USE_AVX512_CRC32C', false)
-      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
-      have_optimized_crc = true
-    endif
-    if have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
+    if cc.links(sse42_crc_prog, name: 'CRC32C baseline feature SSE4.2',
           args: test_c_args)
       # Use Intel SSE 4.2 unconditionally.
       cdata.set('USE_SSE42_CRC32C', 1)
       have_optimized_crc = true
-    elif have_optimized_crc == false and cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
-          args: test_c_args + ['-msse4.2'])
+    elif cc.links(sse42_crc_prog, name: 'SSE4.2 CRC32C with function attributes',
+          args: test_c_args + ['-DTEST_SSE42_WITH_ATTRIBUTE'])
       # Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
       # the runtime check.
-      cflags_crc += '-msse4.2'
       cdata.set('USE_SSE42_CRC32C', false)
       cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
       have_optimized_crc = true
     endif
+    if cc.links(avx512_crc_prog,
+          name: 'AVX512 CRC32C with function attributes',
+          args: test_c_args)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
 
   endif
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 65623df7f9..2c9278329b 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -663,6 +663,9 @@
 /* Define to 1 to build with assertion checks. (--enable-cassert) */
 #undef USE_ASSERT_CHECKING
 
+/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
@@ -712,9 +715,6 @@
 /* Define to 1 use Intel SSE 4.2 CRC instructions. */
 #undef USE_SSE42_CRC32C
 
-/* Define to 1 to use Intel AVX 512 CRC instructions with a runtime check. */
-#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
-
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
diff --git a/src/include/pg_cpu.h b/src/include/pg_cpu.h
new file mode 100644
index 0000000000..223994cb0d
--- /dev/null
+++ b/src/include/pg_cpu.h
@@ -0,0 +1,23 @@
+/*
+ * pg_cpu.h
+ *      Useful macros to determine CPU types
+ */
+
+#ifndef PG_CPU_H_
+#define PG_CPU_H_
+#if defined( __i386__ ) || defined(i386) || defined(_M_IX86)
+    /*
+     * __i386__ is defined by gcc and Intel compiler on Linux,
+     * _M_IX86 by VS compiler,
+     * i386 by Sun compilers on opensolaris at least
+     */
+    #define PG_CPU_X86
+#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64)
+    /*
+     * both __x86_64__ and __amd64__ are defined by gcc
+     * __x86_64 defined by sun compiler on opensolaris at least
+     * _M_AMD64 defined by MS compiler
+     */
+    #define PG_CPU_x86_64
+#endif
+#endif // PG_CPU_H_
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 3f83d9f815..935c089eb6 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -33,6 +33,7 @@
 #ifndef PG_CRC32C_H
 #define PG_CRC32C_H
 
+#include "pg_cpu.h"
 #include "port/pg_bswap.h"
 
 typedef uint32 pg_crc32c;
@@ -42,73 +43,35 @@ typedef uint32 pg_crc32c;
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
 #define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
-#if defined(USE_SSE42_CRC32C)
-/* Use Intel SSE4.2 instructions. */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
-
+/* x86 */
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined (USE_AVX512_CRC32)
-/* Use Intel AVX512 instructions. */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c_avx512((crc), (data), (len)))
-
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* ARMV8 */
 #elif defined(USE_ARMV8_CRC32C)
-/* Use ARMv8 CRC Extension instructions. */
-
+extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
 
+/* ARMV8 with runtime check */
+#elif defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* LoongArch */
 #elif defined(USE_LOONGARCH_CRC32C)
-/* Use LoongArch CRCC instructions. */
-
+extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
 
-extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
-
-/*
- * Use Intel AVX-512 instructions, but perform a runtime check first to check that
- * they are available.
- */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = ((len) < 256 ? \
-		pg_comp_crc32c_sse42((crc), (data), (len)) : \
-		pg_comp_crc32c((crc), (data), (len))))
-
-extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c)(pg_crc32c crc, const void *data, size_t len);
-
-extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
-
-extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
-
-/*
- * Use Intel SSE 4.2 or ARMv8 instructions, but perform a runtime check first
- * to check that they are available.
- */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c((crc), (data), (len)))
-
-extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
-
-#ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-#endif
-#ifdef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
-#endif
-
 #else
 /*
  * Use slicing-by-8 algorithm.
diff --git a/src/port/Makefile b/src/port/Makefile
index 42c02f1b3d..805509b830 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -83,16 +83,6 @@ libpgport.a: $(OBJS)
 	rm -f $@
 	$(AR) $(AROPT) $@ $^
 
-# all versions of pg_crc32c_sse42.o need CFLAGS_CRC
-pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
-
-# all versions of pg_crc32c_avx512.o need CFLAGS_CRC
-pg_crc32c_avx512.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_avx512_shlib.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_avx512_srv.o: CFLAGS+=$(CFLAGS_CRC)
-
 # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 3f17cd2f8d..13c4be8ce2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,7 +7,6 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
-  'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
@@ -23,6 +22,17 @@ pgport_sources = [
   'tar.c',
 ]
 
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+  pgport_sources += files(
+  'pg_hw_feat_check.c',
+  'pg_popcount_avx512.c',
+  'pg_crc32c_x86_choose.c',
+  'pg_crc32c_avx512.c',
+  'pg_crc32c_sse42.c',
+  'pg_crc32c_sb8.c',
+    )
+endif
+
 if host_system == 'windows'
   pgport_sources += files(
     'dirmod.c',
@@ -80,18 +90,6 @@ endif
 # Replacement functionality to be built if corresponding configure symbol
 # is true
 replace_funcs_pos = [
-  # x86/x64
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
-  ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C'],
-  ['pg_crc32c_avx512', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
-  ['pg_crc32c_avx512_choose', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sse42', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
-  ['pg_crc32c_sb8', 'USE_AVX512_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_hw_feat_check', 'HAVE_XSAVE_INTRINSICS', 'xsave'],
-
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
index 98353f7e1d..3687f69da2 100644
--- a/src/port/pg_crc32c_avx512.c
+++ b/src/port/pg_crc32c_avx512.c
@@ -57,7 +57,11 @@
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
+
+#if defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
 pg_attribute_no_sanitize_alignment()
+pg_attribute_target("avx512vl", "vpclmulqdq")
 inline pg_crc32c
 pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
 {
@@ -195,3 +199,4 @@ pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
 	 */
 	return pg_comp_crc32c_sse42(crc, input, length);
 }
+#endif // AVX512_CRC32
diff --git a/src/port/pg_crc32c_avx512_choose.c b/src/port/pg_crc32c_avx512_choose.c
deleted file mode 100644
index 4f11c278be..0000000000
--- a/src/port/pg_crc32c_avx512_choose.c
+++ /dev/null
@@ -1,42 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_crc32c_avx512_choose.c
- *	  Choose between Intel AVX-512 and software CRC-32C implementation.
- *
- * On first call, checks if the CPU we're running on supports Intel AVX-
- * 512. If it does, use the special AVX-512 instructions for CRC-32C
- * computation. Otherwise, fall back to the pure software implementation
- * (slicing-by-8).
- *
- * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/port/pg_crc32c_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "c.h"
-
-#include "port/pg_crc32c.h"
-#include "port/pg_hw_feat_check.h"
-
-
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static pg_crc32c
-pg_comp_avx512_choose(pg_crc32c crc, const void *data, size_t len)
-{
-	if (pg_crc32c_avx512_available())
-		pg_comp_crc32c = pg_comp_crc32c_avx512;
-	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
-	return pg_comp_crc32c(crc, data, len);
-}
-
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_avx512_choose;
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index 7f88c11480..0d6829af5c 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -18,7 +18,10 @@
 
 #include "port/pg_crc32c.h"
 
+#if defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
+
 pg_attribute_no_sanitize_alignment()
+pg_attribute_target("sse4.2")
 pg_crc32c
 pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
 {
@@ -67,3 +70,4 @@ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
 
 	return crc;
 }
+#endif // SSE42_CRC32
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_x86_choose.c
similarity index 58%
rename from src/port/pg_crc32c_sse42_choose.c
rename to src/port/pg_crc32c_x86_choose.c
index 36e6949362..fa028327fb 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_x86_choose.c
@@ -1,19 +1,18 @@
 /*-------------------------------------------------------------------------
  *
- * pg_crc32c_sse42_choose.c
- *	  Choose between Intel SSE 4.2 and software CRC-32C implementation.
+ * pg_crc32c_x86_choose.c
+ *	  Choose between Intel AVX-512, SSE 4.2 and software CRC-32C implementation.
  *
- * On first call, checks if the CPU we're running on supports Intel SSE
- * 4.2. If it does, use the special SSE instructions for CRC-32C
- * computation. Otherwise, fall back to the pure software implementation
- * (slicing-by-8).
+ * On first call, checks if the CPU we're running on supports Intel AVX-512 or
+ * SSE 4.2. If it does, use the matching instructions for CRC-32C computation.
+ * Otherwise, fall back to the pure software implementation (slicing-by-8).
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
- *	  src/port/pg_crc32c_sse42_choose.c
+ *	  src/port/pg_crc32c_x86_choose.c
  *
  *-------------------------------------------------------------------------
  */
@@ -30,11 +29,17 @@
 static pg_crc32c
 pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 {
-	if (pg_crc32c_sse42_available())
+        pg_comp_crc32c = pg_comp_crc32c_sb8;
+#ifdef USE_SSE42_CRC32C
+        pg_comp_crc32c = pg_comp_crc32c_sse42;
+#elif USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
+        if (pg_crc32c_sse42_available())
 		pg_comp_crc32c = pg_comp_crc32c_sse42;
-	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
+#endif
+#ifdef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+	if (pg_crc32c_avx512_available())
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+#endif
 	return pg_comp_crc32c(crc, data, len);
 }
 
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 35d6f9cdb1..c697d25b76 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -96,6 +96,9 @@ osxsave_available(void)
  * NB: Caller is responsible for verifying that osxsave_available() returns true
  * before calling this.
  */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
 inline static bool
 zmm_regs_available(void)
 {
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index b598e86554..6f18561cfb 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -12,19 +12,12 @@
  */
 #include "c.h"
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 #include <immintrin.h>
 #endif
 
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
 #include "port/pg_bitutils.h"
+#include "port/pg_hw_feat_check.h"
 
 /*
  * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
@@ -33,75 +26,6 @@
  */
 #if defined(TRY_POPCNT_FAST) && defined(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK)
 
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-#ifdef HAVE_XSAVE_INTRINSICS
-pg_attribute_target("xsave")
-#endif
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
 /*
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
-- 
2.43.0

#38Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#36)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Tue, Oct 29, 2024 at 09:00:17PM +0000, Devulapalli, Raghuveer wrote:

(1) The SSE42 and AVX-512 CRC32C also use function attributes to build
with ISA-specific flags.

Would you mind moving the function attribute change for the existing SSE
4.2 code to its own patch? I think that is pretty straightforward, and
IMHO it'd be nice to take care of it first so that we can focus on the new
stuff.
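
For readers following along, the combination being discussed (a per-function target attribute plus a function pointer that is resolved on first call) can be sketched roughly as below. This is only an illustration and not the patch itself: the names crc32c_portable, crc32c_sse42, crc32c_choose, crc32c_impl and HAVE_SSE42_TARGET are hypothetical, the fallback is a simple bit-at-a-time loop rather than slicing-by-8, and the runtime test uses the GCC/clang builtin __builtin_cpu_supports() instead of the CPUID helpers in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
	(defined(__x86_64__) || defined(__i386__))
#include <nmmintrin.h>
#define HAVE_SSE42_TARGET 1		/* hypothetical macro, for this sketch only */
#endif

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

/* Portable bit-at-a-time CRC-32C (reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_portable(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

#ifdef HAVE_SSE42_TARGET
/* Only this function is compiled for SSE 4.2; no -msse4.2 flag is needed. */
__attribute__((target("sse4.2")))
static uint32_t
crc32c_sse42(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
		crc = _mm_crc32_u8(crc, *p++);
	return crc;
}
#endif

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Starts out pointing at the chooser, which replaces it on first use. */
static crc32c_fn crc32c_impl = crc32c_choose;

static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
	crc32c_impl = crc32c_portable;
#ifdef HAVE_SSE42_TARGET
	if (__builtin_cpu_supports("sse4.2"))
		crc32c_impl = crc32c_sse42;
#endif
	return crc32c_impl(crc, data, len);
}

/* Convenience wrapper: standard initial value and final XOR. */
static uint32_t
crc32c(const void *data, size_t len)
{
	assert(data != NULL || len == 0);
	return crc32c_impl(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

With something like this, the whole file can be built with the default flags, and only the attributed function may contain SSE 4.2 instructions, which is why the per-file CFLAGS_CRC handling can go away for x86.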

--
nathan

#39Andres Freund
andres@anarazel.de
In reply to: Devulapalli, Raghuveer (#37)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi,

On 2024-10-30 21:03:20 +0000, Devulapalli, Raghuveer wrote:

v6: Fixing build failure on Windows/MSVC.

Raghuveer

From b601e7b4ee9f25fd32e9d8d056bb20a03d755a8a Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v6 1/6] Add a Postgres SQL function for crc32c testing.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
src/test/modules/test_crc32c/Makefile | 20 +++++++++
.../modules/test_crc32c/test_crc32c--1.0.sql | 1 +
src/test/modules/test_crc32c/test_crc32c.c | 41 +++++++++++++++++++
.../modules/test_crc32c/test_crc32c.control | 4 ++
4 files changed, 66 insertions(+)
create mode 100644 src/test/modules/test_crc32c/Makefile
create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql
create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

Needs to be integrated with the meson based build as well.

+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64			count	= PG_GETARG_INT64(0);
+	int64			num		= PG_GETARG_INT64(1);
+	pg_crc32c		crc		= 0xFFFFFFFF;
+	const char*		data	= malloc((size_t)num);

This is computing a crc of uninitialized data. That's
a) undefined behaviour
b) means the return value is basically random
c) often will just CRC a lot of zeroes

From da26645ec8515e0e6d91e2311a83c3bb6649017e Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH v6 2/6] Move all HW checks to common file.

Would be good to actually include a justification here.

--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,159 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;

Shouldn't this be in some x86 specific ifdef?

+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86 CRC instructions added in AVX-512,
+# using the intrinsic functions:
+
+# (We don't test the 8-byte variant, _mm_crc32_u64, but it is assumed to
+# be present if the other ones are, on x86-64 platforms)
+#
+# An optional compiler flag can be passed as arguments (e.g. -msse4.2
+# -mavx512vl -mvpclmulqdq). If the intrinsics are supported, sets
+# pgac_avx512_crc32_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128, _mm512_clmulepi64_epi128... with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const unsigned long k1k2[[8]] = {
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86,
+  0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+  unsigned char buffer[[512]];
+  unsigned char *aligned = (unsigned char*)(((size_t)buffer + 64L) & 0xffffffffffc0L);
+  unsigned long val;
+  __m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+  __m128i a1, a2;
+  unsigned int crc = 0xffffffff;
+  y8 = _mm512_load_si512((__m512i *)aligned);
+  x0 = _mm512_loadu_si512((__m512i *)k1k2);
+  x1 = _mm512_loadu_si512((__m512i *)(buffer + 0x00));
+  x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+  x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+  x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+  a1 = _mm512_extracti32x4_epi32(x1, 3);
+  a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+  x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+  val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+  crc = (unsigned int)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+  return crc != 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1"
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+

Why is all this stuff needed inside a configure check? We don't need to check
entire algorithms to check if we can build and link specific instructions, no?

From a495124ee42cb8f9f206f719b9f2235aff715963 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v6 5/6] use __attribute__((target(...))) for AVX-512 stuff

Huh, so now we're undoing a bunch of stuff done earlier. Makes this series
pretty hard to review.

Greetings,

Andres Freund

#40Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#39)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Thu, Nov 07, 2024 at 11:05:14AM -0500, Andres Freund wrote:

On 2024-10-30 21:03:20 +0000, Devulapalli, Raghuveer wrote:

From a495124ee42cb8f9f206f719b9f2235aff715963 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v6 5/6] use __attribute__((target(...))) for AVX-512 stuff

Huh, so now we're undoing a bunch of stuff done earlier. Makes this series
pretty hard to review.

I'm planning to commit this one very soon (it's being tracked in a separate
thread [0]), so this patch series will need rebasing, anyway. I think we
should use __attribute__((target(...))) right away for $SUBJECT instead of
undoing stuff in later patches.

[0]: /messages/by-id/ZywlZzPcPnlqKvt5@nathan

--
nathan

#41Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#38)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Would you mind moving the function attribute change for the existing SSE
4.2 code to its own patch? I think that is pretty straightforward, and IMHO it'd be
nice to take care of it first so that we can focus on the new stuff.

Just submitted a separate patch for this. Will update the CRC32C patch once this is committed.

Raghuveer

#42Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Andres Freund (#39)
3 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

Needs to be integrated with the meson based build as well.

Done.

+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64			count	= PG_GETARG_INT64(0);
+	int64			num		= PG_GETARG_INT64(1);
+	pg_crc32c		crc		= 0xFFFFFFFF;
+	const char*		data	= malloc((size_t)num);

This is computing a crc of uninitialized data. That's
a) undefined behaviour
b) means the return value is basically random
c) often will just CRC a lot of zeroes

Good point. I added random data to the buffer before computing the crc value and verified that this didn't affect the benchmark numbers.
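
As a rough, standalone illustration of that change (not the code in the attached patch), a benchmark driver that fills its buffer from a seeded generator before hashing could look like the sketch below. All names here (next_random, fill_random, crc32c_update, drive_crc32c_sketch) are hypothetical, the generator is a plain xorshift64 rather than pg_prng, and the CRC is a slow bit-at-a-time reference instead of COMP_CRC32C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bit-at-a-time CRC-32C (reflected polynomial 0x82F63B78), reference only. */
static uint32_t
crc32c_update(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* xorshift64: small deterministic generator, stands in for pg_prng here. */
static uint64_t
next_random(uint64_t *state)
{
	uint64_t	x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/* Fill the buffer so that no byte is ever read uninitialized. */
static void
fill_random(unsigned char *buf, size_t len, uint64_t seed)
{
	uint64_t	state = seed ? seed : 1;	/* xorshift state must be nonzero */

	for (size_t i = 0; i < len; i++)
		buf[i] = (unsigned char) (next_random(&state) & 0xFF);
}

/* Returns 0 on success and -1 if the buffer could not be allocated. */
static int
drive_crc32c_sketch(long count, size_t num, uint32_t *result)
{
	unsigned char *data;
	uint32_t	crc = 0;

	assert(result != NULL);
	data = malloc(num ? num : 1);
	if (data == NULL)
		return -1;

	fill_random(data, num, 42);
	while (count-- > 0)
		crc = crc32c_update(0xFFFFFFFFu, data, num) ^ 0xFFFFFFFFu;

	free(data);
	*result = crc;
	return 0;
}
```

Because the seed is fixed, repeated runs hash identical input, so the returned value is reproducible and can double as a sanity check between implementations. The sketch masks with 0xFF so that all 256 byte values can occur; a modulo of 255 would never produce 0xFF.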

From da26645ec8515e0e6d91e2311a83c3bb6649017e Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH v6 2/6] Move all HW checks to common file.

Would be good to actually include a justification here.

Added a comment for this.

+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;

Shouldn't this be in some x86 specific ifdef?

The updated version has the #ifdef x86/x86_64 guard.
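
A minimal sketch of what such a guard can look like is shown below, assuming GCC or clang and their <cpuid.h> helper __get_cpuid_count(). The names leaf7_has_avx512_crc_bits, cpu_reports_avx512_crc and SKETCH_HAVE_CPUID are made up for illustration and are not the ones used in pg_hw_feat_check.c. The sketch only inspects CPUID leaf 7 and deliberately leaves out the OSXSAVE/XGETBV test for enabled ZMM state that the real check also performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#if defined(__GNUC__) || defined(__clang__)
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1		/* hypothetical macro, for this sketch only */
#endif
#endif

/* Named indexes for the four CPUID output registers. */
typedef enum
{
	EAX = 0,
	EBX = 1,
	ECX = 2,
	EDX = 3
} reg_name;

/*
 * Pure bit test on CPUID leaf 7 (subleaf 0) output, kept separate from the
 * CPUID call so it can be exercised on any platform.
 */
static bool
leaf7_has_avx512_crc_bits(const unsigned int exx[4])
{
	assert(exx != NULL);
	return (exx[EBX] & (1u << 16)) != 0 &&	/* avx512f */
		(exx[EBX] & (1u << 31)) != 0 &&	/* avx512vl */
		(exx[ECX] & (1u << 10)) != 0;	/* vpclmulqdq */
}

/* On non-x86 builds this compiles to a constant "false". */
static bool
cpu_reports_avx512_crc(void)
{
#ifdef SKETCH_HAVE_CPUID
	unsigned int exx[4] = {0, 0, 0, 0};

	if (!__get_cpuid_count(7, 0, &exx[EAX], &exx[EBX], &exx[ECX], &exx[EDX]))
		return false;
	return leaf7_has_avx512_crc_bits(exx);
#else
	return false;
#endif
}
```

Keeping the x86-only header and intrinsic call inside the guard means the file still builds on other architectures, where the availability function simply reports that the feature is absent.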

+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+

Why is all this stuff needed inside a configure check? We don't need to check
entire algorithms to check if we can build and link specific instructions, no?

Yup, this is unnecessary. I have modified the checks in meson and configure to keep just a couple of instructions, testing only for _mm512_clmulepi64_epi128 (vpclmulqdq) and _mm_xor_epi64 (avx512vl).

From a495124ee42cb8f9f206f719b9f2235aff715963 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v6 5/6] use __attribute__((target(...))) for AVX-512 stuff

Huh, so now we're undoing a bunch of stuff done earlier. Makes this series pretty
hard to review.

As Nathan suggested, we moved this to a separate thread. The latest set of patches here needs to be applied on top of the patches in that thread.

Raghuveer

Attachments:

v7-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch (application/octet-stream)
From 590567435446e0c4a82a3ccc5169a7acd0cd6f03 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v7 1/3] Add a Postgres SQL function for crc32c benchmarking.

Add a drive_crc32c() function to use for benchmarking crc32c
computation. The function takes 2 arguments:

(1) count: num of times CRC32C is computed in a loop.
(2) num: #bytes in the buffer to calculate crc over.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_crc32c/Makefile         | 20 ++++++++
 src/test/modules/test_crc32c/meson.build      | 22 +++++++++
 .../modules/test_crc32c/test_crc32c--1.0.sql  |  1 +
 src/test/modules/test_crc32c/test_crc32c.c    | 47 +++++++++++++++++++
 .../modules/test_crc32c/test_crc32c.control   |  4 ++
 6 files changed, 95 insertions(+)
 create mode 100644 src/test/modules/test_crc32c/Makefile
 create mode 100644 src/test/modules/test_crc32c/meson.build
 create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c829b61953..68d8904dd0 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -15,6 +15,7 @@ subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
+subdir('test_crc32c')
 subdir('test_ddl_deparse')
 subdir('test_dsa')
 subdir('test_dsm_registry')
diff --git a/src/test/modules/test_crc32c/Makefile b/src/test/modules/test_crc32c/Makefile
new file mode 100644
index 0000000000..5b747c6184
--- /dev/null
+++ b/src/test/modules/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o:	CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_crc32c/meson.build b/src/test/modules/test_crc32c/meson.build
new file mode 100644
index 0000000000..7021a6d6cf
--- /dev/null
+++ b/src/test/modules/test_crc32c/meson.build
@@ -0,0 +1,22 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_crc32c_sources = files(
+  'test_crc32c.c',
+)
+
+if host_system == 'windows'
+  test_crc32c_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_crc32c',
+    '--FILEDESC', 'test_crc32c - test code for crc32c library',])
+endif
+
+test_crc32c = shared_module('test_crc32c',
+  test_crc32c_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_crc32c
+
+test_install_data += files(
+  'test_crc32c.control',
+  'test_crc32c--1.0.sql',
+)
diff --git a/src/test/modules/test_crc32c/test_crc32c--1.0.sql b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 0000000000..32f8f0fb2e
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c  (count int, num int) RETURNS bigint AS 'test_crc32c.so' LANGUAGE C;
diff --git a/src/test/modules/test_crc32c/test_crc32c.c b/src/test/modules/test_crc32c/test_crc32c.c
new file mode 100644
index 0000000000..b350caf5ce
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.c
@@ -0,0 +1,47 @@
+/* select drive_crc32c(1000000, 1024); */
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "common/pg_prng.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64			count	= PG_GETARG_INT64(0);
+	int64			num		= PG_GETARG_INT64(1);
+	char*		data	= malloc((size_t)num);
+	pg_crc32c crc;
+	pg_prng_state state;
+	uint64 seed = 42;
+	pg_prng_seed(&state, seed);
+	/* set random data */
+	for (uint64 i = 0; i < num; i++)
+	{
+		data[i] = pg_prng_uint32(&state) % 255;
+	}
+
+	INIT_CRC32C(crc);
+
+	while(count--)
+	{
+		INIT_CRC32C(crc);
+		COMP_CRC32C(crc, data, num);
+		FIN_CRC32C(crc);
+	}
+
+	free((void *)data);
+
+	PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/src/test/modules/test_crc32c/test_crc32c.control b/src/test/modules/test_crc32c/test_crc32c.control
new file mode 100644
index 0000000000..878a077ee1
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_crc32c'
+relocatable = true
-- 
2.43.0

v7-0002-Refactor-consolidate-x86-ISA-and-OS-runtime-check.patch (application/octet-stream)
From a29c7f0c33970547be505a6fe1846806c1ccaf9f Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH v7 2/3] Refactor: consolidate x86 ISA and OS runtime checks

Move all x86 ISA and OS runtime checks into a single file for improved
modularity and easier future maintenance.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/include/port/pg_bitutils.h      |   1 -
 src/include/port/pg_hw_feat_check.h |  33 ++++++
 src/port/Makefile                   |   1 +
 src/port/meson.build                |   1 +
 src/port/pg_bitutils.c              |  22 +---
 src/port/pg_crc32c_sse42_choose.c   |  29 +----
 src/port/pg_hw_feat_check.c         | 163 ++++++++++++++++++++++++++++
 src/port/pg_popcount_avx512.c       |  78 -------------
 8 files changed, 202 insertions(+), 126 deletions(-)
 create mode 100644 src/include/port/pg_hw_feat_check.h
 create mode 100644 src/port/pg_hw_feat_check.c

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 4d88478c9c..263f27930d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -312,7 +312,6 @@ extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int
  * files.
  */
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
new file mode 100644
index 0000000000..58be900b54
--- /dev/null
+++ b/src/include/port/pg_hw_feat_check.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.h
+ *	  Miscellaneous functions for checking for hardware features at runtime.
+ *
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_hw_feat_check.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_HW_FEAT_CHECK_H
+#define PG_HW_FEAT_CHECK_H
+
+/*
+ * Test to see if all hardware features required by SSE 4.2 crc32c (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_sse42_available(void);
+
+/*
+ * Test to see if all hardware features required by the POPCNT instruction
+ * (64 bit) are available.
+ */
+extern PGDLLIMPORT bool pg_popcount_available(void);
+
+/*
+ * Test to see if all hardware features required by AVX-512 POPCNT are
+ * available.
+ */
+extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+#endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..6088b56b71 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_popcount_avx512.o \
+	pg_hw_feat_check.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 37d12cbd8f..9275ae1239 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -10,6 +10,7 @@ pgport_sources = [
   'pg_popcount_avx512.c',
   'pg_crc32c_sse42_choose.c',
   'pg_crc32c_sse42.c',
+  'pg_hw_feat_check.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 87f56e82b8..b2823d5732 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -20,7 +20,7 @@
 #endif
 
 #include "port/pg_bitutils.h"
-
+#include "port/pg_hw_feat_check.h"
 
 /*
  * Array giving the position of the left-most set bit for each possible
@@ -109,7 +109,6 @@ static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
@@ -127,25 +126,6 @@ uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask)
 
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
-
 /*
  * These functions get called on the first call to pg_popcount32 etc.
  * They detect whether we can use the asm implementations, and replace
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 50ae82b312..84f82053ff 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -23,31 +23,8 @@
 
 #if defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
 
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
 #include "port/pg_crc32c.h"
-
-static bool
-pg_crc32c_sse42_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
-}
+#include "port/pg_hw_feat_check.h"
 
 /*
  * This gets called on the first call. It replaces the function pointer
@@ -64,6 +41,6 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	return pg_comp_crc32c(crc, data, len);
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
 
-#endif // USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
+#endif
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
new file mode 100644
index 0000000000..260aa60502
--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,163 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;
+
+/*
+ * Helper function.
+ * Test for a bit being set in an exx_t register.
+ */
+inline static bool is_bit_set_in_exx(exx_t* regs, reg_name ex, int bit)
+{
+	return ((regs[ex] & (1U << bit)) != 0);
+}
+
+/*
+ * x86_64 Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid((int *) exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * x86_64 Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex((int *) exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline static bool
+osxsave_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 27); /* osxsave */
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
+inline static bool
+zmm_regs_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+inline static bool
+avx512_popcnt_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 14) && is_bit_set_in_exx(exx, EBX, 30);
+}
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+bool PGDLLIMPORT pg_popcount_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 23);
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ *
+ * PA: The call to 'osxsave_available' MUST precede the call to
+ *     'zmm_regs_available' function per NB above.
+ */
+bool PGDLLIMPORT pg_popcount_avx512_available(void)
+{
+	return osxsave_available() &&
+			zmm_regs_available() &&
+			avx512_popcnt_available();
+}
+
+/*
+ * Does CPUID say there's support for SSE 4.2?
+ */
+bool PGDLLIMPORT pg_crc32c_sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 20);
+}
+
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index c8a4f2b19f..1123a1a634 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -14,16 +14,7 @@
 
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
 #include <immintrin.h>
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
 #include "port/pg_bitutils.h"
 
 /*
@@ -33,75 +24,6 @@
  */
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-#ifdef HAVE_XSAVE_INTRINSICS
-pg_attribute_target("xsave")
-#endif
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
 /*
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
-- 
2.43.0
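
The idea behind the consolidated checks (one CPUID wrapper, named registers, one bit-test helper) can be tried in isolation with the sketch below. It is an illustration under stated assumptions, not the patch itself: it assumes GCC or Clang on x86 with <cpuid.h> providing __get_cpuid_count(), it reports "not available" on any other target, and all sketch_* and SK_* names are hypothetical. The bit positions (leaf 1, ECX bit 20 for SSE 4.2 and ECX bit 23 for POPCNT) are the ones the patch tests.

```c
#include <stdbool.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
	(defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

/* Named register slots, so callers never index with bare numbers. */
typedef enum
{
	SK_EAX = 0,
	SK_EBX = 1,
	SK_ECX = 2,
	SK_EDX = 3
} sketch_reg;

/* Unsigned shift keeps bit 31 well defined. */
static bool
sketch_bit_set(const unsigned int *regs, sketch_reg reg, unsigned int bit)
{
	return (regs[reg] & (1U << bit)) != 0;
}

/*
 * Query CPUID for (leaf, subleaf). Returns false when the leaf is not
 * supported or when this build has no CPUID access (assumption: such
 * targets should simply report every feature as missing).
 */
static bool
sketch_cpuid(unsigned int leaf, unsigned int subleaf, unsigned int *regs)
{
#ifdef SKETCH_HAVE_CPUID
	return __get_cpuid_count(leaf, subleaf,
							 &regs[SK_EAX], &regs[SK_EBX],
							 &regs[SK_ECX], &regs[SK_EDX]) != 0;
#else
	(void) leaf;
	(void) subleaf;
	(void) regs;
	return false;
#endif
}

/* CPUID leaf 1, ECX bit 20: SSE 4.2 */
static bool
sketch_has_sse42(void)
{
	unsigned int regs[4] = {0, 0, 0, 0};

	return sketch_cpuid(1, 0, regs) && sketch_bit_set(regs, SK_ECX, 20);
}

/* CPUID leaf 1, ECX bit 23: POPCNT */
static bool
sketch_has_popcnt(void)
{
	unsigned int regs[4] = {0, 0, 0, 0};

	return sketch_cpuid(1, 0, regs) && sketch_bit_set(regs, SK_ECX, 23);
}
```

Passing the requested leaf and subleaf through a single wrapper is the point of the design: every feature test then differs only in which register and bit it names.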

Attachment: v7-0003-Add-AVX-512-CRC32C-algorithm-with-a-runtime-check.patch (application/octet-stream)
From a28b2e5da6a6530e8b82e8b8ba72d289969221ef Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Date: Thu, 21 Nov 2024 12:42:09 -0800
Subject: [PATCH v7 3/3] Add AVX-512 CRC32C algorithm with a runtime check

Adds pg_crc32c_avx512(): compute the crc32c of the buffer, where the
buffer length must be at least 256, and a multiple of 64. Based on:

"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
Instruction" V. Gopal, E. Ozturk, et al., 2009

Benchmark numbers to compare against the SSE4.2 CRC32C algorithm were
generated by using the drive_crc32c() function added in
src/test/modules/test_crc32c/test_crc32c.c.

+------------------+-----------------+----------------+-----------------+-------+------+
| Rate in bytes/us |    SDP (SPR)    |      m6i       |       m7i       |       |      |
+------------------+--------+--------+-------+--------+--------+--------+ Multi-|      |
| higher is better | SSE42  | AVX512 | SSE42 | AVX512 | SSE42  | AVX512 | plier |  %   |
+==================+========+========+=======+========+========+========+=======+======+
| AVG Rate 64-8192 | 10,095 | 82,101 | 8,591 | 38,652 | 11,867 | 83,194 | 6.68  | 568% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+
| AVG Rate 64-255  |  9,034 |  9,136 | 7,619 |  7,437 |  9,030 |  9,293 | 1.01  |   1% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+

Co-authored-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4                |  31 +++++
 configure                           | 152 ++++++++++++---------
 configure.ac                        | 107 +++++++--------
 meson.build                         |  22 +++
 src/include/pg_config.h.in          |   3 +
 src/include/pg_cpu.h                |  23 ++++
 src/include/port/pg_crc32c.h        |  55 +++-----
 src/include/port/pg_hw_feat_check.h |   6 +
 src/port/meson.build                |   7 +-
 src/port/pg_crc32c_avx512.c         | 202 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_sse42_choose.c   |  46 -------
 src/port/pg_crc32c_x86_choose.c     |  57 ++++++++
 src/port/pg_hw_feat_check.c         |  75 ++++++++++-
 13 files changed, 571 insertions(+), 215 deletions(-)
 create mode 100644 src/include/pg_cpu.h
 create mode 100644 src/port/pg_crc32c_avx512.c
 delete mode 100644 src/port/pg_crc32c_sse42_choose.c
 create mode 100644 src/port/pg_crc32c_x86_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 7c1a37e639..b717a51b3d 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -631,6 +631,37 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
+
+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the AVX-512 VPCLMULQDQ and AVX-512VL
+# intrinsics used for CRC-32C, with function __attribute__((target("..."))):
+
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128 with function attribute], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("avx512vl,vpclmulqdq")))
+    #endif
+    static int crc32_avx512_test(void)
+    {
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      return _mm_extract_epi32(a1, 0x0);
+    }],
+  [return crc32_avx512_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
+
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
 # Check if the compiler supports the CRC32C instructions using the __crc32cb,
diff --git a/configure b/configure
index e050a7dfc5..ac2541aa1b 100755
--- a/configure
+++ b/configure
@@ -17373,7 +17373,7 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+# Check for Intel SSE 4.2 and AVX-512 intrinsics to do CRC calculations.
 #
 # Check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
 # with the __attribute__((target("sse4.2"))).
@@ -17419,6 +17419,50 @@ if test x"$pgac_cv_sse42_crc32_intrinsics" = x"yes"; then
 fi
 
 
+# Check if the _mm512_clmulepi64_epi128 and _mm_xor_epi64 can be used with
+# the __attribute__((target("avx512vl,vpclmulqdq"))).
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128 with function attribute" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128 with function attribute... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("avx512vl,vpclmulqdq")))
+    #endif
+    static int crc32_avx512_test(void)
+    {
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      return _mm_extract_epi32(a1, 0x0);
+    }
+int
+main ()
+{
+return crc32_avx512_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics=yes
+else
+  pgac_cv_avx512_crc32_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics" = x"yes"; then
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17620,9 +17664,8 @@ fi
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
 # use the special CRC instructions for calculating CRC-32C. If we're not
 # targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, perhaps with some extra CFLAGS, compile both
-# implementations and select which one to use at runtime, depending on whether
-# SSE 4.2 is supported by the processor we're running on.
+# the SSE/AVX-512 intrinsics, compile both implementations and select which one
+# to use at runtime, depending on runtime cpuid information.
 #
 # Similarly, if we are targeting an ARM processor that has the CRC
 # instructions that are part of the ARMv8 CRC Extension, use them. And if
@@ -17634,95 +17677,80 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
-  else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
-    else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
-      else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
-          else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
-          fi
-        fi
-      fi
-    fi
-  fi
-fi
 
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking which CRC-32C implementation to use" >&5
 $as_echo_n "checking which CRC-32C implementation to use... " >&6; }
-if test x"$USE_SSE42_CRC32C" = x"1"; then
+if test x"$host_cpu" = x"x86_64"; then
+    #x86 only:
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
 
 $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
 
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
-$as_echo "SSE 4.2" >&6; }
-else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C baseline feature SSE 4.2" >&5
+$as_echo "CRC32C baseline feature SSE 4.2" >&6; }
+    else
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
-$as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+          PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C SSE42 with runtime check" >&5
+$as_echo "CRC32C SSE42 with runtime check" >&6; }
+        fi
+    fi
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+      PG_CRC32C_OBJS+=" pg_crc32c_avx512.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C AVX-512 with runtime check" >&5
+$as_echo "CRC32C AVX-512 with runtime check" >&6; }
+    fi
+else
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  else
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+    else
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+      else
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
-        fi
       fi
     fi
   fi
 fi
 
 
-
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
   if test x"$PREFERRED_SEMAPHORES" = x"NAMED_POSIX" ; then
diff --git a/configure.ac b/configure.ac
index 91ff29fb8a..db36d20496 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2067,12 +2067,16 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+# Check for Intel SSE 4.2 and AVX-512 intrinsics to do CRC calculations.
 #
 # Check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
 # with the __attribute__((target("sse4.2"))).
 PGAC_SSE42_CRC32_INTRINSICS([])
 
+# Check if the _mm512_clmulepi64_epi128 and _mm_xor_epi64 can be used with
+# the __attribute__((target("avx512vl,vpclmulqdq"))).
+PGAC_AVX512_CRC32_INTRINSICS([])
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2108,9 +2112,8 @@ AC_SUBST(CFLAGS_CRC)
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
 # use the special CRC instructions for calculating CRC-32C. If we're not
 # targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, perhaps with some extra CFLAGS, compile both
-# implementations and select which one to use at runtime, depending on whether
-# SSE 4.2 is supported by the processor we're running on.
+# the SSE/AVX-512 intrinsics, compile both implementations and select which one
+# to use at runtime, depending on runtime cpuid information.
 #
 # Similarly, if we are targeting an ARM processor that has the CRC
 # instructions that are part of the ARMv8 CRC Extension, use them. And if
@@ -2122,76 +2125,58 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
-  else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+
+AC_MSG_CHECKING([which CRC-32C implementation to use])
+if test x"$host_cpu" = x"x86_64"; then
+    #x86 only:
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
+      PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+      AC_MSG_RESULT(CRC32C baseline feature SSE 4.2)
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
-      else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
-          else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
-          fi
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+          PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+          AC_MSG_RESULT(CRC32C SSE42 with runtime check)
         fi
-      fi
     fi
-  fi
-fi
-
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
-AC_MSG_CHECKING([which CRC-32C implementation to use])
-if test x"$USE_SSE42_CRC32C" = x"1"; then
-  AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  AC_MSG_RESULT(SSE 4.2)
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+      AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX-512 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS+=" pg_crc32c_avx512.o"
+      AC_MSG_RESULT(CRC32C AVX-512 with runtime check)
+    fi
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+    AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    AC_MSG_RESULT(ARMv8 CRC instructions)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+      AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+        AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        AC_MSG_RESULT(LoongArch CRCC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
-        else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
-        fi
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
+        AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        AC_MSG_RESULT(slicing-by-8)
       fi
     fi
   fi
 fi
 AC_SUBST(PG_CRC32C_OBJS)
 
-
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
   if test x"$PREFERRED_SEMAPHORES" = x"NAMED_POSIX" ; then
diff --git a/meson.build b/meson.build
index bd247fbabf..479ecc6647 100644
--- a/meson.build
+++ b/meson.build
@@ -2231,6 +2231,22 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     have_optimized_crc = true
   else
 
+    avx512_crc_prog = '''
+#include <immintrin.h>
+
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("avx512vl,vpclmulqdq")))
+#endif
+int main(void)
+{
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      return _mm_extract_epi32(a1, 0x0);
+}
+'''
+
     sse42_crc_prog = '''
 #include <nmmintrin.h>
 #if defined(__has_attribute) && __has_attribute (target)
@@ -2260,6 +2276,12 @@ int main(void)
       cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
       have_optimized_crc = true
     endif
+    if cc.links(avx512_crc_prog,
+        name: 'AVX512 CRC32C with function attributes',
+        args: test_c_args)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
 
   endif
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 40e4b2e381..6cc21c7942 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -715,6 +715,9 @@
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use Intel AVX-512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to build with systemd support. (--with-systemd) */
 #undef USE_SYSTEMD
 
diff --git a/src/include/pg_cpu.h b/src/include/pg_cpu.h
new file mode 100644
index 0000000000..223994cb0d
--- /dev/null
+++ b/src/include/pg_cpu.h
@@ -0,0 +1,23 @@
+/*
+ * pg_cpu.h
+ *      Useful macros to determine CPU types
+ */
+
+#ifndef PG_CPU_H_
+#define PG_CPU_H_
+#if defined( __i386__ ) || defined(i386) || defined(_M_IX86)
+    /*
+     * __i386__ is defined by gcc and Intel compiler on Linux,
+     * _M_IX86 by VS compiler,
+     * i386 by Sun compilers on opensolaris at least
+     */
+    #define PG_CPU_X86
+#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64)
+    /*
+     * both __x86_64__ and __amd64__ are defined by gcc
+     * __x86_64 defined by sun compiler on opensolaris at least
+     * _M_AMD64 defined by MS compiler
+     */
+    #define PG_CPU_x86_64
+#endif
+#endif // PG_CPU_H_
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..690273506b 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -34,58 +34,43 @@
 #define PG_CRC32C_H
 
 #include "port/pg_bswap.h"
+#include "pg_cpu.h"
 
 typedef uint32 pg_crc32c;
 
 /* The INIT and EQ macros are the same for all implementations. */
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
-
-#if defined(USE_SSE42_CRC32C)
-/* Use Intel SSE4.2 instructions. */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
 #define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
+/* x86 */
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* ARMV8 */
 #elif defined(USE_ARMV8_CRC32C)
-/* Use ARMv8 CRC Extension instructions. */
-
+extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
+/* ARMV8 with runtime check */
+#elif defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* LoongArch */
 #elif defined(USE_LOONGARCH_CRC32C)
-/* Use LoongArch CRCC instructions. */
-
+extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
-
-/*
- * Use Intel SSE 4.2 or ARMv8 instructions, but perform a runtime check first
- * to check that they are available.
- */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
-
-#ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-#endif
-#ifdef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
-#endif
 
 #else
 /*
@@ -98,13 +83,11 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
 #ifdef WORDS_BIGENDIAN
+#undef FIN_CRC32C
 #define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
-#else
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 #endif
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-
 #endif
 
 #endif							/* PG_CRC32C_H */
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
index 58be900b54..3a73014987 100644
--- a/src/include/port/pg_hw_feat_check.h
+++ b/src/include/port/pg_hw_feat_check.h
@@ -30,4 +30,10 @@ extern PGDLLIMPORT bool pg_popcount_available(void);
  * available.
  */
 extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+
+/*
+ * Test to see if all hardware features required by the AVX-512 SIMD
+ * algorithm are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_avx512_available(void);
 #endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/meson.build b/src/port/meson.build
index 9275ae1239..0ba4a56194 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,8 +8,10 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
-  'pg_crc32c_sse42_choose.c',
+  'pg_crc32c_x86_choose.c',
+  'pg_crc32c_avx512.c',
   'pg_crc32c_sse42.c',
+  'pg_crc32c_sb8.c',
   'pg_hw_feat_check.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
@@ -83,9 +85,6 @@ endif
 # Replacement functionality to be built if corresponding configure symbol
 # is true
 replace_funcs_pos = [
-  # x86/x64
-  ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..d8247e2e33
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,202 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+
+/*******************************************************************
+ * pg_comp_crc32c_avx512(): compute the crc32c of the buffer. Buffers of
+ * at least 256 bytes use the AVX-512 path; any shorter input or tail is
+ * handled by pg_comp_crc32c_sse42(). Based on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009
+ *
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#if defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+pg_attribute_no_sanitize_alignment()
+pg_attribute_target("avx512vl,vpclmulqdq")
+inline pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+
+		/*
+		* There's at least one block of 256.
+		*/
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_load_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		* Parallel fold blocks of 256, if any.
+		*/
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_load_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_load_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes with the SSE 4.2 algorithm.
+	 */
+	return pg_comp_crc32c_sse42(crc, input, length);
+}
+#endif // AVX512_CRC32
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
deleted file mode 100644
index 84f82053ff..0000000000
--- a/src/port/pg_crc32c_sse42_choose.c
+++ /dev/null
@@ -1,46 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_crc32c_sse42_choose.c
- *	  Choose between Intel SSE 4.2 and software CRC-32C implementation.
- *
- * On first call, checks if the CPU we're running on supports Intel SSE
- * 4.2. If it does, use the special SSE instructions for CRC-32C
- * computation. Otherwise, fall back to the pure software implementation
- * (slicing-by-8).
- *
- * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/port/pg_crc32c_sse42_choose.c
- *
- *-------------------------------------------------------------------------
- */
-
-
-#include "c.h"
-
-#if defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
-
-#include "port/pg_crc32c.h"
-#include "port/pg_hw_feat_check.h"
-
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static pg_crc32c
-pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
-{
-	if (pg_crc32c_sse42_available())
-		pg_comp_crc32c = pg_comp_crc32c_sse42;
-	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
-	return pg_comp_crc32c(crc, data, len);
-}
-
-pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
-
-#endif
diff --git a/src/port/pg_crc32c_x86_choose.c b/src/port/pg_crc32c_x86_choose.c
new file mode 100644
index 0000000000..3ce8be11a6
--- /dev/null
+++ b/src/port/pg_crc32c_x86_choose.c
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_x86_choose.c
+ *	  Choose between Intel AVX-512, SSE 4.2 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel AVX-512. If
+ * it does, use the AVX-512 implementation; otherwise try the SSE 4.2 CRC-32C
+ * instructions, and finally fall back to pure software (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_x86_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "pg_cpu.h"
+
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+
+#include "port/pg_crc32c.h"
+#include "port/pg_hw_feat_check.h"
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ * (1) set pg_comp_crc32c pointer and (2) return the computed crc value
+ */
+static pg_crc32c
+pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
+{
+#ifdef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+	if (pg_crc32c_avx512_available()) {
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+                return pg_comp_crc32c(crc, data, len);
+        }
+#endif
+#ifdef USE_SSE42_CRC32C
+        pg_comp_crc32c = pg_comp_crc32c_sse42;
+        return pg_comp_crc32c(crc, data, len);
+#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
+        if (pg_crc32c_sse42_available()) {
+                pg_comp_crc32c = pg_comp_crc32c_sse42;
+                return pg_comp_crc32c(crc, data, len);
+        }
+#endif
+        pg_comp_crc32c = pg_comp_crc32c_sb8;
+        return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+
+#endif // x86/x86_64
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 260aa60502..b2872fa708 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -11,6 +11,9 @@
  *-------------------------------------------------------------------------
  */
 #include "c.h"
+#include "pg_cpu.h"
+
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
 
 #if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
 #include <cpuid.h>
@@ -135,9 +138,60 @@ bool PGDLLIMPORT pg_popcount_available(void)
 	return is_bit_set_in_exx(exx, ECX, 23);
  }
 
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline static bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline static bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, ECX, 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline static bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 31); /* avx512-vl */
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline static bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set_in_exx(exx, ECX, 20); /* sse4.2 */
+}
+
+/****************************************************************************/
+/*                               Public API                                 */
+/****************************************************************************/
  /*
-  * Returns true if the CPU supports the instructions required for the AVX-512
-  * pg_popcount() implementation.
+  * Returns true if the CPU supports the instructions required for the
+  * AVX-512 pg_popcount() implementation.
   *
   * PA: The call to 'osxsave_available' MUST preceed the call to
   *     'zmm_regs_available' function per NB above.
@@ -154,10 +208,19 @@ bool PGDLLIMPORT pg_popcount_avx512_available(void)
  */
 bool PGDLLIMPORT pg_crc32c_sse42_available(void)
 {
-	exx_t exx[4] = {0, 0, 0, 0};
-
-	pg_getcpuid(1, exx);
+	return sse42_available();
+}
 
-	return is_bit_set_in_exx(exx, ECX, 20);
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+bool PGDLLIMPORT
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
 }
 
+#endif // #if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
-- 
2.43.0
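
For readers following the selection logic in pg_crc32c_x86_choose.c above, here is a minimal standalone sketch of the same "choose on first call" pattern. It is an illustration only, not part of the patch: every demo_* name is hypothetical, the feature flags are plain variables rather than real CPUID/XGETBV probes, and the three implementations are placeholders that do not compute a CRC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t demo_crc;
typedef demo_crc (*demo_crc_fn) (demo_crc crc, const void *data, size_t len);

/* Hypothetical feature flags; the patch derives these from CPUID instead. */
bool		demo_have_avx512 = false;
bool		demo_have_sse42 = true;

/* Number of times the chooser has run; it should never exceed 1. */
int			demo_choose_calls = 0;

/* Placeholder bodies: a real routine would fold "data" into "crc". */
demo_crc
demo_crc_avx512(demo_crc crc, const void *data, size_t len)
{
	(void) data;
	(void) len;
	return crc;
}

demo_crc
demo_crc_sse42(demo_crc crc, const void *data, size_t len)
{
	(void) data;
	(void) len;
	return crc;
}

demo_crc
demo_crc_sb8(demo_crc crc, const void *data, size_t len)
{
	(void) data;
	(void) len;
	return crc;
}

demo_crc	demo_crc_choose(demo_crc crc, const void *data, size_t len);

/* The pointer starts at the chooser, like pg_comp_crc32c in the patch. */
demo_crc_fn demo_crc_impl = demo_crc_choose;

/*
 * Runs on the first call only: picks the best available implementation,
 * stores it in the function pointer, then forwards the call to it.
 */
demo_crc
demo_crc_choose(demo_crc crc, const void *data, size_t len)
{
	demo_choose_calls++;

	if (demo_have_avx512)
		demo_crc_impl = demo_crc_avx512;
	else if (demo_have_sse42)
		demo_crc_impl = demo_crc_sse42;
	else
		demo_crc_impl = demo_crc_sb8;

	return demo_crc_impl(crc, data, len);
}
```

After the first call through demo_crc_impl the pointer refers directly to the selected routine, so later calls pay only for an indirect call and never repeat the feature checks.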

#43Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#42)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Mon, Nov 25, 2024 at 08:54:48PM +0000, Devulapalli, Raghuveer wrote:

As Nathan suggested, we moved this to a separate thread. The latest set
of patches here needs to be applied on top of the patches in that thread.

Raghuveer, would you mind rebasing this patch set now that the SSE4.2 patch
is committed?

--
nathan

#44Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#43)
3 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Raghuveer, would you mind rebasing this patch set now that the SSE4.2 patch is
committed?

Rebased to master branch.

Raghuveer

Attachments:

v8-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch (application/octet-stream)
From 377143412131e8c5bdd3c6a327e45ed8f09b4c81 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v8 1/3] Add a Postgres SQL function for crc32c benchmarking.

Add a drive_crc32c() function to use for benchmarking crc32c
computation. The function takes 2 arguments:

(1) count: number of times CRC32C is computed in a loop.
(2) num: number of bytes in the buffer to calculate the CRC over.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_crc32c/Makefile         | 20 ++++++++
 src/test/modules/test_crc32c/meson.build      | 22 +++++++++
 .../modules/test_crc32c/test_crc32c--1.0.sql  |  1 +
 src/test/modules/test_crc32c/test_crc32c.c    | 47 +++++++++++++++++++
 .../modules/test_crc32c/test_crc32c.control   |  4 ++
 6 files changed, 95 insertions(+)
 create mode 100644 src/test/modules/test_crc32c/Makefile
 create mode 100644 src/test/modules/test_crc32c/meson.build
 create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c829b61953..68d8904dd0 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -15,6 +15,7 @@ subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
+subdir('test_crc32c')
 subdir('test_ddl_deparse')
 subdir('test_dsa')
 subdir('test_dsm_registry')
diff --git a/src/test/modules/test_crc32c/Makefile b/src/test/modules/test_crc32c/Makefile
new file mode 100644
index 0000000000..5b747c6184
--- /dev/null
+++ b/src/test/modules/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o:	CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_crc32c/meson.build b/src/test/modules/test_crc32c/meson.build
new file mode 100644
index 0000000000..7021a6d6cf
--- /dev/null
+++ b/src/test/modules/test_crc32c/meson.build
@@ -0,0 +1,22 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_crc32c_sources = files(
+  'test_crc32c.c',
+)
+
+if host_system == 'windows'
+  test_crc32c_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_crc32c',
+    '--FILEDESC', 'test_crc32c - test code for crc32c library',])
+endif
+
+test_crc32c = shared_module('test_crc32c',
+  test_crc32c_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_crc32c
+
+test_install_data += files(
+  'test_crc32c.control',
+  'test_crc32c--1.0.sql',
+)
diff --git a/src/test/modules/test_crc32c/test_crc32c--1.0.sql b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 0000000000..32f8f0fb2e
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c  (count int, num int) RETURNS bigint AS 'test_crc32c.so' LANGUAGE C;
diff --git a/src/test/modules/test_crc32c/test_crc32c.c b/src/test/modules/test_crc32c/test_crc32c.c
new file mode 100644
index 0000000000..b350caf5ce
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.c
@@ -0,0 +1,47 @@
+/* select drive_crc32c(1000000, 1024); */
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "common/pg_prng.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64			count	= PG_GETARG_INT32(0);
+	int64			num		= PG_GETARG_INT32(1);
+	char*		data	= malloc((size_t)num);
+	pg_crc32c crc;
+	pg_prng_state state;
+	uint64 seed = 42;
+	pg_prng_seed(&state, seed);
+	/* set random data */
+	for (uint64 i = 0; i < num; i++)
+	{
+		data[i] = pg_prng_uint32(&state) % 255;
+	}
+
+	INIT_CRC32C(crc);
+
+	while(count--)
+	{
+		INIT_CRC32C(crc);
+		COMP_CRC32C(crc, data, num);
+		FIN_CRC32C(crc);
+	}
+
+	free((void *)data);
+
+	PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/src/test/modules/test_crc32c/test_crc32c.control b/src/test/modules/test_crc32c/test_crc32c.control
new file mode 100644
index 0000000000..878a077ee1
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_crc32c'
+relocatable = true
-- 
2.43.0
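
When comparing builds with the benchmark function above, it can help to have a slow reference to check an optimized routine against. The following is a minimal sketch, assuming the same conventions as the INIT_CRC32C and FIN_CRC32C macros on little-endian builds (start value 0xFFFFFFFF, final XOR with 0xFFFFFFFF). The name ref_crc32c is hypothetical and the code is not part of the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bit-at-a-time CRC-32C (Castagnoli), using the reflected polynomial
 * 0x82F63B78. Far too slow for production use; intended only as a
 * reference for checking faster implementations.
 */
uint32_t
ref_crc32c(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFFu;	/* same start value as INIT_CRC32C */

	while (len--)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
	}

	return crc ^ 0xFFFFFFFFu;		/* same final XOR as FIN_CRC32C */
}
```

The standard check value for CRC-32C is 0xE3069283 for the nine ASCII bytes "123456789", so ref_crc32c("123456789", 9) should return that value.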

v8-0002-Refactor-consolidate-x86-ISA-and-OS-runtime-check.patch (application/octet-stream)
From 55a1c85ff3747036a1dd3d84b01c9d73fbae8765 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH v8 2/3] Refactor: consolidate x86 ISA and OS runtime checks

Move all x86 ISA and OS runtime checks into a single file for improved
modularity and easier future maintenance.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/include/port/pg_bitutils.h      |   1 -
 src/include/port/pg_hw_feat_check.h |  33 ++++++
 src/port/Makefile                   |   1 +
 src/port/meson.build                |   3 +
 src/port/pg_bitutils.c              |  22 +---
 src/port/pg_crc32c_sse42_choose.c   |  21 +---
 src/port/pg_hw_feat_check.c         | 163 ++++++++++++++++++++++++++++
 src/port/pg_popcount_avx512.c       |  78 -------------
 8 files changed, 205 insertions(+), 117 deletions(-)
 create mode 100644 src/include/port/pg_hw_feat_check.h
 create mode 100644 src/port/pg_hw_feat_check.c

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 4d88478c9c..263f27930d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -312,7 +312,6 @@ extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int
  * files.
  */
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
new file mode 100644
index 0000000000..58be900b54
--- /dev/null
+++ b/src/include/port/pg_hw_feat_check.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.h
+ *	  Miscellaneous functions for checking for hardware features at runtime.
+ *
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_hw_feat_check.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_HW_FEAT_CHECK_H
+#define PG_HW_FEAT_CHECK_H
+
+/*
+ * Test to see if all hardware features required by SSE 4.2 crc32c (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_sse42_available(void);
+
+/*
+ * Test to see if all hardware features required by the POPCNT instruction
+ * (64 bit) are available.
+ */
+extern PGDLLIMPORT bool pg_popcount_available(void);
+
+/*
+ * Test to see if all hardware features required by AVX-512 POPCNT are
+ * available.
+ */
+extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+#endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..6088b56b71 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_popcount_avx512.o \
+	pg_hw_feat_check.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index c5bceed9cd..ec28590473 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,9 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
+  'pg_crc32c_sse42_choose.c',
+  'pg_crc32c_sse42.c',
+  'pg_hw_feat_check.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 87f56e82b8..b2823d5732 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -20,7 +20,7 @@
 #endif
 
 #include "port/pg_bitutils.h"
-
+#include "port/pg_hw_feat_check.h"
 
 /*
  * Array giving the position of the left-most set bit for each possible
@@ -109,7 +109,6 @@ static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
@@ -127,25 +126,6 @@ uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask)
 
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
-
 /*
  * These functions get called on the first call to pg_popcount32 etc.
  * They detect whether we can use the asm implementations, and replace
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 56d600f3a9..c659917af0 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,6 +20,7 @@
 
 #include "c.h"
 
+#if defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
 #ifdef HAVE__GET_CPUID
 #include <cpuid.h>
 #endif
@@ -29,22 +30,7 @@
 #endif
 
 #include "port/pg_crc32c.h"
-
-static bool
-pg_crc32c_sse42_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
-}
+#include "port/pg_hw_feat_check.h"
 
 /*
  * This gets called on the first call. It replaces the function pointer
@@ -61,4 +47,5 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	return pg_comp_crc32c(crc, data, len);
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+#endif
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
new file mode 100644
index 0000000000..260aa60502
--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,163 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard-to-see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;
+
+/*
+ * Helper function.
+ * Test for a bit being set in an exx_t register.
+ */
+inline static bool is_bit_set_in_exx(exx_t* regs, reg_name ex, int bit)
+{
+	return ((regs[ex] & (1 << bit)) != 0);
+}
+
+/*
+ * x86_64 Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * x86_64 Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline static bool
+osxsave_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 27); /* osxsave */
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
+inline static bool
+zmm_regs_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+inline static bool
+avx512_popcnt_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 14) && is_bit_set_in_exx(exx, EBX, 30);
+}
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+bool PGDLLIMPORT pg_popcount_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 23);
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ *
+ * PA: The call to 'osxsave_available' MUST precede the call to
+ *     'zmm_regs_available' function per NB above.
+ */
+bool PGDLLIMPORT pg_popcount_avx512_available(void)
+{
+	return osxsave_available() &&
+		zmm_regs_available() &&
+		avx512_popcnt_available();
+}
+
+/*
+ * Does CPUID say there's support for SSE 4.2?
+ */
+bool PGDLLIMPORT pg_crc32c_sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 20);
+}
+
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index c8a4f2b19f..1123a1a634 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -14,16 +14,7 @@
 
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
 #include <immintrin.h>
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
 #include "port/pg_bitutils.h"
 
 /*
@@ -33,75 +24,6 @@
  */
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-#ifdef HAVE_XSAVE_INTRINSICS
-pg_attribute_target("xsave")
-#endif
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
 /*
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
-- 
2.43.0

Attachment: v8-0003-Add-AVX-512-CRC32C-algorithm-with-a-runtime-check.patch (application/octet-stream)
From a54a3f6964f919ff65c50f4eadedae802bcc0689 Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Date: Thu, 21 Nov 2024 12:42:09 -0800
Subject: [PATCH v8 3/3] Add AVX-512 CRC32C algorithm with a runtime check

Adds pg_crc32c_avx512(): compute the crc32c of the buffer, where the
buffer length must be at least 256, and a multiple of 64. Based on:

"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
Instruction", V. Gopal, E. Ozturk, et al., 2009.

Benchmark numbers comparing against the SSE4.2 CRC32C algorithm were
generated using the drive_crc32c() function added in
src/test/modules/test_crc32c/test_crc32c.c.

+------------------+-----------------+----------------+-----------------+-------+------+
| Rate in bytes/us |    SDP (SPR)    |      m6i       |       m7i       |       |      |
+------------------+-----------------+----------------+-----------------+ Multi-|      |
| higher is better | SSE42  | AVX512 | SSE42 | AVX512 | SSE42  | AVX512 | plier |  %   |
+==================+========+========+=======+========+========+========+=======+======+
| AVG Rate 64-8192 | 10,095 | 82,101 | 8,591 | 38,652 | 11,867 | 83,194 | 6.68  | 568% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+
| AVG Rate 64-255  |  9,034 |  9,136 | 7,619 |  7,437 |  9,030 |  9,293 | 1.01  |   1% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+

Co-authored-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4                |  30 +++++
 configure                           | 152 ++++++++++++---------
 configure.ac                        | 107 +++++++--------
 meson.build                         |  22 +++
 src/include/pg_config.h.in          |   3 +
 src/include/pg_cpu.h                |  23 ++++
 src/include/port/pg_crc32c.h        |  55 +++-----
 src/include/port/pg_hw_feat_check.h |   6 +
 src/port/meson.build                |  10 +-
 src/port/pg_crc32c_avx512.c         | 202 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_sse42_choose.c   |  51 -------
 src/port/pg_crc32c_x86_choose.c     |  57 ++++++++
 src/port/pg_hw_feat_check.c         |  75 ++++++++++-
 13 files changed, 570 insertions(+), 223 deletions(-)
 create mode 100644 src/include/pg_cpu.h
 create mode 100644 src/port/pg_crc32c_avx512.c
 delete mode 100644 src/port/pg_crc32c_sse42_choose.c
 create mode 100644 src/port/pg_crc32c_x86_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 309d5b04b4..ede57a2ae5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -631,6 +631,36 @@ undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the VPCLMULQDQ and AVX-512VL intrinsics used
+# for CRC computation, with function __attribute__((target("..."))):
+
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128 with function attribute], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("avx512vl,vpclmulqdq")))
+    #endif
+    static int crc32_avx512_test(void)
+    {
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      return _mm_extract_epi32(a1, 0x0);
+    }],
+  [return crc32_avx512_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
+
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
 # Check if the compiler supports the CRC32C instructions using the __crc32cb,
diff --git a/configure b/configure
index ff59f1422d..8e7ff5bd96 100755
--- a/configure
+++ b/configure
@@ -17321,7 +17321,7 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+# Check for Intel SSE 4.2 and AVX-512 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
 $as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32... " >&6; }
@@ -17365,6 +17365,50 @@ if test x"$pgac_cv_sse42_crc32_intrinsics" = x"yes"; then
 fi
 
 
+# Check if _mm512_clmulepi64_epi128 and _mm_xor_epi64 can be used with
+# __attribute__((target("avx512vl,vpclmulqdq"))).
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128 with function attribute" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128 with function attribute... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("avx512vl,vpclmulqdq")))
+    #endif
+    static int crc32_avx512_test(void)
+    {
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      return _mm_extract_epi32(a1, 0x0);
+    }
+int
+main ()
+{
+return crc32_avx512_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics=yes
+else
+  pgac_cv_avx512_crc32_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics" = x"yes"; then
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17566,9 +17610,8 @@ fi
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
 # use the special CRC instructions for calculating CRC-32C. If we're not
 # targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, compile both implementations and select which one to use
-# at runtime, depending on whether SSE 4.2 is supported by the processor we're
-# running on.
+# the SSE/AVX-512 intrinsics, compile the available implementations and select
+# which one to use at runtime, depending on runtime cpuid information.
 #
 # Similarly, if we are targeting an ARM processor that has the CRC
 # instructions that are part of the ARMv8 CRC Extension, use them. And if
@@ -17585,95 +17628,80 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
-  else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
-    else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
-      else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
-          else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
-          fi
-        fi
-      fi
-    fi
-  fi
-fi
 
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking which CRC-32C implementation to use" >&5
 $as_echo_n "checking which CRC-32C implementation to use... " >&6; }
-if test x"$USE_SSE42_CRC32C" = x"1"; then
+if test x"$host_cpu" = x"x86_64"; then
+    #x86 only:
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
 
 $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
 
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
-$as_echo "SSE 4.2" >&6; }
-else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C baseline feature SSE 4.2" >&5
+$as_echo "CRC32C baseline feature SSE 4.2" >&6; }
+    else
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
-$as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+          PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C SSE42 with runtime check" >&5
+$as_echo "CRC32C SSE42 with runtime check" >&6; }
+        fi
+    fi
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_avx512.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C AVX-512 with runtime check" >&5
+$as_echo "CRC32C AVX-512 with runtime check" >&6; }
+    fi
+else
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  else
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+    else
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+      else
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
-        fi
       fi
     fi
   fi
 fi
 
 
-
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
   if test x"$PREFERRED_SEMAPHORES" = x"NAMED_POSIX" ; then
diff --git a/configure.ac b/configure.ac
index 2181700964..815deb60b2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2057,10 +2057,14 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+# Check for Intel SSE 4.2 and AVX-512 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
 
+# Check if _mm512_clmulepi64_epi128 and _mm_xor_epi64 can be used with
+# __attribute__((target("avx512vl,vpclmulqdq"))).
+PGAC_AVX512_CRC32_INTRINSICS([])
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2096,9 +2100,8 @@ AC_SUBST(CFLAGS_CRC)
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
 # use the special CRC instructions for calculating CRC-32C. If we're not
 # targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, compile both implementations and select which one to use
-# at runtime, depending on whether SSE 4.2 is supported by the processor we're
-# running on.
+# the SSE/AVX-512 intrinsics, compile the available implementations and select
+# which one to use at runtime, depending on runtime cpuid information.
 #
 # Similarly, if we are targeting an ARM processor that has the CRC
 # instructions that are part of the ARMv8 CRC Extension, use them. And if
@@ -2115,76 +2118,58 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
-  else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+
+AC_MSG_CHECKING([which CRC-32C implementation to use])
+if test x"$host_cpu" = x"x86_64"; then
+    #x86 only:
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+      AC_MSG_RESULT(CRC32C baseline feature SSE 4.2)
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
-      else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
-          else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
-          fi
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+          PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+          AC_MSG_RESULT(CRC32C SSE42 with runtime check)
         fi
-      fi
     fi
-  fi
-fi
-
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
-AC_MSG_CHECKING([which CRC-32C implementation to use])
-if test x"$USE_SSE42_CRC32C" = x"1"; then
-  AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  AC_MSG_RESULT(SSE 4.2)
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+      AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX-512 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_avx512.o"
+      AC_MSG_RESULT(CRC32C AVX-512 with runtime check)
+    fi
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+    AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    AC_MSG_RESULT(ARMv8 CRC instructions)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+      AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+        AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        AC_MSG_RESULT(LoongArch CRCC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
-        else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
-        fi
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
+        AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        AC_MSG_RESULT(slicing-by-8)
       fi
     fi
   fi
 fi
 AC_SUBST(PG_CRC32C_OBJS)
 
-
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
   if test x"$PREFERRED_SEMAPHORES" = x"NAMED_POSIX" ; then
diff --git a/meson.build b/meson.build
index 451c3f6d85..5325aa1106 100644
--- a/meson.build
+++ b/meson.build
@@ -2229,6 +2229,22 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     have_optimized_crc = true
   else
 
+    avx512_crc_prog = '''
+#include <immintrin.h>
+
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("avx512vl,vpclmulqdq")))
+#endif
+int main(void)
+{
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      return _mm_extract_epi32(a1, 0x0);
+}
+'''
+
     prog = '''
 #include <nmmintrin.h>
 
@@ -2259,6 +2275,12 @@ int main(void)
       cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
       have_optimized_crc = true
     endif
+    if cc.links(avx512_crc_prog,
+        name: 'AVX512 CRC32C with function attributes',
+        args: test_c_args)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
 
   endif
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index ab0f8cc2b4..55f16683c3 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -706,6 +706,9 @@
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use Intel AVX-512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to build with systemd support. (--with-systemd) */
 #undef USE_SYSTEMD
 
diff --git a/src/include/pg_cpu.h b/src/include/pg_cpu.h
new file mode 100644
index 0000000000..223994cb0d
--- /dev/null
+++ b/src/include/pg_cpu.h
@@ -0,0 +1,23 @@
+/*
+ * pg_cpu.h
+ *      Useful macros to determine CPU types
+ */
+
+#ifndef PG_CPU_H_
+#define PG_CPU_H_
+#if defined( __i386__ ) || defined(i386) || defined(_M_IX86)
+    /*
+     * __i386__ is defined by gcc and Intel compiler on Linux,
+     * _M_IX86 by VS compiler,
+     * i386 by Sun compilers on opensolaris at least
+     */
+    #define PG_CPU_X86
+#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64)
+    /*
+     * both __x86_64__ and __amd64__ are defined by gcc
+     * __x86_64 defined by sun compiler on opensolaris at least
+     * _M_AMD64 defined by MS compiler
+     */
+    #define PG_CPU_x86_64
+#endif
+#endif // PG_CPU_H_
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..690273506b 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -34,58 +34,43 @@
 #define PG_CRC32C_H
 
 #include "port/pg_bswap.h"
+#include "pg_cpu.h"
 
 typedef uint32 pg_crc32c;
 
 /* The INIT and EQ macros are the same for all implementations. */
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
-
-#if defined(USE_SSE42_CRC32C)
-/* Use Intel SSE4.2 instructions. */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
 #define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
+/* x86 */
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* ARMV8 */
 #elif defined(USE_ARMV8_CRC32C)
-/* Use ARMv8 CRC Extension instructions. */
-
+extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
+/* ARMV8 with runtime check */
+#elif defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* LoongArch */
 #elif defined(USE_LOONGARCH_CRC32C)
-/* Use LoongArch CRCC instructions. */
-
+extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
-
-/*
- * Use Intel SSE 4.2 or ARMv8 instructions, but perform a runtime check first
- * to check that they are available.
- */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
-
-#ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-#endif
-#ifdef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
-#endif
 
 #else
 /*
@@ -98,13 +83,11 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
 #ifdef WORDS_BIGENDIAN
+#undef FIN_CRC32C
 #define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
-#else
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 #endif
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-
 #endif
 
 #endif							/* PG_CRC32C_H */
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
index 58be900b54..3a73014987 100644
--- a/src/include/port/pg_hw_feat_check.h
+++ b/src/include/port/pg_hw_feat_check.h
@@ -30,4 +30,10 @@ extern PGDLLIMPORT bool pg_popcount_available(void);
  * available.
  */
 extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+
+/*
+ * Test to see if all hardware features required by the AVX-512 SIMD
+ * algorithm are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_avx512_available(void);
 #endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/meson.build b/src/port/meson.build
index ec28590473..0ba4a56194 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,8 +8,10 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
-  'pg_crc32c_sse42_choose.c',
+  'pg_crc32c_x86_choose.c',
+  'pg_crc32c_avx512.c',
   'pg_crc32c_sse42.c',
+  'pg_crc32c_sb8.c',
   'pg_hw_feat_check.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
@@ -83,12 +85,6 @@ endif
 # Replacement functionality to be built if corresponding configure symbol
 # is true
 replace_funcs_pos = [
-  # x86/x64
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..d8247e2e33
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,202 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+
+/*******************************************************************
+ * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
+ * buffer length must be at least 256, and a multiple of 64. Based
+ * on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009
+ *
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#if defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+pg_attribute_no_sanitize_alignment()
+pg_attribute_target("avx512vl,vpclmulqdq")
+inline pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+
+		/*
+		* There's at least one block of 256.
+		*/
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_load_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		* Parallel fold blocks of 256, if any.
+		*/
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+				}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_load_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_load_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes with the SSE 4.2 algorithm.
+	 */
+	return pg_comp_crc32c_sse42(crc, input, length);
+}
+#endif // AVX512_CRC32
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
deleted file mode 100644
index c659917af0..0000000000
--- a/src/port/pg_crc32c_sse42_choose.c
+++ /dev/null
@@ -1,51 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_crc32c_sse42_choose.c
- *	  Choose between Intel SSE 4.2 and software CRC-32C implementation.
- *
- * On first call, checks if the CPU we're running on supports Intel SSE
- * 4.2. If it does, use the special SSE instructions for CRC-32C
- * computation. Otherwise, fall back to the pure software implementation
- * (slicing-by-8).
- *
- * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/port/pg_crc32c_sse42_choose.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "c.h"
-
-#if defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
-#include "port/pg_crc32c.h"
-#include "port/pg_hw_feat_check.h"
-
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static pg_crc32c
-pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
-{
-	if (pg_crc32c_sse42_available())
-		pg_comp_crc32c = pg_comp_crc32c_sse42;
-	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
-	return pg_comp_crc32c(crc, data, len);
-}
-
-pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
-#endif
diff --git a/src/port/pg_crc32c_x86_choose.c b/src/port/pg_crc32c_x86_choose.c
new file mode 100644
index 0000000000..3ce8be11a6
--- /dev/null
+++ b/src/port/pg_crc32c_x86_choose.c
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_x86_choose.c
+ *	  Choose between Intel AVX-512, SSE 4.2 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel AVX-512. If
+ * it does, use the AVX-512 implementation; if not, use the SSE 4.2 instructions
+ * when available. Otherwise, fall back to pure software (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_x86_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "pg_cpu.h"
+
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+
+#include "port/pg_crc32c.h"
+#include "port/pg_hw_feat_check.h"
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ * (1) set pg_comp_crc32c pointer and (2) return the computed crc value
+ */
+static pg_crc32c
+pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
+{
+#ifdef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+	if (pg_crc32c_avx512_available()) {
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+                return pg_comp_crc32c(crc, data, len);
+        }
+#endif
+#ifdef USE_SSE42_CRC32C
+        pg_comp_crc32c = pg_comp_crc32c_sse42;
+        return pg_comp_crc32c(crc, data, len);
+#elif USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
+        if (pg_crc32c_sse42_available()) {
+                pg_comp_crc32c = pg_comp_crc32c_sse42;
+                return pg_comp_crc32c(crc, data, len);
+        }
+#endif
+        pg_comp_crc32c = pg_comp_crc32c_sb8;
+        return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+
+#endif // x86/x86_64
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 260aa60502..b2872fa708 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -11,6 +11,9 @@
  *-------------------------------------------------------------------------
  */
 #include "c.h"
+#include "pg_cpu.h"
+
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
 
 #if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
 #include <cpuid.h>
@@ -135,9 +138,60 @@ bool PGDLLIMPORT pg_popcount_available(void)
 	return is_bit_set_in_exx(exx, ECX, 23);
  }
 
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline static bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline static bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, ECX, 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline static bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 31); /* avx512-vl */
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline static bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set_in_exx(exx, ECX, 20); /* sse4.2 */
+}
+
+/****************************************************************************/
+/*                               Public API                                 */
+/****************************************************************************/
  /*
-  * Returns true if the CPU supports the instructions required for the AVX-512
-  * pg_popcount() implementation.
+  * Returns true if the CPU supports the instructions required for the
+  * AVX-512 pg_popcount() implementation.
   *
   * PA: The call to 'osxsave_available' MUST preceed the call to
   *     'zmm_regs_available' function per NB above.
@@ -154,10 +208,19 @@ bool PGDLLIMPORT pg_popcount_avx512_available(void)
  */
 bool PGDLLIMPORT pg_crc32c_sse42_available(void)
 {
-	exx_t exx[4] = {0, 0, 0, 0};
-
-	pg_getcpuid(1, exx);
+	return sse42_available();
+}
 
-	return is_bit_set_in_exx(exx, ECX, 20);
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+bool PGDLLIMPORT
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
 }
 
+#endif // #if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
-- 
2.43.0

#45Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#44)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Tue, Dec 03, 2024 at 03:46:16PM +0000, Devulapalli, Raghuveer wrote:

Raghuveer, would you mind rebasing this patch set now that the SSE4.2 patch is
committed?

Rebased to master branch.

Thanks! cfbot is showing a couple of errors [0] [1] [2]. 32-bit Linux is
failing to compile with the 64-bit intrinsics. I think it'd be fine to
limit this optimization to 64-bit builds unless the code can be easily fixed
to work for both. The macOS build seems to be trying to include the x86
headers, which is producing many errors. We'll need to make sure that none
of this code is being compiled on ARM machines. The Windows build seems to
be unable to resolve the pg_comp_crc32c symbol, but it is not immediately
obvious to me why.
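
As a rough illustration of the "fixed to work for both" option (this is not code from the patch set): the final 128-bit to 32-bit fold in the patch uses the 64-bit-only `_mm_crc32_u64`. The CRC32 instruction consumes its operand bytes least significant first, so one 64-bit step should be equivalent to two 32-bit steps on the low and high halves. The sketch below models that in portable software rather than with the real intrinsics; all `sketch_*` names are hypothetical. `_mm_extract_epi64` may need a similar 32-bit counterpart, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C update (Castagnoli, reflected polynomial 0x82F63B78). */
uint32_t
sketch_crc32c_bytes(uint32_t crc, const unsigned char *p, size_t len)
{
	assert(p != NULL || len == 0);
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Software model of a 32-bit CRC32 step: 4 bytes, least significant first. */
uint32_t
sketch_crc32_u32(uint32_t crc, uint32_t v)
{
	unsigned char b[4];

	for (int i = 0; i < 4; i++)
		b[i] = (unsigned char) (v >> (8 * i));
	return sketch_crc32c_bytes(crc, b, 4);
}

/* Software model of a 64-bit CRC32 step: 8 bytes, least significant first. */
uint32_t
sketch_crc32_u64(uint32_t crc, uint64_t v)
{
	unsigned char b[8];

	for (int i = 0; i < 8; i++)
		b[i] = (unsigned char) (v >> (8 * i));
	return sketch_crc32c_bytes(crc, b, 8);
}

/* The same 8 bytes consumed as two 32-bit steps: low half, then high half. */
uint32_t
sketch_crc32_u64_via_u32(uint32_t crc, uint64_t v)
{
	crc = sketch_crc32_u32(crc, (uint32_t) v);
	return sketch_crc32_u32(crc, (uint32_t) (v >> 32));
}
```

Whether a 32-bit variant is worth carrying is a separate question; restricting the AVX-512 path to 64-bit builds avoids the issue entirely.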

[0]: https://cirrus-ci.com/task/6023394207989760
[1]: https://cirrus-ci.com/task/5460444254568448
[2]: https://cirrus-ci.com/task/6586344161411072

--
nathan

#46Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#45)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Thanks! cfbot is showing a couple of errors [0] [1] [2].

Oh yikes, the CI had passed with an earlier version. Wonder if I made a mess of the rebase. I will take a look and fix them.

Raghuveer

#47John Naylor
johncnaylorls@gmail.com
In reply to: Andres Freund (#8)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Thu, Jun 13, 2024 at 3:11 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-05-01 15:56:08 +0000, Amonson, Paul D wrote:

Workload call size distribution details (write heavy):
* Average was approximately around 1,010 bytes per call
* ~80% of the calls were under 256 bytes
* ~20% of the calls were greater than or equal to 256 bytes up to the max buffer size of 8192

This is extremely workload dependent, it's not hard to find workloads with
lots of very small records and very few big ones... What you observed might
have "just" been the warmup behaviour where more full page writes have to be
written.

Sorry for going back so far, but this thread was pointed out to me,
and this aspect of the design could use some more discussion:

+ * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
+ * buffer length must be at least 256, and a multiple of 64. Based

There is another technique that computes CRC on 3 separate chunks and
combines them at the end, so it is about 3x faster on large-enough chunks.
That's the approach used for the Arm proposal [0], coincidentally also
citing a white paper from Intel, but as Dimitry pointed out in that
thread, its link has apparently disappeared. Raghuveer, do you know
about this, and is there another link available?

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

The cutoff point in one implementation is only 144 bytes [1], which
is maybe not as small as we'd like, but is quite a bit smaller than
256. That seems better suited to our workloads, and more portable. I
have a *brand-new* laptop with an Intel chip, and IIUC it doesn't
support AVX-512 because it uses a big-little architecture. I also
understand that Sierra Forest (a server product line) will be all
little cores with no AVX-512 support, so I'm not sure why the proposal
here requires AVX-512.
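
To make the 3-chunk idea above concrete, here is a minimal software sketch (an illustration only, not the Arm proposal's code and not from the linked implementation). It relies on the CRC register update being linear over GF(2): each chunk can be processed independently, and the partial results are combined by advancing the earlier ones past the bytes that follow them. The advance step here feeds zero bytes one at a time, which is correct but slow; real implementations would replace it with a carry-less multiplication by precomputed constants. All `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C update (Castagnoli, reflected polynomial 0x82F63B78). */
uint32_t
sketch_crc32c_raw(uint32_t crc, const unsigned char *p, size_t len)
{
	assert(p != NULL || len == 0);
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Advance a CRC register as if nbytes zero bytes followed (slow stand-in). */
uint32_t
sketch_crc32c_shift(uint32_t crc, size_t nbytes)
{
	static const unsigned char zero = 0;

	while (nbytes--)
		crc = sketch_crc32c_raw(crc, &zero, 1);
	return crc;
}

/*
 * Split the buffer into chunks A, B, C. Only A starts from the incoming
 * register value; B and C start from zero, so the three loops have no data
 * dependency on each other and could be interleaved on real hardware.
 */
uint32_t
sketch_crc32c_three_way(uint32_t crc, const unsigned char *p, size_t len)
{
	size_t		n = len / 3;
	size_t		last = len - 2 * n;
	uint32_t	a = sketch_crc32c_raw(crc, p, n);
	uint32_t	b = sketch_crc32c_raw(0, p + n, n);
	uint32_t	c = sketch_crc32c_raw(0, p + 2 * n, last);

	/* Shift A past B and C, shift B past C, then XOR everything together. */
	return sketch_crc32c_shift(a, n + last) ^ sketch_crc32c_shift(b, last) ^ c;
}
```

The result is expected to match a single pass over the whole buffer for any length; the speedup only appears once the combine step is cheap relative to the chunk size, which is where a cutoff such as the 144 bytes mentioned above comes from.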

There's a very frequent call computing COMP_CRC32C over just 20 bytes, while
holding a crucial lock. If we were to introduce something like this
AVX-512 algorithm, it'd probably be worth dispatching differently in case of
compile-time known small lengths.

I know you've read an earlier version of the patch and realized that
it wouldn't help here, but we could probably dispatch differently
regardless, although it may only be worth it if we can inline the
instructions. Since we technically only need to wait for xl_prev, I
believe we could push the computation of the other 12 bytes to before
acquiring the lock, then only execute a single instruction on xl_prev
to complete the CRC computation. Is there any reason why we couldn't
do that, assuming we have a clean way to make that portable? That
would mean that the CRCs between major versions would be different,
but I think we don't guarantee that anyway.
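
A small sketch of the reordering described above, under stated assumptions: the header struct below is hypothetical (12 bytes of "other" fields plus an 8-byte xl_prev, 20 bytes in total) and does not reproduce the real record layout, and the CRC is a portable bitwise version rather than the hardware instruction. It shows the 12 bytes being folded in before the lock and only the 8 xl_prev bytes afterwards. Because xl_prev is consumed last instead of in its in-struct position, the resulting value would generally differ from the current one, which is the cross-version change mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C update (Castagnoli, reflected polynomial 0x82F63B78). */
uint32_t
sketch_crc32c_bytes(uint32_t crc, const unsigned char *p, size_t len)
{
	assert(p != NULL || len == 0);
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Hypothetical 20-byte header: everything except xl_prev, then xl_prev. */
typedef struct
{
	unsigned char other[12];	/* fields known before taking the lock */
	unsigned char prev[8];		/* xl_prev, known only under the lock */
} sketch_header;

/* Step 1, outside the lock: fold in the 12 bytes that are already final. */
uint32_t
sketch_crc_before_lock(uint32_t crc, const sketch_header *h)
{
	return sketch_crc32c_bytes(crc, h->other, sizeof(h->other));
}

/* Step 2, under the lock: only the 8 bytes of xl_prev remain. */
uint32_t
sketch_crc_under_lock(uint32_t partial, const sketch_header *h)
{
	return sketch_crc32c_bytes(partial, h->prev, sizeof(h->prev));
}
```

With the hardware instruction, step 2 would correspond to a single 64-bit CRC step on 64-bit builds, which is what makes the idea attractive while the lock is held.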

[0]: https://commitfest.postgresql.org/50/4620/
[1]: https://github.com/komrad36/CRC/blob/master/CRC/golden_intel.cpp#L138C27-L138C42

--
John Naylor
Amazon Web Services

#48Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#45)
3 attachment(s)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

[0] https://cirrus-ci.com/task/6023394207989760
[1] https://cirrus-ci.com/task/5460444254568448
[2] https://cirrus-ci.com/task/6586344161411072

I was able to fix [0] and [1], but I can't think of why [2] fails. When I tried to reproduce this locally, I get a different unrelated error. Any idea why I am seeing this?

LINK : fatal error LNK1181: cannot open input file 'C:\Program Files\Git\nologo'

Commands: meson setup build && cd build && meson compile

Attachments:

v9-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch (application/octet-stream)
From d2c8e4e9d97fcdb72242e9a4d8a438246170388b Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v9 1/3] Add a Postgres SQL function for crc32c benchmarking.

Add a drive_crc32c() function to use for benchmarking crc32c
computation. The function takes 2 arguments:

(1) count: num of times CRC32C is computed in a loop.
(2) num: #bytes in the buffer to calculate crc over.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_crc32c/Makefile         | 20 ++++++++
 src/test/modules/test_crc32c/meson.build      | 22 +++++++++
 .../modules/test_crc32c/test_crc32c--1.0.sql  |  1 +
 src/test/modules/test_crc32c/test_crc32c.c    | 47 +++++++++++++++++++
 .../modules/test_crc32c/test_crc32c.control   |  4 ++
 6 files changed, 95 insertions(+)
 create mode 100644 src/test/modules/test_crc32c/Makefile
 create mode 100644 src/test/modules/test_crc32c/meson.build
 create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c829b61953..68d8904dd0 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -15,6 +15,7 @@ subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
+subdir('test_crc32c')
 subdir('test_ddl_deparse')
 subdir('test_dsa')
 subdir('test_dsm_registry')
diff --git a/src/test/modules/test_crc32c/Makefile b/src/test/modules/test_crc32c/Makefile
new file mode 100644
index 0000000000..5b747c6184
--- /dev/null
+++ b/src/test/modules/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o:	CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_crc32c/meson.build b/src/test/modules/test_crc32c/meson.build
new file mode 100644
index 0000000000..7021a6d6cf
--- /dev/null
+++ b/src/test/modules/test_crc32c/meson.build
@@ -0,0 +1,22 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_crc32c_sources = files(
+  'test_crc32c.c',
+)
+
+if host_system == 'windows'
+  test_crc32c_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_crc32c',
+    '--FILEDESC', 'test_crc32c - test code for crc32c library',])
+endif
+
+test_crc32c = shared_module('test_crc32c',
+  test_crc32c_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_crc32c
+
+test_install_data += files(
+  'test_crc32c.control',
+  'test_crc32c--1.0.sql',
+)
diff --git a/src/test/modules/test_crc32c/test_crc32c--1.0.sql b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 0000000000..32f8f0fb2e
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c  (count int, num int) RETURNS bigint AS 'test_crc32c.so' LANGUAGE C;
diff --git a/src/test/modules/test_crc32c/test_crc32c.c b/src/test/modules/test_crc32c/test_crc32c.c
new file mode 100644
index 0000000000..b350caf5ce
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.c
@@ -0,0 +1,47 @@
+/* select drive_crc32c(1000000, 1024); */
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "common/pg_prng.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64			count	= PG_GETARG_INT64(0);
+	int64			num		= PG_GETARG_INT64(1);
+	char*		data	= malloc((size_t)num);
+	pg_crc32c crc;
+	pg_prng_state state;
+	uint64 seed = 42;
+	pg_prng_seed(&state, seed);
+	/* set random data */
+	for (uint64 i = 0; i < num; i++)
+	{
+		data[i] = pg_prng_uint32(&state) % 255;
+	}
+
+	INIT_CRC32C(crc);
+
+	while(count--)
+	{
+		INIT_CRC32C(crc);
+		COMP_CRC32C(crc, data, num);
+		FIN_CRC32C(crc);
+	}
+
+	free((void *)data);
+
+	PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/src/test/modules/test_crc32c/test_crc32c.control b/src/test/modules/test_crc32c/test_crc32c.control
new file mode 100644
index 0000000000..878a077ee1
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_crc32c'
+relocatable = true
-- 
2.43.0

v9-0002-Refactor-consolidate-x86-ISA-and-OS-runtime-check.patch (application/octet-stream)
From 9c6e206160c966df899121c81e2e79d3a3b95d89 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH v9 2/3] Refactor: consolidate x86 ISA and OS runtime checks

Move all x86 ISA and OS runtime checks into a single file for improved
modularity and easier future maintenance.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/include/port/pg_bitutils.h      |   1 -
 src/include/port/pg_hw_feat_check.h |  33 ++++++
 src/port/Makefile                   |   1 +
 src/port/meson.build                |   3 +
 src/port/pg_bitutils.c              |  22 +---
 src/port/pg_crc32c_sse42_choose.c   |  21 +---
 src/port/pg_hw_feat_check.c         | 163 ++++++++++++++++++++++++++++
 src/port/pg_popcount_avx512.c       |  78 -------------
 8 files changed, 205 insertions(+), 117 deletions(-)
 create mode 100644 src/include/port/pg_hw_feat_check.h
 create mode 100644 src/port/pg_hw_feat_check.c

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 4d88478c9c..263f27930d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -312,7 +312,6 @@ extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int
  * files.
  */
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
new file mode 100644
index 0000000000..58be900b54
--- /dev/null
+++ b/src/include/port/pg_hw_feat_check.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.h
+ *	  Miscellaneous functions for checking for hardware features at runtime.
+ *
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_hw_feat_check.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_HW_FEAT_CHECK_H
+#define PG_HW_FEAT_CHECK_H
+
+/*
+ * Test to see if all hardware features required by SSE 4.2 crc32c (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_sse42_available(void);
+
+/*
+ * Test to see if all hardware features required by SSE 4.1 POPCNT (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_popcount_available(void);
+
+/*
+ * Test to see if all hardware features required by AVX-512 POPCNT are
+ * available.
+ */
+extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+#endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..6088b56b71 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_popcount_avx512.o \
+	pg_hw_feat_check.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index c5bceed9cd..ec28590473 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,9 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
+  'pg_crc32c_sse42_choose.c',
+  'pg_crc32c_sse42.c',
+  'pg_hw_feat_check.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 87f56e82b8..b2823d5732 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -20,7 +20,7 @@
 #endif
 
 #include "port/pg_bitutils.h"
-
+#include "port/pg_hw_feat_check.h"
 
 /*
  * Array giving the position of the left-most set bit for each possible
@@ -109,7 +109,6 @@ static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
@@ -127,25 +126,6 @@ uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask)
 
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
-
 /*
  * These functions get called on the first call to pg_popcount32 etc.
  * They detect whether we can use the asm implementations, and replace
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 56d600f3a9..c659917af0 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,6 +20,7 @@
 
 #include "c.h"
 
+#if defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
 #ifdef HAVE__GET_CPUID
 #include <cpuid.h>
 #endif
@@ -29,22 +30,7 @@
 #endif
 
 #include "port/pg_crc32c.h"
-
-static bool
-pg_crc32c_sse42_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
-}
+#include "port/pg_hw_feat_check.h"
 
 /*
  * This gets called on the first call. It replaces the function pointer
@@ -61,4 +47,5 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	return pg_comp_crc32c(crc, data, len);
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+#endif
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
new file mode 100644
index 0000000000..260aa60502
--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,163 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;
+
+/*
+ * Helper function.
+ * Test for a bit being set in an exx_t register.
+ */
+inline static bool is_bit_set_in_exx(exx_t* regs, reg_name ex, int bit)
+{
+	return ((regs[ex] & (1U << bit)) != 0);
+}
+
+/*
+ * x86_64 Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * x86_64 Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline static bool
+osxsave_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 27); /* osxsave */
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
+inline static bool
+zmm_regs_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+inline static bool
+avx512_popcnt_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 14) && is_bit_set_in_exx(exx, EBX, 30);
+}
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+bool PGDLLIMPORT pg_popcount_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 23);
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ *
+ * PA: The call to 'osxsave_available' MUST precede the call to
+ *     'zmm_regs_available' function per NB above.
+ */
+bool PGDLLIMPORT pg_popcount_avx512_available(void)
+{
+	return osxsave_available() &&
+		zmm_regs_available() &&
+		avx512_popcnt_available();
+}
+
+/*
+ * Does CPUID say there's support for SSE 4.2?
+ */
+bool PGDLLIMPORT pg_crc32c_sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 20);
+}
+
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index c8a4f2b19f..1123a1a634 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -14,16 +14,7 @@
 
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
 #include <immintrin.h>
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
 #include "port/pg_bitutils.h"
 
 /*
@@ -33,75 +24,6 @@
  */
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-#ifdef HAVE_XSAVE_INTRINSICS
-pg_attribute_target("xsave")
-#endif
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
 /*
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
-- 
2.43.0
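
The bit tests that the refactored pg_hw_feat_check.c performs can be illustrated without executing CPUID or XGETBV. The following is a minimal sketch and not part of the patch: feature_bit_set(), zmm_state_enabled() and the REG_* names are hypothetical, made up for this example. The bit positions (leaf 1 ECX bit 20 for SSE 4.2, bit 23 for POPCNT, bit 27 for OSXSAVE) and the 0xe6 XCR0 mask are the ones used in the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the four CPUID output registers, by array index. */
enum cpuid_reg
{
	REG_EAX = 0,
	REG_EBX = 1,
	REG_ECX = 2,
	REG_EDX = 3
};

/* Hypothetical helper: test one bit of one CPUID output register. */
static bool
feature_bit_set(const uint32_t regs[4], enum cpuid_reg reg, int bit)
{
	assert(bit >= 0 && bit < 32);
	/* Unsigned shift, so that testing bit 31 is also well defined. */
	return (regs[reg] & (UINT32_C(1) << bit)) != 0;
}

/*
 * Hypothetical helper: given an XCR0 value (what _xgetbv(0) would return),
 * report whether every bit of the 0xe6 mask is set, i.e. the OS has enabled
 * the SSE, AVX, opmask and both ZMM register state components.
 */
static bool
zmm_state_enabled(uint64_t xcr0)
{
	return (xcr0 & 0xe6) == 0xe6;
}
```

In real code the regs array would be filled by __get_cpuid() or __cpuid(), and the XCR0 value would only be read after the OSXSAVE bit has been confirmed, as the NB comment in the patch requires. For example, an XCR0 value of 0x07 (x87, SSE and AVX state only) fails the mask test, while 0xe7 passes it.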

Attachment: v9-0003-Add-AVX-512-CRC32C-algorithm-with-a-runtime-check.patch (application/octet-stream)
From 45210dd0ec6902a3408fc6716f6a5b06ee521e43 Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Date: Thu, 21 Nov 2024 12:42:09 -0800
Subject: [PATCH v9 3/3] Add AVX-512 CRC32C algorithm with a runtime check

Adds pg_crc32c_avx512(): compute the crc32c of the buffer, where the
buffer length must be at least 256, and a multiple of 64. Based on:

"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
Instruction", V. Gopal, E. Ozturk, et al., 2009

Benchmark numbers to compare against the SSE4.2 CRC32C algorithm were
generated using the drive_crc32c() function added in
src/test/modules/test_crc32c/test_crc32c.c.

+------------------+-----------------+----------------+-----------------+-------+------+
| Rate in bytes/us |    SDP (SPR)    |      m6i       |       m7i       |       |      |
+------------------+-----------------+----------------+-----------------+ Multi-|      |
| higher is better | SSE42  | AVX512 | SSE42 | AVX512 | SSE42  | AVX512 | plier |  %   |
+==================+========+========+=======+========+========+========+=======+======+
| AVG Rate 64-8192 | 10,095 | 82,101 | 8,591 | 38,652 | 11,867 | 83,194 | 6.68  | 568% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+
| AVG Rate 64-255  |  9,034 |  9,136 | 7,619 |  7,437 |  9,030 |  9,293 | 1.01  |   1% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+
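
The length constraint stated above (at least 256 bytes, and a multiple of 64) implies that a caller has to split each buffer. The following is only a sketch of one way to do that, not the code in this patch: crc32c_bytewise(), crc32c_wide_standin() and crc32c_dispatch() are hypothetical names, and the "wide" kernel is a plain bit-at-a-time stand-in so that the example stays portable. It relies only on the fact that a CRC can be continued from the value returned for the preceding bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference CRC-32C update, one bit at a time (reflected poly 0x82F63B78). */
static uint32_t
crc32c_bytewise(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/*
 * Hypothetical stand-in for a wide kernel that, like the one described
 * above, only accepts lengths >= 256 that are a multiple of 64.
 */
static uint32_t
crc32c_wide_standin(uint32_t crc, const void *data, size_t len)
{
	assert(len >= 256 && len % 64 == 0);
	return crc32c_bytewise(crc, data, len);
}

/* Feed the largest eligible prefix to the wide kernel, the rest bytewise. */
static uint32_t
crc32c_dispatch(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	if (len >= 256)
	{
		size_t		bulk = len & ~(size_t) 63;	/* round down to 64 */

		crc = crc32c_wide_standin(crc, p, bulk);
		p += bulk;
		len -= bulk;
	}
	return crc32c_bytewise(crc, p, len);
}
```

With this split, a 1000-byte buffer would send 960 bytes to the wide kernel and the remaining 40 bytes to the fallback, while any buffer shorter than 256 bytes would go to the fallback entirely.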

Co-authored-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4                |  32 +++++
 configure                           | 154 ++++++++++++---------
 configure.ac                        | 107 +++++++--------
 meson.build                         |  23 ++++
 src/include/pg_config.h.in          |   3 +
 src/include/pg_cpu.h                |  23 ++++
 src/include/port/pg_crc32c.h        |  55 +++-----
 src/include/port/pg_hw_feat_check.h |   6 +
 src/port/meson.build                |  10 +-
 src/port/pg_crc32c_avx512.c         | 203 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_sse42.c          |   2 +
 src/port/pg_crc32c_sse42_choose.c   |  51 -------
 src/port/pg_crc32c_x86_choose.c     |  57 ++++++++
 src/port/pg_hw_feat_check.c         |  75 +++++++++-
 14 files changed, 578 insertions(+), 223 deletions(-)
 create mode 100644 src/include/pg_cpu.h
 create mode 100644 src/port/pg_crc32c_avx512.c
 delete mode 100644 src/port/pg_crc32c_sse42_choose.c
 create mode 100644 src/port/pg_crc32c_x86_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 309d5b04b4..a344f185b7 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -631,6 +631,38 @@ undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the AVX-512 (AVX512VL and VPCLMULQDQ)
+# intrinsics used to compute CRC-32C, with function __attribute__((target("..."))):
+
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128 with function attribute], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    #include <stdint.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("avx512vl,vpclmulqdq")))
+    #endif
+    static int crc32_avx512_test(void)
+    {
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      int64_t val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0)); // 64-bit instruction
+      return (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+    }],
+  [return crc32_avx512_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
+
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
 # Check if the compiler supports the CRC32C instructions using the __crc32cb,
diff --git a/configure b/configure
index ff59f1422d..4cc2197466 100755
--- a/configure
+++ b/configure
@@ -17321,7 +17321,7 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+# Check for Intel SSE 4.2 and AVX-512 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
 $as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32... " >&6; }
@@ -17365,6 +17365,52 @@ if test x"$pgac_cv_sse42_crc32_intrinsics" = x"yes"; then
 fi
 
 
+# Check if _mm512_clmulepi64_epi128 and _mm_xor_epi64 can be used with
+# the __attribute__((target("avx512vl,vpclmulqdq"))).
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128 with function attribute" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128 with function attribute... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+    #include <stdint.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("avx512vl,vpclmulqdq")))
+    #endif
+    static int crc32_avx512_test(void)
+    {
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      int64_t val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0)); // 64-bit instruction
+      return (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+    }
+int
+main ()
+{
+return crc32_avx512_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics=yes
+else
+  pgac_cv_avx512_crc32_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics" = x"yes"; then
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17566,9 +17612,8 @@ fi
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
 # use the special CRC instructions for calculating CRC-32C. If we're not
 # targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, compile both implementations and select which one to use
-# at runtime, depending on whether SSE 4.2 is supported by the processor we're
-# running on.
+# the SSE/AVX-512 intrinsics, compile both implementations and select which one
+# to use at runtime, depending on runtime cpuid information.
 #
 # Similarly, if we are targeting an ARM processor that has the CRC
 # instructions that are part of the ARMv8 CRC Extension, use them. And if
@@ -17585,95 +17630,80 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
-  else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
-    else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
-      else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
-          else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
-          fi
-        fi
-      fi
-    fi
-  fi
-fi
 
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking which CRC-32C implementation to use" >&5
 $as_echo_n "checking which CRC-32C implementation to use... " >&6; }
-if test x"$USE_SSE42_CRC32C" = x"1"; then
+if test x"$host_cpu" = x"x86_64"; then
+    #x86 only:
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
 
 $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
 
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
-$as_echo "SSE 4.2" >&6; }
-else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C baseline feature SSE 4.2" >&5
+$as_echo "CRC32C baseline feature SSE 4.2" >&6; }
+    else
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
-$as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+          PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C SSE42 with runtime check" >&5
+$as_echo "CRC32C SSE42 with runtime check" >&6; }
+        fi
+    fi
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+      PG_CRC32C_OBJS+=" pg_crc32c_avx512.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C AVX-512 with runtime check" >&5
+$as_echo "CRC32C AVX-512 with runtime check" >&6; }
+    fi
+else
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  else
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+    else
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+      else
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
-        fi
       fi
     fi
   fi
 fi
 
 
-
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
   if test x"$PREFERRED_SEMAPHORES" = x"NAMED_POSIX" ; then
diff --git a/configure.ac b/configure.ac
index 2181700964..815deb60b2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2057,10 +2057,14 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+# Check for Intel SSE 4.2 and AVX-512 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
 
+# Check if _mm512_clmulepi64_epi128 and _mm_xor_epi64 can be used with
+# the __attribute__((target("avx512vl,vpclmulqdq"))).
+PGAC_AVX512_CRC32_INTRINSICS([])
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2096,9 +2100,8 @@ AC_SUBST(CFLAGS_CRC)
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
 # use the special CRC instructions for calculating CRC-32C. If we're not
 # targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, compile both implementations and select which one to use
-# at runtime, depending on whether SSE 4.2 is supported by the processor we're
-# running on.
+# the SSE/AVX-512 intrinsics, compile both implementations and select which one
+# to use at runtime, depending on runtime cpuid information.
 #
 # Similarly, if we are targeting an ARM processor that has the CRC
 # instructions that are part of the ARMv8 CRC Extension, use them. And if
@@ -2115,76 +2118,58 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
-  else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+
+AC_MSG_CHECKING([which CRC-32C implementation to use])
+if test x"$host_cpu" = x"x86_64"; then
+    #x86 only:
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions.])
+      PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+      AC_MSG_RESULT(CRC32C baseline feature SSE 4.2)
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
-      else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
-          else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
-          fi
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+          PG_CRC32C_OBJS+=" pg_crc32c_sse42.o"
+          AC_MSG_RESULT(CRC32C SSE42 with runtime check)
         fi
-      fi
     fi
-  fi
-fi
-
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
-AC_MSG_CHECKING([which CRC-32C implementation to use])
-if test x"$USE_SSE42_CRC32C" = x"1"; then
-  AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  AC_MSG_RESULT(SSE 4.2)
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+      AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX-512 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS+=" pg_crc32c_avx512.o"
+      AC_MSG_RESULT(CRC32C AVX-512 with runtime check)
+    fi
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+    AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    AC_MSG_RESULT(ARMv8 CRC instructions)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+      AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+        AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        AC_MSG_RESULT(LoongArch CRCC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
-        else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
-        fi
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
+        AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        AC_MSG_RESULT(slicing-by-8)
       fi
     fi
   fi
 fi
 AC_SUBST(PG_CRC32C_OBJS)
 
-
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
   if test x"$PREFERRED_SEMAPHORES" = x"NAMED_POSIX" ; then
diff --git a/meson.build b/meson.build
index 451c3f6d85..deb66df3e5 100644
--- a/meson.build
+++ b/meson.build
@@ -2229,6 +2229,23 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     have_optimized_crc = true
   else
 
+    avx512_crc_prog = '''
+#include <immintrin.h>
+#include <stdint.h>
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("avx512vl,vpclmulqdq")))
+#endif
+int main(void)
+{
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      int64_t val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0)); // 64-bit instruction
+      return (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+}
+'''
+
     prog = '''
 #include <nmmintrin.h>
 
@@ -2259,6 +2276,12 @@ int main(void)
       cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
       have_optimized_crc = true
     endif
+    if cc.links(avx512_crc_prog,
+        name: 'AVX512 CRC32C with function attributes',
+        args: test_c_args)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
 
   endif
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index ab0f8cc2b4..55f16683c3 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -706,6 +706,9 @@
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use Intel AVX-512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to build with systemd support. (--with-systemd) */
 #undef USE_SYSTEMD
 
diff --git a/src/include/pg_cpu.h b/src/include/pg_cpu.h
new file mode 100644
index 0000000000..223994cb0d
--- /dev/null
+++ b/src/include/pg_cpu.h
@@ -0,0 +1,23 @@
+/*
+ * pg_cpu.h
+ *      Useful macros to determine CPU types
+ */
+
+#ifndef PG_CPU_H_
+#define PG_CPU_H_
+#if defined( __i386__ ) || defined(i386) || defined(_M_IX86)
+    /*
+     * __i386__ is defined by gcc and Intel compiler on Linux,
+     * _M_IX86 by VS compiler,
+     * i386 by Sun compilers on opensolaris at least
+     */
+    #define PG_CPU_X86
+#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64)
+    /*
+     * both __x86_64__ and __amd64__ are defined by gcc
+     * __x86_64 defined by sun compiler on opensolaris at least
+     * _M_AMD64 defined by MS compiler
+     */
+    #define PG_CPU_x86_64
+#endif
+#endif							/* PG_CPU_H_ */
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..690273506b 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -34,58 +34,43 @@
 #define PG_CRC32C_H
 
 #include "port/pg_bswap.h"
+#include "pg_cpu.h"
 
 typedef uint32 pg_crc32c;
 
 /* The INIT and EQ macros are the same for all implementations. */
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
-
-#if defined(USE_SSE42_CRC32C)
-/* Use Intel SSE4.2 instructions. */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
 #define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
+/* x86 */
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* ARMV8 */
 #elif defined(USE_ARMV8_CRC32C)
-/* Use ARMv8 CRC Extension instructions. */
-
+extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
+/* ARMV8 with runtime check */
+#elif defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* LoongArch */
 #elif defined(USE_LOONGARCH_CRC32C)
-/* Use LoongArch CRCC instructions. */
-
+extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
-
-/*
- * Use Intel SSE 4.2 or ARMv8 instructions, but perform a runtime check first
- * to check that they are available.
- */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
-
-#ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-#endif
-#ifdef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
-#endif
 
 #else
 /*
@@ -98,13 +83,11 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
 #ifdef WORDS_BIGENDIAN
+#undef FIN_CRC32C
 #define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
-#else
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 #endif
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-
 #endif
 
 #endif							/* PG_CRC32C_H */
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
index 58be900b54..3a73014987 100644
--- a/src/include/port/pg_hw_feat_check.h
+++ b/src/include/port/pg_hw_feat_check.h
@@ -30,4 +30,10 @@ extern PGDLLIMPORT bool pg_popcount_available(void);
  * available.
  */
 extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+
+/*
+ * Test to see if all hardware features required by the AVX-512 SIMD
+ * algorithm are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_avx512_available(void);
 #endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/meson.build b/src/port/meson.build
index ec28590473..0ba4a56194 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,8 +8,10 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
-  'pg_crc32c_sse42_choose.c',
+  'pg_crc32c_x86_choose.c',
+  'pg_crc32c_avx512.c',
   'pg_crc32c_sse42.c',
+  'pg_crc32c_sb8.c',
   'pg_hw_feat_check.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
@@ -83,12 +85,6 @@ endif
 # Replacement functionality to be built if corresponding configure symbol
 # is true
 replace_funcs_pos = [
-  # x86/x64
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..ba4defcefd
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,203 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#if defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+
+/*******************************************************************
+ * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
+ * buffer length must be at least 256, and a multiple of 64. Based
+ * on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009
+ *
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+
+pg_attribute_no_sanitize_alignment()
+pg_attribute_target("avx512vl,vpclmulqdq")
+inline pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+
+		/*
+		* There's at least one block of 256.
+		*/
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_loadu_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		* Parallel fold blocks of 256, if any.
+		*/
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_loadu_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes with the SSE 4.2 algorithm.
+	 */
+	return pg_comp_crc32c_sse42(crc, input, length);
+}
+#endif // AVX512_CRC32
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index dcc4904a82..90d155e804 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -14,6 +14,7 @@
  */
 #include "c.h"
 
+#if defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
 #include <nmmintrin.h>
 
 #include "port/pg_crc32c.h"
@@ -68,3 +69,4 @@ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
 
 	return crc;
 }
+#endif
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
deleted file mode 100644
index c659917af0..0000000000
--- a/src/port/pg_crc32c_sse42_choose.c
+++ /dev/null
@@ -1,51 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_crc32c_sse42_choose.c
- *	  Choose between Intel SSE 4.2 and software CRC-32C implementation.
- *
- * On first call, checks if the CPU we're running on supports Intel SSE
- * 4.2. If it does, use the special SSE instructions for CRC-32C
- * computation. Otherwise, fall back to the pure software implementation
- * (slicing-by-8).
- *
- * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/port/pg_crc32c_sse42_choose.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "c.h"
-
-#if defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
-#include "port/pg_crc32c.h"
-#include "port/pg_hw_feat_check.h"
-
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static pg_crc32c
-pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
-{
-	if (pg_crc32c_sse42_available())
-		pg_comp_crc32c = pg_comp_crc32c_sse42;
-	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
-	return pg_comp_crc32c(crc, data, len);
-}
-
-pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
-#endif
diff --git a/src/port/pg_crc32c_x86_choose.c b/src/port/pg_crc32c_x86_choose.c
new file mode 100644
index 0000000000..3ce8be11a6
--- /dev/null
+++ b/src/port/pg_crc32c_x86_choose.c
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_x86_choose.c
+ *	  Choose between Intel AVX-512, SSE 4.2 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel AVX-512. If
+ * it does, use the AVX-512 implementation; otherwise try the SSE 4.2 CRC-32C
+ * instructions, and finally fall back to pure software (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_x86_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "pg_cpu.h"
+
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+
+#include "port/pg_crc32c.h"
+#include "port/pg_hw_feat_check.h"
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ * (1) set pg_comp_crc32c pointer and (2) return the computed crc value
+ */
+static pg_crc32c
+pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
+{
+#ifdef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+	if (pg_crc32c_avx512_available()) {
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+		return pg_comp_crc32c(crc, data, len);
+	}
+#endif
+#ifdef USE_SSE42_CRC32C
+	pg_comp_crc32c = pg_comp_crc32c_sse42;
+	return pg_comp_crc32c(crc, data, len);
+#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
+	if (pg_crc32c_sse42_available()) {
+		pg_comp_crc32c = pg_comp_crc32c_sse42;
+		return pg_comp_crc32c(crc, data, len);
+	}
+#endif
+	pg_comp_crc32c = pg_comp_crc32c_sb8;
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+
+#endif // x86/x86_64
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 260aa60502..b2872fa708 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -11,6 +11,9 @@
  *-------------------------------------------------------------------------
  */
 #include "c.h"
+#include "pg_cpu.h"
+
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
 
 #if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
 #include <cpuid.h>
@@ -135,9 +138,60 @@ bool PGDLLIMPORT pg_popcount_available(void)
 	return is_bit_set_in_exx(exx, ECX, 23);
  }
 
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline static bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline static bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, ECX, 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline static bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 31); /* avx512-vl */
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline static bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set_in_exx(exx, ECX, 20); /* sse4.2 */
+}
+
+/****************************************************************************/
+/*                               Public API                                 */
+/****************************************************************************/
  /*
-  * Returns true if the CPU supports the instructions required for the AVX-512
-  * pg_popcount() implementation.
+  * Returns true if the CPU supports the instructions required for the
+  * AVX-512 pg_popcount() implementation.
   *
   * PA: The call to 'osxsave_available' MUST preceed the call to
   *     'zmm_regs_available' function per NB above.
@@ -154,10 +208,19 @@ bool PGDLLIMPORT pg_popcount_avx512_available(void)
  */
 bool PGDLLIMPORT pg_crc32c_sse42_available(void)
 {
-	exx_t exx[4] = {0, 0, 0, 0};
-
-	pg_getcpuid(1, exx);
+	return sse42_available();
+}
 
-	return is_bit_set_in_exx(exx, ECX, 20);
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+bool PGDLLIMPORT
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
 }
 
+#endif // #if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
-- 
2.43.0

#49Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: John Naylor (#47)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Sorry for going back so far, but this thread was pointed out to me, and this aspect
of the design could use some more discussion:

+ * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
+ * buffer length must be at least 256, and a multiple of 64. Based

There is another technique that computes CRC on 3 separate chunks and
combines them at the end, so about 3x faster on large-enough chunks.
That's the way used for the Arm proposal [0], coincidentally also citing a white
paper from Intel, but as Dimitry pointed out in that thread, its link has apparently
disappeared. Raghuveer, do you know about this, and is there another link
available?

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

I am not aware of this paper. Let me poke a few people internally and get back to you on this.

The cut off point in one implementation is only 144 bytes [1], which is maybe not
as small as we'd like, but is quite a bit smaller than 256. That seems better suited
to our workloads, and more portable. I have a *brand-new* laptop with an Intel
chip, and IIUC it doesn't support AVX-512 because it uses a big-little architecture.
I also understand that Sierra Forest (a server product line) will be all little cores
with no AVX-512 support, so I'm not sure why the proposal here requires AVX-512.

AVX-512 is present on all of Intel's main P-core based Xeons and on AMD's Zen4 and Zen5. Sierra Forest contains the SSE and AVX/AVX2 family ISA, but AFAIK AVX/AVX2 does not contain any CRC32C-specific instructions. See:

1) https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=pclmul&ig_expand=754&techs=AVX_ALL
2) https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=754&techs=AVX_ALL&text=crc32

There's a very frequent call computing COMP_CRC32C over just 20 bytes,
while holding a crucial lock. If we were to introduce something like this
AVX-512 algorithm, it'd probably be worth dispatching differently in
case of compile-time known small lengths.

I know you've read an earlier version of the patch and realized that it wouldn't
help here, but we could probably dispatch differently regardless, although it may
only be worth it if we can inline the instructions. Since we technically only need to
wait for xl_prev, I believe we could push the computation of the other 12 bytes to
before acquiring the lock, then only execute a single instruction on xl_prev to
complete the CRC computation. Is there any reason why we couldn't do that,
assuming we have a clean way to make that portable? That would mean that the
CRCs between major versions would be different, but I think we don't guarantee
that anyway.

Not sure about that. This is not my expertise and I might need a little time to figure this out. Unfortunately, I am on travel with limited internet connection for the next 6 weeks. I will only be able to address this when I get back. Is this a blocker for the patch or is this something we can address as a revision?

Raghuveer

#50John Naylor
johncnaylorls@gmail.com
In reply to: Devulapalli, Raghuveer (#49)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Sat, Dec 7, 2024 at 10:16 PM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:

There is another technique that computes CRC on 3 separate chunks and
combines them at the end, so about 3x faster on large-enough chunks.
That's the way used for the Arm proposal [0], coincidentally also citing a white
paper from Intel, but as Dimitry pointed out in that thread, its link has apparently
disappeared. Raghuveer, do you know about this, and is there another link
available?

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

I am not aware of this paper. Let me poke a few people internally and get back to you on this.

Thanks! I have a portable PoC of how this works, but I'll save that
for another thread, since it's not Intel (or Arm) specific.
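
For readers following along, the chunk-and-combine idea can be sketched in portable C. This is a minimal illustration, not the PoC mentioned above and not the code from the Arm proposal: it splits the input into three chunks, computes each chunk's raw CRC independently (which is what allows the hardware CRC instructions to overlap), and then merges the results. The function names `crc32c_update`, `crc32c_shift` and `crc32c_three_way` are hypothetical, and the "shift" here is done naively by feeding zero bytes, where a real implementation would presumably use precomputed constants and carry-less multiplication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the CRC-32C (Castagnoli) polynomial. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Bit-at-a-time update of a raw CRC state (no init, no final XOR). */
static uint32_t
crc32c_update(uint32_t state, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--)
	{
		state ^= *p++;
		for (int k = 0; k < 8; k++)
			state = (state >> 1) ^ (CRC32C_POLY_REFLECTED & (0u - (state & 1u)));
	}
	return state;
}

/* Advance a raw state across n zero bytes (naive stand-in for the combine shift). */
static uint32_t
crc32c_shift(uint32_t state, size_t n)
{
	static const uint8_t zero = 0;

	while (n--)
		state = crc32c_update(state, &zero, 1);
	return state;
}

/*
 * Compute the same raw state as crc32c_update(crc, data, len), but from
 * three independently processed chunks. The update is linear over GF(2),
 * so update(s, D) == shift(s, |D|) ^ update(0, D).
 */
static uint32_t
crc32c_three_way(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;
	size_t		la = len / 3;
	size_t		lb = len / 3;
	size_t		lc = len - la - lb;
	uint32_t	a = crc32c_update(crc, p, la);	/* carries the incoming state */
	uint32_t	b = crc32c_update(0, p + la, lb);
	uint32_t	c = crc32c_update(0, p + la + lb, lc);

	assert(la + lb + lc == len);
	return crc32c_shift(a, lb + lc) ^ crc32c_shift(b, lc) ^ c;
}
```

Called with the usual initial value 0xFFFFFFFF and followed by the final XOR, `crc32c_three_way` over the ASCII string "123456789" should give the standard CRC-32C check value 0xE3069283, the same as the serial loop.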

The cut off point in one implementation is only 144 bytes [1], which is maybe not
as small as we'd like, but is quite a bit smaller than 256. That seems better suited
to our workloads, and more portable. I have a *brand-new* laptop with an Intel
chip, and IIUC it doesn't support AVX-512 because it uses a big-little architecture.
I also understand that Sierra Forest (a server product line) will be all little cores
with no AVX-512 support, so I'm not sure why the proposal here requires AVX-512.

AVX-512 is present on all of Intel's main P-core based Xeons and on AMD's Zen4 and Zen5. Sierra Forest contains the SSE and AVX/AVX2 family ISA, but AFAIK AVX/AVX2 does not contain any CRC32C-specific instructions. See:

CRC32C was added in SSE 4.2, so it's quite old. The AVX-512 intrinsics
used in the patch are not CRC-specific, if I understand correctly.

My point was, it seems Intel still considers AVX-512 as optional, so
we can't count on it being present even in future chips. That's why
I'm interested in alternatives, at least as a first step. If we can
get 3x throughput, the calculation might end up low enough in the
profile that going to 6x might not be noticeable (not sure).

There's a very frequent call computing COMP_CRC32C over just 20 bytes,
while holding a crucial lock. If we were to introduce something like this
AVX-512 algorithm, it'd probably be worth dispatching differently in
case of compile-time known small lengths.

I know you've read an earlier version of the patch and realized that it wouldn't
help here, but we could probably dispatch differently regardless, although it may
only be worth it if we can inline the instructions. Since we technically only need to
wait for xl_prev, I believe we could push the computation of the other 12 bytes to
before acquiring the lock, then only execute a single instruction on xl_prev to
complete the CRC computation. Is there any reason why we couldn't do that,
assuming we have a clean way to make that portable? That would mean that the
CRCs between major versions would be different, but I think we don't guarantee
that anyway.

Not sure about that. This is not my expertise and I might need a little time to figure this out. Unfortunately, I am on travel with limited internet connection for the next 6 weeks. I will only be able to address this when I get back. Is this a blocker for the patch or is this something we can address as a revision?

This is orthogonal and is not related to the patch, since it doesn't
affect 8 and 20-byte paths, only 256 and greater.
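
To make the xl_prev idea quoted above concrete, here is a minimal sketch of splitting a 20-byte CRC into an early part and a late part. It assumes a hypothetical layout in which the 12 bytes known before the lock are fed first and the 8-byte late field is fed last; that is not the byte order PostgreSQL uses today, which is why the resulting CRCs would differ between versions. The names `sketch_crc_early` and `sketch_crc_finish` are made up for illustration, and the bit-at-a-time loop stands in for the hardware instruction (on x86-64 with SSE 4.2 the finishing step over 8 bytes would correspond to a single `_mm_crc32_u64`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reflected form of the CRC-32C (Castagnoli) polynomial. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Bit-at-a-time update of a raw CRC state (no init, no final XOR). */
static uint32_t
crc32c_update(uint32_t state, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--)
	{
		state ^= *p++;
		for (int k = 0; k < 8; k++)
			state = (state >> 1) ^ (CRC32C_POLY_REFLECTED & (0u - (state & 1u)));
	}
	return state;
}

/* Step 1, before taking the lock: fold in the 12 bytes already known. */
static uint32_t
sketch_crc_early(const uint8_t early[12])
{
	return crc32c_update(0xFFFFFFFFu, early, 12);
}

/* Step 2, under the lock: fold in the 8-byte late field and finalize. */
static uint32_t
sketch_crc_finish(uint32_t partial, uint64_t late)
{
	uint8_t		bytes[8];

	/* Use the in-memory representation, as a one-shot CRC over the buffer would. */
	memcpy(bytes, &late, sizeof(bytes));
	return crc32c_update(partial, bytes, sizeof(bytes)) ^ 0xFFFFFFFFu;
}
```

Because the CRC state is carried from one call to the next, the two-step result equals a one-shot CRC over the same 20 bytes in the same order; only the work done while holding the lock shrinks.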

--
John Naylor
Amazon Web Services

#51John Naylor
johncnaylorls@gmail.com
In reply to: Bruce Momjian (#15)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

+ * For This Function:
+ * Copyright 2015 The Chromium Authors

I went and looked at the Chromium source, and found the following
snippet that uses the same technique, but only requires 128-bit CLMUL
and has a minimum input size of 64 bytes, rather than 256. This seems
like it might be better suited for shorter inputs. Also seems much
easier than trying to get the AVX-512 hippo to dance. It uses the IEEE
polynomial, so would need new constants calculated for ours, but that
had to be done for the shared patch, too.

https://github.com/chromium/chromium/blob/main/third_party/zlib/crc32_simd.c#L215

--
John Naylor
Amazon Web Services

#52Andres Freund
andres@anarazel.de
In reply to: John Naylor (#51)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi,

On 2024-12-12 18:32:20 +0700, John Naylor wrote:

I went and looked at the Chromium source, and found the following
snippet that uses the same technique, but only requires 128-bit CLMUL
and has a minimum input size of 64 bytes, rather than 256. This seems
like it might be better suited for shorter inputs. Also seems much
easier than trying to get the AVX-512 hippo to dance. It uses the IEEE
polynomial, so would need new constants calculated for ours, but that
had to be done for the shared patch, too.

Frankly, we should just move away from using CRCs. They're good for cases
where short runs of bit flips are much more likely than other kinds of errors
and where the amount of data covered by them has a low upper bound. That's not
at all the case for WAL records. It'd not matter too much if CRCs were cheap
to compute - but they aren't. We should instead move to some more generic
hashing algorithm, decent ones are much faster.

Greetings,

Andres

#53Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#52)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Thu, Dec 12, 2024 at 10:45:29AM -0500, Andres Freund wrote:

Frankly, we should just move away from using CRCs. They're good for cases
where short runs of bit flips are much more likely than other kinds of errors
and where the amount of data covered by them has a low upper bound. That's not
at all the case for WAL records. It'd not matter too much if CRCs were cheap
to compute - but they aren't. We should instead move to some more generic
hashing algorithm, decent ones are much faster.

Upthread [0], I wondered aloud about trying to reuse the page checksum code
for this. IIRC there was a lot of focus on performance when that was
added, and IME it catches problems decently well.

[0]: /messages/by-id/ZrUcX2kq-0doNBea@nathan

--
nathan

#54Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#53)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Fri, 13 Dec 2024 at 00:14, Nathan Bossart <nathandbossart@gmail.com> wrote:

On Thu, Dec 12, 2024 at 10:45:29AM -0500, Andres Freund wrote:

Frankly, we should just move away from using CRCs. They're good for cases
where short runs of bit flips are much more likely than other kinds of errors
and where the amount of data covered by them has a low upper bound. That's not
at all the case for WAL records. It'd not matter too much if CRCs were cheap
to compute - but they aren't. We should instead move to some more generic
hashing algorithm, decent ones are much faster.

Upthread [0], I wondered aloud about trying to reuse the page checksum code
for this. IIRC there was a lot of focus on performance when that was
added, and IME it catches problems decently well.

[0] /messages/by-id/ZrUcX2kq-0doNBea@nathan

It was carefully built to allow compiler auto-vectorization for power
of 2 block sizes to run fast on any CPU that has fast vectorized 32
bit multiplication instructions.

Performance is great: compiled with -march=native it gets 15.8
bytes/cycle on Zen 3, compared to 19.5 for t1ha0_aes_avx2, 7.9 for
aes-ni hash, and 2.15 for fasthash32. However, it isn't particularly
good for small (<1K) blocks, for both hash quality and performance
reasons.

One idea would be to use fasthash for short lengths and an extended
version of the page checksum for larger values. But before committing
to that approach, I think revisiting the quality of the page checksum
algorithm is due. Quality and robustness were not the highest
priorities when developing it.

--
Ants Aasma
Lead Database Consultant
www.cybertec-postgresql.com

#55John Naylor
johncnaylorls@gmail.com
In reply to: Andres Freund (#7)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Thu, Jun 13, 2024 at 2:37 AM Andres Freund <andres@anarazel.de> wrote:

It's hard to understand, but a nonetheless helpful page is
https://users.ece.cmu.edu/~koopman/crc/crc32.html which lists properties for
crc32c:
https://users.ece.cmu.edu/~koopman/crc/c32/0x8f6e37a0_len.txt
which lists
(0x8f6e37a0; 0x11edc6f41) <=> (0x82f63b78; 0x105ec76f1) {2147483615,2147483615,5243,5243,177,177,47,47,20,20,8,8,6,6,1,1} | gold | (*op) iSCSI; CRC-32C; CRC-32/4

This cryptic notation AFAIU indicates that for our polynomial we can detect 2-bit
errors up to a length of 2147483615 bytes, 3 bit errors up to 2147483615, 3
and 4 bit errors up to 5243, 5 and 6 bit errors up to 177, 7/8 bit errors up
to 47.

One aspect of that cryptic notation that you seemed to have missed is
"(*op)" -- explained as:

*p - primitive polynomial. This has optimal length for HD=3, and good
HD=2 performance above that length.
*o - odd bit errors detected. This has a factor of (x+1) and detects
all odd bit errors (implying that even number of bit errors have an
elevated undetected error rate)
*op - odd bit errors detected plus primitive. This is a primitive
polynomial times (x+1). It has optimal length for HD=4, and detects
all odd bit errors.

This means it's not really a 32-bit checksum -- it's a 1-bit checksum
plus a 31-bit checksum. The 1-bit checksum can detect any odd number
of bit-flips. Do we really want to throw that property away?

Sure, for an even number of bit flips beyond a small number, we're left
with the luck of ordinary collisions, and CRC is not particularly great,
but for two messages of the same length, I'm also not sure it's all
that bad, either.

--
John Naylor
Amazon Web Services

#56Andres Freund
andres@anarazel.de
In reply to: John Naylor (#55)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi,

On 2024-12-14 12:08:57 +0700, John Naylor wrote:

On Thu, Jun 13, 2024 at 2:37 AM Andres Freund <andres@anarazel.de> wrote:

It's hard to understand, but a nonetheless helpful page is
https://users.ece.cmu.edu/~koopman/crc/crc32.html which lists properties for
crc32c:
https://users.ece.cmu.edu/~koopman/crc/c32/0x8f6e37a0_len.txt
which lists
(0x8f6e37a0; 0x11edc6f41) <=> (0x82f63b78; 0x105ec76f1) {2147483615,2147483615,5243,5243,177,177,47,47,20,20,8,8,6,6,1,1} | gold | (*op) iSCSI; CRC-32C; CRC-32/4

This cryptic notion AFAIU indicates that for our polynomial we can detect 2bit
errors up to a length of 2147483615 bytes, 3 bit errors up to 2147483615, 3
and 4 bit errors up to 5243, 5 and 6 bit errors up to 177, 7/8 bit errors up
to 47.

One aspect of that cryptic notation that you seemed to have missed is
"(*op)" -- explained as:

*p - primitive polynomial. This has optimal length for HD=3, and good
HD=2 performance above that length.
*o - odd bit errors detected. This has a factor of (x+1) and detects
all odd bit errors (implying that even number of bit errors have an
elevated undetected error rate)
*op - odd bit errors detected plus primitive. This is a primitive
polynomial times (x+1). It has optimal length for HD=4, and detects
all odd bit errors.

This means it's not really a 32-bit checksum -- it's a 1-bit checksum
plus a 31-bit checksum. The 1-bit checksum can detect any odd number
of bit-flips. Do we really want to throw that property away?

I think it's pretty much irrelevant for our usecase.

What the WAL checksum needs to protect against are cases like a record
spanning >1 disk sectors or >1 OS pages and one of those sectors/pages not
having made it to disk, while the rest has made it (and thus shows old
contents).

That means we have to detect runs of "wrong content" that are *never* in the
single bit range (since sector boundaries never fall within a bit), *never*
within a 4 byte range (because that's what we IIRC align records to, and
again, sector boundaries don't fall within aligned 4 byte quantities).

Because the likely causes of failure are parts of the correct record and then
a tail or an intermittent long chunk (>= 1 sector) of wrong content, detecting
certain number of bit flips just doesn't help.

Bit flips are an important thing to detect and correct when they are something
that can happen in isolation. E.g. a bunch of interference in an ethernet
cable. Or the charge in an individual flash cell being a tiny bit above/below
some threshold. But that's just not what we have with WAL.

It's also worth noting that just about *all* permanent storage already has
applied sector-level checksums, protecting against (and correcting) bit flips
at that level.

Sure, for an even number of bitflips beyond a small number, we're left
with the luck of ordinary collisions, and CRC is not particularly great,

I.e. just about *all* failure scenarios for WAL.

but for two messages of the same length, I'm also not sure it's all
that bad, either

Our records rarely have the same length, no?

Greetings,

Andres Freund

#57John Naylor
johncnaylorls@gmail.com
In reply to: Andres Freund (#56)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Sat, Dec 14, 2024 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2024-12-14 12:08:57 +0700, John Naylor wrote:

On Thu, Jun 13, 2024 at 2:37 AM Andres Freund <andres@anarazel.de> wrote:

It's hard to understand, but a nonetheless helpful page is
https://users.ece.cmu.edu/~koopman/crc/crc32.html which lists properties for
crc32c:
https://users.ece.cmu.edu/~koopman/crc/c32/0x8f6e37a0_len.txt
which lists
(0x8f6e37a0; 0x11edc6f41) <=> (0x82f63b78; 0x105ec76f1) {2147483615,2147483615,5243,5243,177,177,47,47,20,20,8,8,6,6,1,1} | gold | (*op) iSCSI; CRC-32C; CRC-32/4

This cryptic notion AFAIU indicates that for our polynomial we can detect 2bit
errors up to a length of 2147483615 bytes, 3 bit errors up to 2147483615, 3
and 4 bit errors up to 5243, 5 and 6 bit errors up to 177, 7/8 bit errors up
to 47.

One aspect of that cryptic notation that you seemed to have missed is
"(*op)" -- explained as:

*p - primitive polynomial. This has optimal length for HD=3, and good
HD=2 performance above that length.
*o - odd bit errors detected. This has a factor of (x+1) and detects
all odd bit errors (implying that even number of bit errors have an
elevated undetected error rate)
*op - odd bit errors detected plus primitive. This is a primitive
polynomial times (x+1). It has optimal length for HD=4, and detects
all odd bit errors.

This means it's not really a 32-bit checksum -- it's a 1-bit checksum
plus a 31-bit checksum. The 1-bit checksum can detect any odd number
of bit-flips. Do we really want to throw that property away?

I think it's pretty much irrelevant for our usecase.

What the WAL checksum needs to protect against are cases like a record
spanning >1 disk sectors or >1 OS pages and one of those sectors/pages not
having made it to disk, while the rest has made it (and thus shows old
contents).

That means we have to detect runs of "wrong content" that are *never* in the
single bit range (since sector boundaries never fall within a bit), *never*
within a 4 byte range (because that's what we IIRC align records to, and
again, sector boundaries don't fall within aligned 4 byte quantities).

Because the likely causes of failure are parts of the correct record and then
a tail or an intermittent long chunk (>= 1 sector) of wrong content, detecting
certain number of bit flips just doesn't help.

Granted, but my point was, if a sector of wrong content is wrong by an
odd number of bits, the 1-bit part of the checksum will always catch
it. Every bit flip causes the popcount of the result to flip from even
to odd (or vice versa), so the odd case can never collide:

--original
select crc32c(repeat('A', 512)::bytea);
crc32c
------------
3817965270

select bit_count(b'11100011100100011000011011010110') % 2;
?column?
----------
0

--odd number of bitflips
select crc32c(('A' || repeat('C', 511))::bytea);
crc32c
-----------
113262028

select bit_count(b'110110000000011110111001100') % 2;
?column?
----------
1

--even number of bitflips
select crc32c(('A' || repeat('B', 511))::bytea);
crc32c
------------
1953030209

select bit_count(b'1110100011010001110000001000001') % 2;
?column?
----------
0
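
The same check can be reproduced outside the database. What follows is a minimal C sketch, assuming a plain bitwise CRC-32C (the reflected polynomial 0x82f63b78 from the listing quoted above, with the usual all-ones initial value and final inversion). The function names are made up for illustration and this is not the PostgreSQL implementation. For two inputs of equal length, the parity of the CRC's popcount changes exactly when an odd number of input bits differ.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bit-at-a-time CRC-32C (reflected polynomial 0x82F63B78, initial value and
 * final XOR of 0xFFFFFFFF).  Slow, but short enough to check by eye.
 * Hypothetical helper, not taken from the patch set.
 */
static uint32_t
crc32c_bitwise(const unsigned char *buf, size_t len)
{
	uint32_t	crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++)
	{
		crc ^= buf[i];
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/* Parity of the number of set bits: 1 if odd, 0 if even. */
static int
parity32(uint32_t v)
{
	v ^= v >> 16;
	v ^= v >> 8;
	v ^= v >> 4;
	v ^= v >> 2;
	v ^= v >> 1;
	return (int) (v & 1u);
}
```

With the buffers from the SQL examples, parity32(crc32c_bitwise(...)) should differ between 512 x 'A' and 'A' followed by 511 x 'C' (511 flipped bits), and agree between 512 x 'A' and 'A' followed by 511 x 'B' (1022 flipped bits).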

If the number of bitflips is even, then the 1-bit part will tell us
nothing, and the guarantees of the 31-bit part will not help the WAL
case for the reasons you describe. So as I understand it the trade-off
for WAL error detection is:

CRC
odd: 100%
even: the collision-avoidance probability of a mediocre hash function

good hash function:
odd: the collision-avoidance probability of a good hash function
even: the collision-avoidance probability of a good hash function

Stated this way, it's possible we don't have the best solution, but
it's also not immediately obvious to me that the second way is so much
better that it's worth the effort to change it.

If we did go to a hash function, it'd be ideal to have the collision
guarantees of an "almost universal" hash function. For any two
messages of length at most 'n', the claimed probability of collision
is at most, for example:

VHASH [1]: n * 2**-61
CLHASH [1]: 2.0004 * 2**-64 (for same length strings)
umash [2]: ceil(n / 4096) * 2**-55
polymur hash [3]: n * 2**-60.2

...but these are all 64-bit hashes, and some have further traits that
make them impractical for us. I'm not aware of any 32-bit universal
hashes. If there were, the bound might be

n * 2** -(31 or less?)

...which for n=8192 and larger, is starting not to look as good. But
for a normal hash function, we only have statistical tests which are
only practical for small lengths.
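
To put that hypothetical 32-bit bound in numbers: for n = 8192, n * 2**-31 is 2**13 * 2**-31 = 2**-18, roughly 3.8e-6, against 2**-32 (roughly 2.3e-10) for an ideal 32-bit hash. The small C helper below just evaluates n * 2**-k; the function name is invented for illustration, and the exponent 31 is the guess from above, not a property of any known hash.

```c
#include <stdint.h>

/*
 * Upper bound n * 2**-k on the collision probability claimed by an
 * almost-universal hash for messages of at most n bytes.
 * Hypothetical helper for illustration; requires k < 64.
 */
static double
au_collision_bound(uint64_t n, unsigned k)
{
	return (double) n / (double) (UINT64_C(1) << k);
}
```

For comparison, au_collision_bound(8192, 61) evaluates the VHASH bound listed above for the same n, which is 2**-48.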

It's also worth noting that just about *all* permanent storage already has
applied sector-level checksums, protecting against (and correcting) bit flips
at that level.

Sure.

but for two messages of the same length, I'm also not sure it's all
that bad, either

Our records rarely have the same length, no?

Right, I failed to consider the case where the length is in the
garbled part of the message.

[1]: https://arxiv.org/pdf/1503.03465
[2]: https://github.com/backtrace-labs/umash
[3]: https://github.com/orlp/polymur-hash

--
John Naylor
Amazon Web Services

#58Sterrett, Matthew
matthewsterrett2@gmail.com
In reply to: Devulapalli, Raghuveer (#48)
4 attachment(s)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On 12/7/2024 12:42 AM, Devulapalli, Raghuveer wrote:

[0] https://cirrus-ci.com/task/6023394207989760
[1] https://cirrus-ci.com/task/5460444254568448
[2] https://cirrus-ci.com/task/6586344161411072

I was able to fix [0] and [1], but I can't think of why [2] fails. When I tried to reproduce this locally, I get a different unrelated error. Any idea why I am seeing this?

LINK : fatal error LNK1181: cannot open input file 'C:\Program Files\Git\nologo'

Commands: meson setup build && cd build && meson compile

Hello! I'm Matthew Sterrett and I'm a coworker of Raghuveer; he asked me
to look into the Windows build failures related to pg_comp_crc32c.

It seems that the only thing that was required to fix that is to mark
pg_comp_crc32c as PGDLLIMPORT, so I added a patch that does just that.
I'm new to working with mailing lists, so please tell me if I messed
anything up!

Matthew Sterrett

Attachments:

v10-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmark.patch (text/plain; charset=UTF-8)
From 74d085d44d41af8ffb01f7bf2377ac487c7d4cc1 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v10 1/4] Add a Postgres SQL function for crc32c benchmarking.

Add a drive_crc32c() function to use for benchmarking crc32c
computation. The function takes 2 arguments:

(1) count: num of times CRC32C is computed in a loop.
(2) num: #bytes in the buffer to calculate crc over.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_crc32c/Makefile         | 20 ++++++++
 src/test/modules/test_crc32c/meson.build      | 22 +++++++++
 .../modules/test_crc32c/test_crc32c--1.0.sql  |  1 +
 src/test/modules/test_crc32c/test_crc32c.c    | 47 +++++++++++++++++++
 .../modules/test_crc32c/test_crc32c.control   |  4 ++
 6 files changed, 95 insertions(+)
 create mode 100644 src/test/modules/test_crc32c/Makefile
 create mode 100644 src/test/modules/test_crc32c/meson.build
 create mode 100644 src/test/modules/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.c
 create mode 100644 src/test/modules/test_crc32c/test_crc32c.control

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c829b61953..68d8904dd0 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -15,6 +15,7 @@ subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
+subdir('test_crc32c')
 subdir('test_ddl_deparse')
 subdir('test_dsa')
 subdir('test_dsm_registry')
diff --git a/src/test/modules/test_crc32c/Makefile b/src/test/modules/test_crc32c/Makefile
new file mode 100644
index 0000000000..5b747c6184
--- /dev/null
+++ b/src/test/modules/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o:	CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_crc32c/meson.build b/src/test/modules/test_crc32c/meson.build
new file mode 100644
index 0000000000..7021a6d6cf
--- /dev/null
+++ b/src/test/modules/test_crc32c/meson.build
@@ -0,0 +1,22 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_crc32c_sources = files(
+  'test_crc32c.c',
+)
+
+if host_system == 'windows'
+  test_crc32c_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_crc32c',
+    '--FILEDESC', 'test_crc32c - test code for crc32c library',])
+endif
+
+test_crc32c = shared_module('test_crc32c',
+  test_crc32c_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_crc32c
+
+test_install_data += files(
+  'test_crc32c.control',
+  'test_crc32c--1.0.sql',
+)
diff --git a/src/test/modules/test_crc32c/test_crc32c--1.0.sql b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 0000000000..32f8f0fb2e
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c  (count int, num int) RETURNS bigint AS 'test_crc32c.so' LANGUAGE C;
diff --git a/src/test/modules/test_crc32c/test_crc32c.c b/src/test/modules/test_crc32c/test_crc32c.c
new file mode 100644
index 0000000000..b350caf5ce
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.c
@@ -0,0 +1,47 @@
+/* select drive_crc32c(1000000, 1024); */
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "common/pg_prng.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64			count	= PG_GETARG_INT32(0);	/* SQL type is int */
+	int64			num		= PG_GETARG_INT32(1);	/* SQL type is int */
+	char*		data	= malloc((size_t)num);
+	pg_crc32c crc;
+	pg_prng_state state;
+	uint64 seed = 42;
+	pg_prng_seed(&state, seed);
+	/* set random data */
+	for (uint64 i = 0; i < num; i++)
+	{
+		data[i] = pg_prng_uint32(&state) % 255;
+	}
+
+	INIT_CRC32C(crc);
+
+	while(count--)
+	{
+		INIT_CRC32C(crc);
+		COMP_CRC32C(crc, data, num);
+		FIN_CRC32C(crc);
+	}
+
+	free((void *)data);
+
+	PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/src/test/modules/test_crc32c/test_crc32c.control b/src/test/modules/test_crc32c/test_crc32c.control
new file mode 100644
index 0000000000..878a077ee1
--- /dev/null
+++ b/src/test/modules/test_crc32c/test_crc32c.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_crc32c'
+relocatable = true
-- 
2.34.1

v10-0002-Refactor-consolidate-x86-ISA-and-OS-runtime-chec.patch (text/plain; charset=UTF-8)
From 2542c6830d98e146d79844fb84fe3fb1b2945c25 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 23 Jul 2024 11:23:23 -0700
Subject: [PATCH v10 2/4] Refactor: consolidate x86 ISA and OS runtime checks

Move all x86 ISA and OS runtime checks into a single file for improved
modularity and easier future maintenance.

Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
Signed-off-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
---
 src/include/port/pg_bitutils.h      |   1 -
 src/include/port/pg_hw_feat_check.h |  33 ++++++
 src/port/Makefile                   |   1 +
 src/port/meson.build                |   3 +
 src/port/pg_bitutils.c              |  22 +---
 src/port/pg_crc32c_sse42_choose.c   |  21 +---
 src/port/pg_hw_feat_check.c         | 163 ++++++++++++++++++++++++++++
 src/port/pg_popcount_avx512.c       |  78 -------------
 8 files changed, 205 insertions(+), 117 deletions(-)
 create mode 100644 src/include/port/pg_hw_feat_check.h
 create mode 100644 src/port/pg_hw_feat_check.c

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a3cad46afe..461c7c13cf 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -312,7 +312,6 @@ extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int
  * files.
  */
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-extern bool pg_popcount_avx512_available(void);
 extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
new file mode 100644
index 0000000000..58be900b54
--- /dev/null
+++ b/src/include/port/pg_hw_feat_check.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.h
+ *	  Miscellaneous functions for checking for hardware features at runtime.
+ *
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/port/pg_hw_feat_check.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_HW_FEAT_CHECK_H
+#define PG_HW_FEAT_CHECK_H
+
+/*
+ * Test to see if all hardware features required by SSE 4.2 crc32c (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_sse42_available(void);
+
+/*
+ * Test to see if all hardware features required by POPCNT (64 bit)
+ * are available.
+ */
+extern PGDLLIMPORT bool pg_popcount_available(void);
+
+/*
+ * Test to see if all hardware features required by AVX-512 POPCNT are
+ * available.
+ */
+extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+#endif							/* PG_HW_FEAT_CHECK_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..6088b56b71 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_popcount_avx512.o \
+	pg_hw_feat_check.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index c5bceed9cd..ec28590473 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,9 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
+  'pg_crc32c_sse42_choose.c',
+  'pg_crc32c_sse42.c',
+  'pg_hw_feat_check.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index c8399981ee..c11b13dca2 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -20,7 +20,7 @@
 #endif
 
 #include "port/pg_bitutils.h"
-
+#include "port/pg_hw_feat_check.h"
 
 /*
  * Array giving the position of the left-most set bit for each possible
@@ -109,7 +109,6 @@ static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
 #ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
 static uint64 pg_popcount_choose(const char *buf, int bytes);
@@ -127,25 +126,6 @@ uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask)
 
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
-}
-
 /*
  * These functions get called on the first call to pg_popcount32 etc.
  * They detect whether we can use the asm implementations, and replace
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 56d600f3a9..c659917af0 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,6 +20,7 @@
 
 #include "c.h"
 
+#if defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
 #ifdef HAVE__GET_CPUID
 #include <cpuid.h>
 #endif
@@ -29,22 +30,7 @@
 #endif
 
 #include "port/pg_crc32c.h"
-
-static bool
-pg_crc32c_sse42_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
-}
+#include "port/pg_hw_feat_check.h"
 
 /*
  * This gets called on the first call. It replaces the function pointer
@@ -61,4 +47,5 @@ pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
 	return pg_comp_crc32c(crc, data, len);
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+#endif
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
new file mode 100644
index 0000000000..260aa60502
--- /dev/null
+++ b/src/port/pg_hw_feat_check.c
@@ -0,0 +1,163 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_hw_feat_check.c
+ *		Test for hardware features at runtime on x86_64 platforms.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/port/pg_hw_feat_check.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#include <immintrin.h>
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_hw_feat_check.h"
+
+/* Define names for EXX registers to avoid hard to see bugs in code below. */
+typedef unsigned int exx_t;
+typedef enum
+{
+	EAX = 0,
+	EBX = 1,
+	ECX = 2,
+	EDX = 3
+} reg_name;
+
+/*
+ * Helper function.
+ * Test for a bit being set in a exx_t register.
+ */
+inline static bool is_bit_set_in_exx(exx_t* regs, reg_name ex, int bit)
+{
+	return ((regs[ex] & (1 << bit)) != 0);
+}
+
+/*
+ * x86_64 Platform CPUID check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuid(unsigned int leaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(leaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(exx, leaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * x86_64 Platform CPUIDEX check for Linux and Visual Studio platforms.
+ */
+inline static void
+pg_getcpuidex(unsigned int leaf, unsigned int subleaf, exx_t *exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+	__get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+	__cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
+
+/*
+ * Check for CPU support for CPUID: osxsave
+ */
+inline static bool
+osxsave_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 27); /* osxsave */
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that osxsave_available() returns true
+ * before calling this.
+ */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
+inline static bool
+zmm_regs_available(void)
+{
+#if defined(HAVE_XSAVE_INTRINSICS)
+	return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+	return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+inline static bool
+avx512_popcnt_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 14) && is_bit_set_in_exx(exx, EBX, 30);
+}
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+bool PGDLLIMPORT pg_popcount_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 23);
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ *
+ * PA: The call to 'osxsave_available' MUST precede the call to
+ *     'zmm_regs_available' function per NB above.
+ */
+bool PGDLLIMPORT pg_popcount_avx512_available(void)
+{
+	 return osxsave_available() &&
+			zmm_regs_available() &&
+			avx512_popcnt_available();
+}
+
+/*
+ * Does CPUID say there's support for SSE 4.2?
+ */
+bool PGDLLIMPORT pg_crc32c_sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+
+	return is_bit_set_in_exx(exx, ECX, 20);
+}
+
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index c8a4f2b19f..1123a1a634 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -14,16 +14,7 @@
 
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
 #include <immintrin.h>
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
 #include "port/pg_bitutils.h"
 
 /*
@@ -33,75 +24,6 @@
  */
 #ifdef TRY_POPCNT_FAST
 
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
-	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
-	__cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 27)) != 0;	/* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-#ifdef HAVE_XSAVE_INTRINSICS
-pg_attribute_target("xsave")
-#endif
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
-	return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
-	return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
-	unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
-	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
-	__cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
-	return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
-		(exx[1] & (1 << 30)) != 0;	/* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
-	return xsave_available() &&
-		zmm_regs_available() &&
-		avx512_popcnt_available();
-}
-
 /*
  * pg_popcount_avx512
  *		Returns the number of 1-bits in buf
-- 
2.34.1
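
As a reading aid for the checks consolidated in this patch, here is a minimal standalone C sketch of the same CPUID/XGETBV sequence. The bit positions and the 0xe6 mask are the ones used in the patch; everything else is an assumption for illustration: it uses GCC/Clang's <cpuid.h> and inline assembly rather than the HAVE__* configure macros, the function names are invented, and non-x86 builds simply report false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

enum { REG_EAX = 0, REG_EBX = 1, REG_ECX = 2, REG_EDX = 3 };

/* Read one CPUID leaf/subleaf; registers stay zero if the leaf is unsupported. */
static void
cpuid_regs(unsigned int leaf, unsigned int subleaf, unsigned int r[4])
{
	r[0] = r[1] = r[2] = r[3] = 0;
	__get_cpuid_count(leaf, subleaf, &r[0], &r[1], &r[2], &r[3]);
}

static bool
reg_bit(const unsigned int r[4], int reg, int bit)
{
	return (r[reg] & (1u << bit)) != 0;
}

/* Read XCR0.  Only legal once CPUID.1:ECX bit 27 (OSXSAVE) is known to be set. */
static uint64_t
read_xcr0(void)
{
	uint32_t	lo, hi;

	__asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	return ((uint64_t) hi << 32) | lo;
}

/* CPUID leaf 1, ECX bit 20: SSE 4.2 */
bool
sketch_sse42_available(void)
{
	unsigned int r[4];

	cpuid_regs(1, 0, r);
	return reg_bit(r, REG_ECX, 20);
}

/* OSXSAVE first, then ZMM state enabled in XCR0, then the AVX-512 bits. */
bool
sketch_avx512_popcnt_available(void)
{
	unsigned int r[4];

	cpuid_regs(1, 0, r);
	if (!reg_bit(r, REG_ECX, 27))			/* osxsave */
		return false;
	if ((read_xcr0() & 0xe6) != 0xe6)		/* ZMM registers enabled by OS */
		return false;
	cpuid_regs(7, 0, r);
	return reg_bit(r, REG_ECX, 14) &&		/* avx512-vpopcntdq */
		reg_bit(r, REG_EBX, 30);			/* avx512-bw */
}

#else							/* not x86: no CPUID, report unavailable */

bool
sketch_sse42_available(void)
{
	return false;
}

bool
sketch_avx512_popcnt_available(void)
{
	return false;
}

#endif
```

The order matters: XGETBV faults when the OS has not enabled XSAVE, so the OSXSAVE bit has to be tested before XCR0 is read. Note also that the leaf and subleaf are passed through to CPUID as arguments, so one helper serves both leaf 1 and leaf 7.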

v10-0003-Add-AVX-512-CRC32C-algorithm-with-a-runtime-chec.patch (text/plain; charset=UTF-8)
From f08e15c0834616c636d1cb949ed140926265847e Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Date: Thu, 21 Nov 2024 12:42:09 -0800
Subject: [PATCH v10 3/4] Add AVX-512 CRC32C algorithm with a runtime check

Adds pg_crc32c_avx512(): compute the crc32c of the buffer, where the
buffer length must be at least 256, and a multiple of 64. Based on:

"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
Instruction" V. Gopal, E. Ozturk, et al., 2009

Benchmark numbers comparing against the SSE4.2 CRC32C algorithm were
generated using the drive_crc32c() function added in
src/test/modules/test_crc32c/test_crc32c.c.

+------------------+-----------------+----------------+-----------------+-------+------+
| Rate in bytes/us |    SDP (SPR)    |      m6i       |       m7i       |       |      |
+------------------+--------+--------+-------+--------+--------+--------+ Multi-|      |
| higher is better | SSE42  | AVX512 | SSE42 | AVX512 | SSE42  | AVX512 | plier |  %   |
+==================+========+========+=======+========+========+========+=======+======+
| AVG Rate 64-8192 | 10,095 | 82,101 | 8,591 | 38,652 | 11,867 | 83,194 | 6.68  | 568% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+
| AVG Rate 64-255  |  9,034 |  9,136 | 7,619 |  7,437 |  9,030 |  9,293 | 1.01  |   1% |
+------------------+--------+--------+-------+--------+--------+--------+-------+------+

Co-authored-by: Paul Amonson <paul.d.amonson@intel.com>
---
 config/c-compiler.m4                |  32 +++++
 configure                           | 154 ++++++++++++---------
 configure.ac                        | 107 +++++++--------
 meson.build                         |  23 ++++
 src/include/pg_config.h.in          |   3 +
 src/include/pg_cpu.h                |  23 ++++
 src/include/port/pg_crc32c.h        |  55 +++-----
 src/include/port/pg_hw_feat_check.h |   6 +
 src/port/meson.build                |  10 +-
 src/port/pg_crc32c_avx512.c         | 203 ++++++++++++++++++++++++++++
 src/port/pg_crc32c_sse42.c          |   2 +
 src/port/pg_crc32c_sse42_choose.c   |  51 -------
 src/port/pg_crc32c_x86_choose.c     |  57 ++++++++
 src/port/pg_hw_feat_check.c         |  75 +++++++++-
 14 files changed, 578 insertions(+), 223 deletions(-)
 create mode 100644 src/include/pg_cpu.h
 create mode 100644 src/port/pg_crc32c_avx512.c
 delete mode 100644 src/port/pg_crc32c_sse42_choose.c
 create mode 100644 src/port/pg_crc32c_x86_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index e112fd45d4..e08de01739 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -578,6 +578,38 @@ undefine([Ac_cachevar])dnl
 ])# PGAC_SSE42_CRC32_INTRINSICS
 
 
+# PGAC_AVX512_CRC32_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86 CRC instructions added in AVX-512,
+# using intrinsics with function __attribute__((target("..."))):
+
+AC_DEFUN([PGAC_AVX512_CRC32_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_crc32_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_clmulepi64_epi128 with function attribute], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+    #include <stdint.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("avx512vl,vpclmulqdq")))
+    #endif
+    static int crc32_avx512_test(void)
+    {
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      int64_t val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0)); // 64-bit instruction
+      return (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+    }],
+  [return crc32_avx512_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_avx512_crc32_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_CRC32_INTRINSICS
+
+
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # ----------------------------
 # Check if the compiler supports the CRC32C instructions using the __crc32cb,
diff --git a/configure b/configure
index 518c33b73a..b03b928bfd 100755
--- a/configure
+++ b/configure
@@ -17159,7 +17159,7 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+# Check for Intel SSE 4.2 and AVX-512 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
 $as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32... " >&6; }
@@ -17203,6 +17203,52 @@ if test x"$pgac_cv_sse42_crc32_intrinsics" = x"yes"; then
 fi
 
 
+# Check if _mm512_clmulepi64_epi128 and _mm_xor_epi64 can be used with
+# the __attribute__((target("avx512vl,vpclmulqdq"))).
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_clmulepi64_epi128 with function attribute" >&5
+$as_echo_n "checking for _mm512_clmulepi64_epi128 with function attribute... " >&6; }
+if ${pgac_cv_avx512_crc32_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+    #include <stdint.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("avx512vl,vpclmulqdq")))
+    #endif
+    static int crc32_avx512_test(void)
+    {
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      int64_t val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0)); // 64-bit instruction
+      return (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+    }
+int
+main ()
+{
+return crc32_avx512_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx512_crc32_intrinsics=yes
+else
+  pgac_cv_avx512_crc32_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_crc32_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_crc32_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_crc32_intrinsics" = x"yes"; then
+  pgac_avx512_crc32_intrinsics=yes
+fi
+
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
@@ -17404,9 +17450,8 @@ fi
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
 # use the special CRC instructions for calculating CRC-32C. If we're not
 # targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, compile both implementations and select which one to use
-# at runtime, depending on whether SSE 4.2 is supported by the processor we're
-# running on.
+# the SSE/AVX-512 intrinsics, compile both implementations and select which one
+# to use at runtime, depending on runtime cpuid information.
 #
 # Similarly, if we are targeting an ARM processor that has the CRC
 # instructions that are part of the ARMv8 CRC Extension, use them. And if
@@ -17423,95 +17468,80 @@ fi
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
-  else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
-    else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
-      else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
-          else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
-          fi
-        fi
-      fi
-    fi
-  fi
-fi
 
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking which CRC-32C implementation to use" >&5
 $as_echo_n "checking which CRC-32C implementation to use... " >&6; }
-if test x"$USE_SSE42_CRC32C" = x"1"; then
+if test x"$host_cpu" = x"x86_64"; then
+    #x86 only:
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
 
 $as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
 
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
-$as_echo "SSE 4.2" >&6; }
-else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C baseline feature SSE 4.2" >&5
+$as_echo "CRC32C baseline feature SSE 4.2" >&6; }
+    else
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
 
 $as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
-$as_echo "SSE 4.2 with runtime check" >&6; }
-  else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
+          PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+          { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C SSE42 with runtime check" >&5
+$as_echo "CRC32C SSE42 with runtime check" >&6; }
+        fi
+    fi
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+
+$as_echo "#define USE_AVX512_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_avx512.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: CRC32C AVX-512 with runtime check" >&5
+$as_echo "CRC32C AVX-512 with runtime check" >&6; }
+    fi
+else
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
 
 $as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
 
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
 $as_echo "ARMv8 CRC instructions" >&6; }
-    else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+  else
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
 $as_echo "ARMv8 CRC instructions with runtime check" >&6; }
-      else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+    else
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
 
 $as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
 $as_echo "LoongArch CRCC instructions" >&6; }
-        else
+      else
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
 
 $as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
 
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
 $as_echo "slicing-by-8" >&6; }
-        fi
       fi
     fi
   fi
 fi
 
 
-
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
   if test x"$PREFERRED_SEMAPHORES" = x"NAMED_POSIX" ; then
diff --git a/configure.ac b/configure.ac
index 247ae97fa4..96a9c2db1f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2021,10 +2021,14 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
+# Check for Intel SSE 4.2 and AVX-512 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
 
+# Check if _mm512_clmulepi64_epi128 and _mm_xor_epi64 can be used with
+# the __attribute__((target("avx512vl,vpclmulqdq"))).
+PGAC_AVX512_CRC32_INTRINSICS([])
+
 # Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
 # define __SSE4_2__ in that case.
 AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
@@ -2060,9 +2064,8 @@ AC_SUBST(CFLAGS_CRC)
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
 # use the special CRC instructions for calculating CRC-32C. If we're not
 # targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, compile both implementations and select which one to use
-# at runtime, depending on whether SSE 4.2 is supported by the processor we're
-# running on.
+# the SSE/AVX-512 intrinsics, compile both implementations and select which one
+# to use at runtime, depending on runtime cpuid information.
 #
 # Similarly, if we are targeting an ARM processor that has the CRC
 # instructions that are part of the ARMv8 CRC Extension, use them. And if
@@ -2079,76 +2082,58 @@ AC_SUBST(CFLAGS_CRC)
 #
 # If we are targeting a LoongArch processor, CRC instructions are
 # always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
-  # Use Intel SSE 4.2 if available.
-  if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
-    USE_SSE42_CRC32C=1
-  else
-    # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
-    # the runtime check.
-    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-      USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+
+AC_MSG_CHECKING([which CRC-32C implementation to use])
+if test x"$host_cpu" = x"x86_64"; then
+    #x86 only:
+    PG_CRC32C_OBJS="pg_crc32c_sb8.o pg_crc32c_x86_choose.o"
+    if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
+      AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+      AC_MSG_RESULT(CRC32C baseline feature SSE 4.2)
     else
-      # Use ARM CRC Extension if available.
-      if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
-        USE_ARMV8_CRC32C=1
-      else
-        # ARM CRC Extension, with runtime check?
-        if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
-          USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
-        else
-          # LoongArch CRCC instructions.
-          if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
-            USE_LOONGARCH_CRC32C=1
-          else
-            # fall back to slicing-by-8 algorithm, which doesn't require any
-            # special CPU support.
-            USE_SLICING_BY_8_CRC32C=1
-          fi
+        if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+          AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+          PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_sse42.o"
+          AC_MSG_RESULT(CRC32C SSE42 with runtime check)
         fi
-      fi
     fi
-  fi
-fi
-
-# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
-AC_MSG_CHECKING([which CRC-32C implementation to use])
-if test x"$USE_SSE42_CRC32C" = x"1"; then
-  AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
-  PG_CRC32C_OBJS="pg_crc32c_sse42.o"
-  AC_MSG_RESULT(SSE 4.2)
+    if test x"$pgac_avx512_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+      AC_DEFINE(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel AVX-512 CRC instructions with a runtime check.])
+      PG_CRC32C_OBJS="$PG_CRC32C_OBJS pg_crc32c_avx512.o"
+      AC_MSG_RESULT(CRC32C AVX-512 with runtime check)
+    fi
 else
-  if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-    AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
-    PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
-    AC_MSG_RESULT(SSE 4.2 with runtime check)
+  # non x86 code:
+  # Use ARM CRC Extension if available.
+  if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
+    AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
+    PG_CRC32C_OBJS="pg_crc32c_armv8.o"
+    AC_MSG_RESULT(ARMv8 CRC instructions)
   else
-    if test x"$USE_ARMV8_CRC32C" = x"1"; then
-      AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
-      PG_CRC32C_OBJS="pg_crc32c_armv8.o"
-      AC_MSG_RESULT(ARMv8 CRC instructions)
+    # ARM CRC Extension, with runtime check?
+    if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+      AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+      PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+      AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
     else
-      if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
-        AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
-        PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
-        AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
+      # LoongArch CRCC instructions.
+      if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+        AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+        PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+        AC_MSG_RESULT(LoongArch CRCC instructions)
       else
-        if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
-          AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
-          PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
-          AC_MSG_RESULT(LoongArch CRCC instructions)
-        else
-          AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
-          PG_CRC32C_OBJS="pg_crc32c_sb8.o"
-          AC_MSG_RESULT(slicing-by-8)
-        fi
+        # fall back to slicing-by-8 algorithm, which doesn't require any
+        # special CPU support.
+        AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+        PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+        AC_MSG_RESULT(slicing-by-8)
       fi
     fi
   fi
 fi
 AC_SUBST(PG_CRC32C_OBJS)
 
-
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
   if test x"$PREFERRED_SEMAPHORES" = x"NAMED_POSIX" ; then
diff --git a/meson.build b/meson.build
index e5ce437a5c..5833661d71 100644
--- a/meson.build
+++ b/meson.build
@@ -2222,6 +2222,23 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
     have_optimized_crc = true
   else
 
+    avx512_crc_prog = '''
+#include <immintrin.h>
+#include <stdint.h>
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("avx512vl,vpclmulqdq")))
+#endif
+int main(void)
+{
+      __m512i x0 = _mm512_set1_epi32(0x1);
+      __m512i x1 = _mm512_set1_epi32(0x2);
+      __m512i x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // vpclmulqdq
+      __m128i a1 = _mm_xor_epi64(_mm512_castsi512_si128(x1), _mm512_castsi512_si128(x0)); //avx512vl
+      int64_t val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0)); // 64-bit instruction
+      return (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+}
+'''
+
     prog = '''
 #include <nmmintrin.h>
 
@@ -2252,6 +2269,12 @@ int main(void)
       cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
       have_optimized_crc = true
     endif
+    if cc.links(avx512_crc_prog,
+        name: 'AVX512 CRC32C with function attributes',
+        args: test_c_args)
+      cdata.set('USE_AVX512_CRC32C_WITH_RUNTIME_CHECK', 1)
+      have_optimized_crc = true
+    endif
 
   endif
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798ab..db40e6476d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -697,6 +697,9 @@
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use Intel AVX-512 CRC instructions with a runtime check. */
+#undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+
 /* Define to build with systemd support. (--with-systemd) */
 #undef USE_SYSTEMD
 
diff --git a/src/include/pg_cpu.h b/src/include/pg_cpu.h
new file mode 100644
index 0000000000..223994cb0d
--- /dev/null
+++ b/src/include/pg_cpu.h
@@ -0,0 +1,23 @@
+/*
+ * pg_cpu.h
+ *      Useful macros to determine CPU types
+ */
+
+#ifndef PG_CPU_H_
+#define PG_CPU_H_
+#if defined( __i386__ ) || defined(i386) || defined(_M_IX86)
+    /*
+     * __i386__ is defined by gcc and Intel compiler on Linux,
+     * _M_IX86 by VS compiler,
+     * i386 by Sun compilers on opensolaris at least
+     */
+    #define PG_CPU_X86
+#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64)
+    /*
+     * both __x86_64__ and __amd64__ are defined by gcc
+     * __x86_64 defined by sun compiler on opensolaris at least
+     * _M_AMD64 defined by MS compiler
+     */
+    #define PG_CPU_x86_64
+#endif
+#endif							/* PG_CPU_H_ */
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 63c8e3a00b..690273506b 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -34,58 +34,43 @@
 #define PG_CRC32C_H
 
 #include "port/pg_bswap.h"
+#include "pg_cpu.h"
 
 typedef uint32 pg_crc32c;
 
 /* The INIT and EQ macros are the same for all implementations. */
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
-
-#if defined(USE_SSE42_CRC32C)
-/* Use Intel SSE4.2 instructions. */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
 #define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
+/* x86 */
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* ARMV8 */
 #elif defined(USE_ARMV8_CRC32C)
-/* Use ARMv8 CRC Extension instructions. */
-
+extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
+/* ARMV8 with runtime check */
+#elif defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
+extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+#define COMP_CRC32C(crc, data, len) \
+	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
+/* LoongArch */
 #elif defined(USE_LOONGARCH_CRC32C)
-/* Use LoongArch CRCC instructions. */
-
+extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len)							\
 	((crc) = pg_comp_crc32c_loongarch((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK) || defined(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK)
-
-/*
- * Use Intel SSE 4.2 or ARMv8 instructions, but perform a runtime check first
- * to check that they are available.
- */
-#define COMP_CRC32C(crc, data, len) \
-	((crc) = pg_comp_crc32c((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
-
-#ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-#endif
-#ifdef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
-extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
-#endif
 
 #else
 /*
@@ -98,13 +83,11 @@ extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t le
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
 #ifdef WORDS_BIGENDIAN
+#undef FIN_CRC32C
 #define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
-#else
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 #endif
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-
 #endif
 
 #endif							/* PG_CRC32C_H */
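Reviewer note on the header change above: on x86, COMP_CRC32C now always goes through the pg_comp_crc32c function pointer. A minimal sketch of that self-replacing pointer pattern follows, assuming a three-way preference of AVX-512, then SSE 4.2, then slicing-by-8. Every demo_* name and the two probe flags are hypothetical stand-ins and not part of the patch; the arithmetic is a placeholder, not a CRC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t demo_crc32c;
typedef demo_crc32c (*demo_crc32c_fn) (demo_crc32c crc, const void *data, size_t len);

/* Hypothetical probe results; real code would query CPUID instead. */
static bool demo_have_avx512 = false;
static bool demo_have_sse42 = false;

/* Placeholder arithmetic, NOT a CRC: just enough to give the callees a body. */
static demo_crc32c
demo_mix(demo_crc32c crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len-- > 0)
		crc = (crc << 1) ^ *p++;
	return crc;
}

/* Stand-ins for the three implementations the chooser can select. */
static demo_crc32c
demo_sb8(demo_crc32c crc, const void *data, size_t len)
{
	return demo_mix(crc, data, len);
}

static demo_crc32c
demo_sse42(demo_crc32c crc, const void *data, size_t len)
{
	return demo_mix(crc, data, len);
}

static demo_crc32c
demo_avx512(demo_crc32c crc, const void *data, size_t len)
{
	return demo_mix(crc, data, len);
}

static demo_crc32c demo_choose(demo_crc32c crc, const void *data, size_t len);

/* Starts out pointing at the chooser; the first call swaps it. */
static demo_crc32c_fn demo_comp_crc32c = demo_choose;

static demo_crc32c
demo_choose(demo_crc32c crc, const void *data, size_t len)
{
	if (demo_have_avx512)
		demo_comp_crc32c = demo_avx512;
	else if (demo_have_sse42)
		demo_comp_crc32c = demo_sse42;
	else
		demo_comp_crc32c = demo_sb8;

	/* Forward the first call to whichever implementation was picked. */
	return demo_comp_crc32c(crc, data, len);
}
```

Only the first call pays for the probe; every later call is one indirect call to the selected implementation.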
diff --git a/src/include/port/pg_hw_feat_check.h b/src/include/port/pg_hw_feat_check.h
index 58be900b54..3a73014987 100644
--- a/src/include/port/pg_hw_feat_check.h
+++ b/src/include/port/pg_hw_feat_check.h
@@ -30,4 +30,10 @@ extern PGDLLIMPORT bool pg_popcount_available(void);
  * available.
  */
 extern PGDLLIMPORT bool pg_popcount_avx512_available(void);
+
+/*
+ * Test to see if all hardware features required by the AVX-512 SIMD
+ * algorithm are available.
+ */
+extern PGDLLIMPORT bool pg_crc32c_avx512_available(void);
 #endif							/* PG_HW_FEAT_CHECK_H */
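Reviewer note: the body of pg_crc32c_avx512_available() is not part of this hunk. As a rough sketch of what such a check would need to decode, the function below tests the CPUID leaf 7 bits for AVX512F, AVX512VL and VPCLMULQDQ, plus the XCR0 bits that show the OS saves ZMM state. The demo_* names are hypothetical, and the register values are passed in by the caller, who is assumed to have already confirmed OSXSAVE before reading XCR0.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0) feature bits, as documented in the Intel SDM. */
#define DEMO_EBX_AVX512F	(UINT32_C(1) << 16)
#define DEMO_EBX_AVX512VL	(UINT32_C(1) << 31)
#define DEMO_ECX_VPCLMULQDQ	(UINT32_C(1) << 10)

/* XCR0 bits 1-2 (SSE/AVX state) and 5-7 (opmask and ZMM state). */
#define DEMO_XCR0_ZMM_STATE	UINT64_C(0xe6)

/*
 * Hypothetical helper: returns true only if every feature the AVX-512
 * CRC-32C code path relies on is reported by the given register values.
 */
static bool
demo_crc32c_avx512_usable(uint32_t leaf7_ebx, uint32_t leaf7_ecx, uint64_t xcr0)
{
	const uint32_t need_ebx = DEMO_EBX_AVX512F | DEMO_EBX_AVX512VL;

	if ((xcr0 & DEMO_XCR0_ZMM_STATE) != DEMO_XCR0_ZMM_STATE)
		return false;			/* OS does not preserve ZMM registers */
	if ((leaf7_ebx & need_ebx) != need_ebx)
		return false;			/* AVX512F or AVX512VL missing */
	if ((leaf7_ecx & DEMO_ECX_VPCLMULQDQ) == 0)
		return false;			/* vector carry-less multiply missing */
	return true;
}
```

Keeping the bit decoding separate from the CPUID and XGETBV reads makes the logic testable on any host, including non-x86 build machines.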
diff --git a/src/port/meson.build b/src/port/meson.build
index ec28590473..0ba4a56194 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,8 +8,10 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
-  'pg_crc32c_sse42_choose.c',
+  'pg_crc32c_x86_choose.c',
+  'pg_crc32c_avx512.c',
   'pg_crc32c_sse42.c',
+  'pg_crc32c_sb8.c',
   'pg_hw_feat_check.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
@@ -83,12 +85,6 @@ endif
 # Replacement functionality to be built if corresponding configure symbol
 # is true
 replace_funcs_pos = [
-  # x86/x64
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
-  ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-  ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
-
   # arm / aarch64
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
diff --git a/src/port/pg_crc32c_avx512.c b/src/port/pg_crc32c_avx512.c
new file mode 100644
index 0000000000..ba4defcefd
--- /dev/null
+++ b/src/port/pg_crc32c_avx512.c
@@ -0,0 +1,203 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_avx512.c
+ *	  Compute CRC-32C checksum using Intel AVX-512 instructions.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#if defined(USE_AVX512_CRC32C_WITH_RUNTIME_CHECK)
+
+#include <immintrin.h>
+
+#include "port/pg_crc32c.h"
+
+
+/*******************************************************************
+ * pg_comp_crc32c_avx512(): compute the crc32c of the buffer. The AVX-512
+ * path needs at least 256 bytes and folds 64-byte blocks; any remaining
+ * bytes are passed to pg_comp_crc32c_sse42(). Based on:
+ *
+ * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction"
+ *  V. Gopal, E. Ozturk, et al., 2009
+ *
+ * For This Function:
+ * Copyright 2015 The Chromium Authors
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *    * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *    * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *    * Neither the name of Google LLC nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+
+pg_attribute_no_sanitize_alignment()
+pg_attribute_target("avx512vl,vpclmulqdq")
+inline pg_crc32c
+pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t length)
+{
+	static const uint64 k1k2[8] = {
+		0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4,
+		0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
+	static const uint64 k3k4[8] = {
+		0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02,
+		0x9e4addf8, 0x740eef02, 0x9e4addf8};
+	static const uint64 k9k10[8] = {
+		0x6992cea2, 0x0d3b6092, 0x6992cea2, 0x0d3b6092, 0x6992cea2,
+		0x0d3b6092, 0x6992cea2, 0x0d3b6092};
+	static const uint64 k1k4[8] = {
+		0x1c291d04, 0xddc0152b, 0x3da6d0cb, 0xba4fc28e, 0xf20c0dfe,
+		0x493c7d27, 0x00000000, 0x00000000};
+
+	const uint8 *input = (const uint8 *)data;
+	if (length >= 256)
+	{
+		uint64 val;
+		__m512i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		__m128i a1, a2;
+
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * >>> BEGIN
+		 */
+
+		/*
+		 * There's at least one block of 256.
+		 */
+		x1 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+		x2 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+		x3 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+		x4 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+		x1 = _mm512_xor_si512(x1, _mm512_castsi128_si512(_mm_cvtsi32_si128(crc)));
+
+		x0 = _mm512_loadu_si512((__m512i *)k1k2);
+
+		input += 256;
+		length -= 256;
+
+		/*
+		 * Parallel fold blocks of 256, if any.
+		 */
+		while (length >= 256)
+		{
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x6 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+			x7 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+			x8 = _mm512_clmulepi64_epi128(x4, x0, 0x00);
+
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x2 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+			x3 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+			x4 = _mm512_clmulepi64_epi128(x4, x0, 0x11);
+
+			y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
+			y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
+			y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
+			y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
+
+			x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
+			x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
+			x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
+			x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);
+
+			input += 256;
+			length -= 256;
+		}
+
+		/*
+		 * Fold 256 bytes into 64 bytes.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k9k10);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x6 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x3 = _mm512_ternarylogic_epi64(x3, x5, x6, 0x96);
+
+		x7 = _mm512_clmulepi64_epi128(x2, x0, 0x00);
+		x8 = _mm512_clmulepi64_epi128(x2, x0, 0x11);
+		x4 = _mm512_ternarylogic_epi64(x4, x7, x8, 0x96);
+
+		x0 = _mm512_loadu_si512((__m512i *)k3k4);
+		y5 = _mm512_clmulepi64_epi128(x3, x0, 0x00);
+		y6 = _mm512_clmulepi64_epi128(x3, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x4, y5, y6, 0x96);
+
+		/*
+		 * Single fold blocks of 64, if any.
+		 */
+		while (length >= 64)
+		{
+			x2 = _mm512_loadu_si512((__m512i *)input);
+
+			x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+			x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+			x1 = _mm512_ternarylogic_epi64(x1, x2, x5, 0x96);
+
+			input += 64;
+			length -= 64;
+		}
+
+		/*
+		 * Fold 512-bits to 128-bits.
+		 */
+		x0 = _mm512_loadu_si512((__m512i *)k1k4);
+
+		a2 = _mm512_extracti32x4_epi32(x1, 3);
+		x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
+		x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
+		x1 = _mm512_ternarylogic_epi64(x1, x5, _mm512_castsi128_si512(a2), 0x96);
+
+		x0 = _mm512_shuffle_i64x2(x1, x1, 0x4E);
+		x0 = _mm512_xor_epi64(x1, x0);
+		a1 = _mm512_extracti32x4_epi32(x0, 1);
+		a1 = _mm_xor_epi64(a1, _mm512_castsi512_si128(x0));
+
+		/*
+		 * Fold 128-bits to 32-bits.
+		 */
+		val = _mm_crc32_u64(0, _mm_extract_epi64(a1, 0));
+		crc = (uint32_t)_mm_crc32_u64(val, _mm_extract_epi64(a1, 1));
+		/*
+		 * AVX-512 Optimized crc32c algorithm with minimum of 256 bytes aligned
+		 * to 32 bytes.
+		 * <<< END
+		 ******************************************************************/
+	}
+
+	/*
+	 * Finish any remaining bytes with the existing SSE 4.2 algorithm.
+	 */
+	return pg_comp_crc32c_sse42(crc, input, length);
+}
+#endif							/* USE_AVX512_CRC32C_WITH_RUNTIME_CHECK */
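Reviewer note: one way to sanity-check the folding code above is to compare it against a slow, obviously correct reference. The sketch below is a portable bitwise CRC-32C using the same conventions as INIT_CRC32C and FIN_CRC32C (start at 0xFFFFFFFF, XOR with 0xFFFFFFFF at the end). The demo_* names are hypothetical and not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define DEMO_CRC32C_POLY UINT32_C(0x82F63B78)

/* Feed len bytes into a running (not yet finalized) CRC-32C value. */
static uint32_t
demo_crc32c_update(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len-- > 0)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
		{
			/* Shift right one bit; XOR in the polynomial if a 1 fell off. */
			uint32_t	mask = UINT32_C(0) - (crc & UINT32_C(1));

			crc = (crc >> 1) ^ (DEMO_CRC32C_POLY & mask);
		}
	}
	return crc;
}

/* One-shot helper: initialize, update, finalize. */
static uint32_t
demo_crc32c(const void *data, size_t len)
{
	return demo_crc32c_update(UINT32_C(0xFFFFFFFF), data, len) ^ UINT32_C(0xFFFFFFFF);
}
```

For the nine ASCII bytes "123456789" this yields the standard CRC-32C check value 0xE3069283. Buffers of 256 bytes and more, with lengths that are not a multiple of 64, would be the interesting cases to compare against pg_comp_crc32c_avx512(), since they exercise both the folding loops and the SSE 4.2 tail.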
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index dcc4904a82..90d155e804 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -14,6 +14,7 @@
  */
 #include "c.h"
 
+#if defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
 #include <nmmintrin.h>
 
 #include "port/pg_crc32c.h"
@@ -68,3 +69,4 @@ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
 
 	return crc;
 }
+#endif
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
deleted file mode 100644
index c659917af0..0000000000
--- a/src/port/pg_crc32c_sse42_choose.c
+++ /dev/null
@@ -1,51 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_crc32c_sse42_choose.c
- *	  Choose between Intel SSE 4.2 and software CRC-32C implementation.
- *
- * On first call, checks if the CPU we're running on supports Intel SSE
- * 4.2. If it does, use the special SSE instructions for CRC-32C
- * computation. Otherwise, fall back to the pure software implementation
- * (slicing-by-8).
- *
- * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- *	  src/port/pg_crc32c_sse42_choose.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "c.h"
-
-#if defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
-#include "port/pg_crc32c.h"
-#include "port/pg_hw_feat_check.h"
-
-/*
- * This gets called on the first call. It replaces the function pointer
- * so that subsequent calls are routed directly to the chosen implementation.
- */
-static pg_crc32c
-pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
-{
-	if (pg_crc32c_sse42_available())
-		pg_comp_crc32c = pg_comp_crc32c_sse42;
-	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
-	return pg_comp_crc32c(crc, data, len);
-}
-
-pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
-#endif
diff --git a/src/port/pg_crc32c_x86_choose.c b/src/port/pg_crc32c_x86_choose.c
new file mode 100644
index 0000000000..3ce8be11a6
--- /dev/null
+++ b/src/port/pg_crc32c_x86_choose.c
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_crc32c_x86_choose.c
+ *	  Choose between Intel AVX-512, SSE 4.2 and software CRC-32C implementation.
+ *
+ * On first call, checks if the CPU we're running on supports Intel AVX-512
+ * or SSE 4.2. If it does, use the fastest supported instructions for CRC-32C.
+ * Otherwise, fall back to the pure software implementation (slicing-by-8).
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_crc32c_x86_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+#include "pg_cpu.h"
+
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
+
+#include "port/pg_crc32c.h"
+#include "port/pg_hw_feat_check.h"
+
+/*
+ * This gets called on the first call. It replaces the function pointer
+ * so that subsequent calls are routed directly to the chosen implementation.
+ * (1) set pg_comp_crc32c pointer and (2) return the computed crc value
+ */
+static pg_crc32c
+pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
+{
+#ifdef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK
+	if (pg_crc32c_avx512_available()) {
+		pg_comp_crc32c = pg_comp_crc32c_avx512;
+		return pg_comp_crc32c(crc, data, len);
+	}
+#endif
+#ifdef USE_SSE42_CRC32C
+	pg_comp_crc32c = pg_comp_crc32c_sse42;
+	return pg_comp_crc32c(crc, data, len);
+#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
+	if (pg_crc32c_sse42_available()) {
+		pg_comp_crc32c = pg_comp_crc32c_sse42;
+		return pg_comp_crc32c(crc, data, len);
+	}
+#endif
+	pg_comp_crc32c = pg_comp_crc32c_sb8;
+	return pg_comp_crc32c(crc, data, len);
+}
+
+pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+
+#endif // x86/x86_64
diff --git a/src/port/pg_hw_feat_check.c b/src/port/pg_hw_feat_check.c
index 260aa60502..b2872fa708 100644
--- a/src/port/pg_hw_feat_check.c
+++ b/src/port/pg_hw_feat_check.c
@@ -11,6 +11,9 @@
  *-------------------------------------------------------------------------
  */
 #include "c.h"
+#include "pg_cpu.h"
+
+#if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
 
 #if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
 #include <cpuid.h>
@@ -135,9 +138,60 @@ bool PGDLLIMPORT pg_popcount_available(void)
 	return is_bit_set_in_exx(exx, ECX, 23);
  }
 
+/*
+ * Check for CPU support for CPUIDEX: avx512-f
+ */
+inline static bool
+avx512f_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 16); /* avx512-f */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: vpclmulqdq
+ */
+inline static bool
+vpclmulqdq_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, ECX, 10); /* vpclmulqdq */
+}
+
+/*
+ * Check for CPU support for CPUIDEX: avx512-vl
+ */
+inline static bool
+avx512vl_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuidex(7, 0, exx);
+	return is_bit_set_in_exx(exx, EBX, 31); /* avx512-vl */
+}
+
+/*
+ * Check for CPU support for CPUID: sse4.2
+ */
+inline static bool
+sse42_available(void)
+{
+	exx_t exx[4] = {0, 0, 0, 0};
+
+	pg_getcpuid(1, exx);
+	return is_bit_set_in_exx(exx, ECX, 20); /* sse4.2 */
+}
+
+/****************************************************************************/
+/*                               Public API                                 */
+/****************************************************************************/
  /*
-  * Returns true if the CPU supports the instructions required for the AVX-512
-  * pg_popcount() implementation.
+  * Returns true if the CPU supports the instructions required for the
+  * AVX-512 pg_popcount() implementation.
   *
   * PA: The call to 'osxsave_available' MUST preceed the call to
   *     'zmm_regs_available' function per NB above.
@@ -154,10 +208,19 @@ bool PGDLLIMPORT pg_popcount_avx512_available(void)
  */
 bool PGDLLIMPORT pg_crc32c_sse42_available(void)
 {
-	exx_t exx[4] = {0, 0, 0, 0};
-
-	pg_getcpuid(1, exx);
+	return sse42_available();
+}
 
-	return is_bit_set_in_exx(exx, ECX, 20);
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_crc32c implementation.
+ */
+bool PGDLLIMPORT
+pg_crc32c_avx512_available(void)
+{
+	return sse42_available() && osxsave_available() &&
+		   avx512f_available() && vpclmulqdq_available() &&
+		   avx512vl_available() && zmm_regs_available();
 }
 
+#endif // #if defined(PG_CPU_X86) || defined(PG_CPU_x86_64)
-- 
2.34.1

Attachment: v10-0004-Mark-pg_comp_crc32c-as-PGDLLIMPORT-for-Windows-b.patch (text/plain; charset=UTF-8)
From 6e8f557c857772b0c22607866d1b8930a67df05e Mon Sep 17 00:00:00 2001
From: Matthew Sterrett <matthew.sterrett@intel.com>
Date: Wed, 18 Dec 2024 14:11:33 -0800
Subject: [PATCH v10 4/4] Mark pg_comp_crc32c as PGDLLIMPORT for Windows build

---
 src/include/port/pg_crc32c.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index 690273506b..534d07dd5d 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -48,7 +48,7 @@ typedef uint32 pg_crc32c;
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
 extern pg_crc32c pg_comp_crc32c_avx512(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+extern PGDLLIMPORT pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c((crc), (data), (len)))
 
-- 
2.34.1

#59Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Sterrett, Matthew (#58)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hello! I'm Matthew Sterrett and I'm a coworker of Raghuveer; he asked me to
look into the Windows build failures related to pg_comp_crc32c.

It seems that the only thing that was required to fix that is to mark
pg_comp_crc32c as PGDLLIMPORT, so I added a patch that does just that.
I'm new to working with mailing lists, so please tell me if I messed anything up!

Thanks, Matthew, for fixing the Windows CI failure. It looks like CI passes with v10 (https://cirrus-ci.com/build/5105570367143936). Is there any additional feedback for this patch?
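
Some background on why the export matters: pg_comp_crc32c is a global function-pointer variable rather than a function, and on Windows a data symbol defined in another module generally has to be declared with __declspec(dllimport), which is what PGDLLIMPORT typically expands to there. The pointer-swapping pattern itself can be sketched as below. This is an illustration only, under stated assumptions: checksum_fn, sum_choose, sum_slow, sum_fast, have_fast_hw, and choose_calls are hypothetical names, and the "checksum" is a plain byte sum, not CRC-32C.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn_t) (uint32_t acc, const void *data, size_t len);

/* Counts how often the chooser runs (for demonstration only). */
static int	choose_calls = 0;

/* Hypothetical stand-in for a hardware probe; always says "no" here. */
static int
have_fast_hw(void)
{
	return 0;
}

/* Stand-in implementation: adds up the bytes (NOT a real CRC). */
static uint32_t
sum_slow(uint32_t acc, const void *data, size_t len)
{
	const unsigned char *p = (const unsigned char *) data;

	while (len--)
		acc += *p++;
	return acc;
}

/* Stand-in "accelerated" variant; same result as sum_slow. */
static uint32_t
sum_fast(uint32_t acc, const void *data, size_t len)
{
	return sum_slow(acc, data, len);
}

static uint32_t sum_choose(uint32_t acc, const void *data, size_t len);

/* The shared variable: it starts out pointing at the chooser. */
checksum_fn_t checksum_fn = sum_choose;

/*
 * Runs on the first call only: (1) repoint checksum_fn at the chosen
 * implementation, (2) forward the current call to it.
 */
static uint32_t
sum_choose(uint32_t acc, const void *data, size_t len)
{
	choose_calls++;
	checksum_fn = have_fast_hw() ? sum_fast : sum_slow;
	return checksum_fn(acc, data, len);
}
```

Callers always go through checksum_fn, so every call after the first goes straight to the chosen implementation without re-running the probe.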

Raghuveer

#60John Naylor
johncnaylorls@gmail.com
In reply to: Devulapalli, Raghuveer (#59)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Wed, Jan 22, 2025 at 12:46 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:

Is there any additional feedback for this patch?

Hi Raghuveer,

I raised one question and one concern upthread. I will repeat them
here for convenience.

#1 - The choice of AVX-512. There is no such thing as a "CRC
instruction operating on 8 bytes", and the proposed algorithm is a
multistep process using carryless multiplication and requiring at
least 256 bytes of input. The Chromium sources cited as the source for
this patch also contain an implementation using 128-bit instructions,
and which only requires at least 64 bytes of input. Is there a reason
that was not tested or proposed as well? That would be much easier to
read/maintain, work on more systems, and might give a speed boost on
smaller inputs. These are useful properties to have.

https://github.com/chromium/chromium/blob/main/third_party/zlib/crc32_simd.c#L215
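
For readers unfamiliar with the term, "carryless multiplication" is ordinary shift-and-add multiplication with the additions replaced by XOR, i.e. multiplication of polynomials over GF(2). PCLMULQDQ does this for 64-bit operands in hardware, and the `_mm512_clmulepi64_epi128` intrinsic used in the patch does it in each 128-bit lane of a 512-bit register. A minimal plain-C sketch of the operation on 32-bit operands follows; clmul32 is a hypothetical helper written for illustration and is not part of the patch or of any library.

```c
#include <stdint.h>

/*
 * Carry-less multiply of two 32-bit values, producing a 64-bit result.
 * Each set bit i of b contributes (a << i), combined with XOR rather
 * than addition, so no carries propagate between bit positions.
 */
static uint64_t
clmul32(uint32_t a, uint32_t b)
{
	uint64_t	result = 0;

	for (int i = 0; i < 32; i++)
	{
		if ((b >> i) & 1u)
			result ^= (uint64_t) a << i;
	}
	return result;
}
```

For example, clmul32(3, 3) is 5 rather than 9, because (x + 1)(x + 1) = x^2 + 1 when coefficients are reduced modulo 2.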

#2 - The legal status of the algorithm from the following Intel white
paper, which is missing from its original location, archived here:

https://web.archive.org/web/20220802143127/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

This algorithm is the most portable and can in fact be coded with
plain C, no additional instructions. The only disadvantage is that
with pure C it's only useful on input with hundreds of bytes. But that
limitation is not that different from the AVX-512 proposal in this
regard.

My question on this paper is about this passage:

"The basic concepts in this paper are derived from and explained in detail in
the patents and pending applications [4][5][6]."
...
[4]: Determining a Message Residue, Gopal et al. United States Patent 7,886,214
[5]: Determining a Message Residue, Gueron et al. United States Patent Application 20090019342
[6]: Determining a Message Residue, Gopal et al. United States Patent Application 20090158132

Looking at Linux kernel sources, it seems a patch using this technique
was contributed by Intel over a decade ago:

https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel-asm_64.S

...so I'm unclear if these patents are applicable to software
implementations. They also seem to be expired, but I am not a lawyer.
Could you look into this please? Even if we do end up with AVX-512,
this would be a good fallback.

--
John Naylor
Amazon Web Services

#61Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: John Naylor (#60)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi John,

Thanks for your summary. Here are my responses:

#1 - The choice of AVX-512. There is no such thing as a "CRC instruction operating
on 8 bytes", and the proposed algorithm is a multistep process using carryless
multiplication and requiring at least 256 bytes of input. The Chromium sources
cited as the source for this patch also contain an implementation using 128-bit
instructions, and which only requires at least 64 bytes of input. Is there a reason
that was not tested or proposed as well? That would be much easier to read/maintain,
work on more systems, and might give a speed boost on smaller inputs. These are
useful properties to have.

https://github.com/chromium/chromium/blob/main/third_party/zlib/crc32_simd.c#L215

Agreed. Postgres already has the SSE42 version, pg_comp_crc32c_sse42, but I didn’t
realize it uses the crc32 instruction, which processes only 8 bytes at a time. This can
certainly be upgraded to process 64 bytes at a time and should be faster. Since most
of the AVX-512 stuff is almost ready, I propose to do this in a follow-up patch immediately.
Let me know if you disagree. The AVX512 version processes 256 bytes at a time and will
most certainly be faster than the improved SSE42 version, which is why the chromium
library has both AVX512 and SSE42.
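
To illustrate the "8 bytes at a time" point: the SSE 4.2 crc32 instruction folds at most one 64-bit word into the running CRC per instruction. The following is a minimal sketch of that style of loop and is not the actual pg_comp_crc32c_sse42 source. The names crc32c_update and crc32c are hypothetical, the code assumes GCC or Clang, and on targets other than x86-64 it falls back to a bit-by-bit loop so that it still compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <nmmintrin.h>

/* Hardware path: one crc32 instruction per 8 input bytes. */
__attribute__((target("sse4.2")))
static uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = (const unsigned char *) data;

	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, p, 8);	/* alignment-safe 8-byte load */
		crc = (uint32_t) _mm_crc32_u64(crc, chunk);
		p += 8;
		len -= 8;
	}
	while (len--)				/* trailing 0..7 bytes, one at a time */
		crc = _mm_crc32_u8(crc, *p++);
	return crc;
}
#else

/* Portable fallback: reflected CRC-32C polynomial, one bit at a time. */
static uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = (const unsigned char *) data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}
#endif

/* Conventional CRC-32C: all-ones initial value, inverted at the end. */
static uint32_t
crc32c(const void *data, size_t len)
{
	return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

With this convention, crc32c("123456789", 9) yields 0xE3069283, the standard CRC-32C check value. Each crc32 instruction in the loop depends on the result of the previous one, so throughput is limited by the instruction's latency; that dependency chain is what the wider folding approaches discussed in this thread try to break.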

#2 - The legal status of the algorithm from the following Intel white paper, which is
missing from its original location, archived here:

https://web.archive.org/web/20220802143127/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel-asm_64.S

...so I'm unclear if these patents are applicable to software implementations.
They also seem to be expired, but I am not a lawyer.
Could you look into this please? Even if we do end up with AVX-512, this would be
a good fallback.

Given that SSE42 is available in pretty much all x86 processors at this point, do we need a
fallback C version, especially after we improve the SSE42 version?

Raghuveer

#62John Naylor
johncnaylorls@gmail.com
In reply to: Devulapalli, Raghuveer (#61)
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

On Sat, Jan 25, 2025 at 3:35 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:

#1 - The choice of AVX-512. There is no such thing as a "CRC instruction operating
on 8 bytes", and the proposed algorithm is a multistep process using carryless
multiplication and requiring at least 256 bytes of input. The Chromium sources
cited as the source for this patch also contain an implementation using 128-bit
instructions, and which only requires at least 64 bytes of input. Is there a reason
that was not tested or proposed as well? That would be much easier to read/maintain,
work on more systems, and might give a speed boost on smaller inputs. These are
useful properties to have.

https://github.com/chromium/chromium/blob/main/third_party/zlib/crc32_simd.c#L215

Agreed. Postgres already has the SSE42 version, pg_comp_crc32c_sse42, but I didn’t
realize it uses the crc32 instruction, which processes only 8 bytes at a time. This can
certainly be upgraded to process 64 bytes at a time and should be faster. Since most
of the AVX-512 stuff is almost ready, I propose to do this in a follow-up patch immediately.

It doesn't make sense to me that more limited/difficult hardware
support (and more complex coding for that) and a larger input
threshold should be a prerequisite for something that doesn't have
these disadvantages.

Let me know if you disagree. The AVX512 version processes 256 bytes at a time and will
most certainly be faster than the improved SSE42 version, which is why the chromium
library has both AVX512 and SSE42.

It looks like chromium simply vendored the zlib library. Input
destined for compression is always going to be "large". That's not
true in general for our use case, and we mentioned that fact seven
months ago, when Andres said upthread [1]: "This is extremely workload
dependent, it's not hard to find workloads with lots of very small
record and very few big ones...". Given that feedback, it would have
made a lot of sense to mention the 64-byte alternative back then,
especially since it's the exact same pclmul algorithm based on the
same paper, and is found in the same zlib .c file, but for some reason
that was not done.

More broadly, the best strategy is to start with the customer and work
backward to the technology. It's more risky to pick the technology
upfront and try to find ways to use it. My goal here is to help you
make the right tradeoffs. Here's my view:

1. If we can have a relatively low input size threshold for
improvement, it's possibly worth a bit of additional complexity in
configure and run-time checks. There is a complicating factor in
testing that though: the latency of carryless multiplication
instructions varies drastically on different microarchitectures.
2. If we can improve large inputs in a simple fashion, with no
additional hardware support, that's worth doing in any case.
3. Complex hardware support (6 CPUIDs!) that only works on large
inputs (a minority of workloads) looks to be the worst of both worlds
and it's not the tradeoff we should make.
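
For reference, the six run-time checks behind point 3 (SSE 4.2, OSXSAVE, ZMM register state, AVX512F, AVX512VL, and VPCLMULQDQ) can be sketched roughly as follows. This is not the patch's pg_hw_feat_check.c: read_xcr0 and avx512_crc_usable are hypothetical names, the code assumes GCC or Clang with <cpuid.h>, and the XCR0 mask is an assumption about what an OS-support check such as zmm_regs_available() needs to test.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Read XCR0. Only legal after CPUID has reported OSXSAVE. */
static uint64_t
read_xcr0(void)
{
	uint32_t	lo;
	uint32_t	hi;

	__asm__ __volatile__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	return ((uint64_t) hi << 32) | lo;
}

static bool
avx512_crc_usable(void)
{
	unsigned int a, b, c, d;

	if (!__get_cpuid(1, &a, &b, &c, &d))
		return false;
	if (!(c & (1u << 20)))		/* leaf 1 ECX bit 20: SSE 4.2 */
		return false;
	if (!(c & (1u << 27)))		/* leaf 1 ECX bit 27: OSXSAVE */
		return false;

	/* XCR0 bits 1,2 (SSE, AVX) and 5,6,7 (opmask, ZMM state) => 0xE6 */
	if ((read_xcr0() & 0xE6) != 0xE6)
		return false;

	if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
		return false;
	return (b & (1u << 16)) &&	/* leaf 7 EBX bit 16: AVX512F */
		(b & (1u << 31)) &&		/* leaf 7 EBX bit 31: AVX512VL */
		(c & (1u << 10));		/* leaf 7 ECX bit 10: VPCLMULQDQ */
}
#else

/* Not an x86 target: the AVX-512 path can never be used. */
static bool
avx512_crc_usable(void)
{
	return false;
}
#endif
```

The ordering matters: XGETBV faults if the OS has not enabled XSAVE, so the OSXSAVE bit has to be tested before XCR0 is read.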

Further, we verified upthread that Intel's current and near-future
product line includes server chips (some with over 100 cores, so not
exactly low-end) that don't support AVX-512 at all. I have no idea how
common they will be, but they will certainly be found in cloud
datacenters somewhere. Shouldn't we have an answer for them as well?

#2 - The legal status of the algorithm from the following Intel white paper, which is
missing from its original location, archived here:

https://web.archive.org/web/20220802143127/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel-asm_64.S

...so I'm unclear if these patents are applicable to software implementations.
They also seem to be expired, but I am not a lawyer.
Could you look into this please? Even if we do end up with AVX-512, this would be
a good fallback.

Given that SSE42 is available in pretty much all x86 processors at this point, do we need a
fallback C version, especially after we improve the SSE42 version?

I know you had extended time off work, but I've already shared my
findings and explained my reasoning [2]. The title of the paper is
"Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction",
so unsurprisingly it does improve the SSE42 version. With a few dozen
lines of code, I can get ~3x speedup on page-sized inputs. At the very
least we want to use this technique on Arm [3], and the only blocker
now is the question regarding the patents. I'm interested to hear the
response on this.

[1]: /messages/by-id/20240612201135.kk77tiqcux77lgev@awork3.anarazel.de
[2]: /messages/by-id/CANWCAZbr4sO1bPoS+E=iRWnrBZp7zUKZEJk39KYt_Pu9+X1-SQ@mail.gmail.com
[3]: https://commitfest.postgresql.org/51/4620/

--
John Naylor
Amazon Web Services

#63Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: John Naylor (#62)
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

Hi John,

Further, we verified upthread that Intel's current and near-future product line
includes server chips (some with over 100 cores, so not exactly low-end) that
don't support AVX-512 at all. I have no idea how common they will be, but they
will certainly be found in cloud datacenters somewhere. Shouldn't we have an
answer for them as well?

I just submitted a patch to improve the SSE4.2 version using the source you referenced. See
/messages/by-id/PH8PR11MB82869FF741DFA4E9A029FF13FBF72@PH8PR11MB8286.namprd11.prod.outlook.com

I know you had extended time off work, but I've already shared my findings and
explained my reasoning [2]. The title of the paper is "Fast CRC Computation for
iSCSI Polynomial Using CRC32 Instruction", so unsurprisingly it does improve the
SSE42 version. With a few dozen lines of code, I can get ~3x speedup on page-
sized inputs. At the very least we want to use this technique on Arm [3], and the
only blocker now is the question regarding the patents. I'm interested to hear the
response on this.

Still figuring this out. Will respond as soon as I can.

Thanks,
Raghuveer