always use runtime checks for CRC-32C instructions
This is an offshoot of the "CRC32C Parallel Computation Optimization on
ARM" thread [0]/messages/by-id/DB9PR08MB6991329A73923BF8ED4B3422F5DBA@DB9PR08MB6991.eurprd08.prod.outlook.com. I intend for this to be a prerequisite patch set.
Presently, for the SSE 4.2 and ARMv8 CRC instructions used in the CRC32C
code for WAL records, etc., we first check if the intrinsics are available
with the default compiler flags. If so, we only bother compiling the
implementation that uses those intrinsics. If not, we also check whether
the intrinsics are available with some extra CFLAGS, and if they are, we
compile both the implementation that uses the intrinsics as well as a
fallback implementation that doesn't require any special instructions.
Then, at runtime, we check what's available in the hardware and choose the
appropriate CRC32C implementation.
The aforementioned other thread [0]/messages/by-id/DB9PR08MB6991329A73923BF8ED4B3422F5DBA@DB9PR08MB6991.eurprd08.prod.outlook.com aims to further optimize this code by
using another instruction that requires additional configure and/or runtime
checks. $SUBJECT has been in the back of my mind for a while, but given
proposals to add further complexity to this code, I figured it might be a
good time to propose this simplification. Specifically, I think we
shouldn't worry about trying to compile only the special instrinics
versions, and instead always try to build both and choose the appropriate
one at runtime.
AFAICT the trade-offs aren't too bad. With some simple testing, I see that
the runtime check occurs once at startup, so I don't anticipate any
noticeable performance impact. I suppose each process might need to do the
check in EXEC_BACKEND builds, but even so, I suspect the difference is
negligible.
I also see that the SSE 4.2 runtime check requires the CPUID instruction,
so we wouldn't use the instrinsics for hardware that supports SSE 4.2 but
not CPUID. However, I'm not sure such hardware even exists. Wikipedia
says that CPUID was introduced in 1993 [1]https://en.wikipedia.org/wiki/CPUID, and meson.build appears to omit
the CPUID check when determining which CRC32C implementation to use.
Furthermore, meson.build alludes to problems with some of the CPUID-related
checks:
# XXX: The configure.ac check for __cpuid() is broken, we don't copy that
# here. To prevent problems due to two detection methods working, stop
# checking after one.
Are there any other reasons that we should try to avoid the runtime check
when possible?
I've attached two patches. 0001 adds a debug message to the SSE 4.2
runtime check that matches the one already present for the ARMv8 check.
This message just notes whether the runtime check found that the special
CRC instructions are available. 0002 is a first attempt at $SUBJECT. I've
tested it on both x86 and ARM, and it seems to work as intended. You'll
notice that I'm still checking for the intrinsics with the default compiler
flags first. I didn't see any strong reason to change this, and doing so
allows us to avoid sending extra CFLAGS when possible.
Thoughts?
[0]: /messages/by-id/DB9PR08MB6991329A73923BF8ED4B3422F5DBA@DB9PR08MB6991.eurprd08.prod.outlook.com
[1]: https://en.wikipedia.org/wiki/CPUID
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-add-debug-message.patchtext/x-diff; charset=us-asciiDownload
From 964afc75976cce8d712d97f346d2b8eea9c1f1ee Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 28 Oct 2023 22:12:45 -0500
Subject: [PATCH v1 1/2] add debug message
---
src/port/pg_crc32c_sse42_choose.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 41ff4a35ad..3689c38e92 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -30,10 +30,15 @@
#include "port/pg_crc32c.h"
+#ifndef FRONTEND
+#include "utils/elog.h"
+#endif
+
static bool
pg_crc32c_sse42_available(void)
{
unsigned int exx[4] = {0, 0, 0, 0};
+ bool result;
#if defined(HAVE__GET_CPUID)
__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
@@ -43,7 +48,13 @@ pg_crc32c_sse42_available(void)
#error cpuid instruction not available
#endif
- return (exx[2] & (1 << 20)) != 0; /* SSE 4.2 */
+ result = ((exx[2] & (1 << 20)) != 0); /* SSE 4.2 */
+
+#ifndef FRONTEND
+ elog(DEBUG1, "using sse42 crc32 hardware = %d", result);
+#endif
+
+ return result;
}
/*
--
2.37.1 (Apple Git-137.1)
v1-0002-always-use-runtime-checks-for-sse4.2-armv8-crc32c.patchtext/x-diff; charset=us-asciiDownload
From 7b8efae00327728000f7650a513ab0fd4fb15cd5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 28 Oct 2023 21:16:19 -0500
Subject: [PATCH v1 2/2] always use runtime checks for sse4.2/armv8 crc32c code
---
configure | 122 ++++++++++-------------------------
configure.ac | 104 ++++++++++-------------------
meson.build | 31 +++------
src/include/pg_config.h.in | 6 --
src/include/port/pg_crc32c.h | 19 +-----
src/port/meson.build | 2 -
src/tools/msvc/Solution.pm | 2 -
7 files changed, 80 insertions(+), 206 deletions(-)
diff --git a/configure b/configure
index cfd968235f..47372dcd18 100755
--- a/configure
+++ b/configure
@@ -17885,28 +17885,6 @@ fi
fi
-# Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
-# define __SSE4_2__ in that case.
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h. */
-
-int
-main ()
-{
-
-#ifndef __SSE4_2__
-#error __SSE4_2__ not defined
-#endif
-
- ;
- return 0;
-}
-_ACEOF
-if ac_fn_c_try_compile "$LINENO"; then :
- SSE4_2_TARGETED=1
-fi
-rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
-
# Check for ARMv8 CRC Extension intrinsics to do CRC calculations.
#
# First check if __crc32c* intrinsics can be used with the default compiler
@@ -18040,50 +18018,36 @@ fi
# Select CRC-32C implementation.
#
-# If we are targeting a processor that has Intel SSE 4.2 instructions, we can
-# use the special CRC instructions for calculating CRC-32C. If we're not
-# targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, perhaps with some extra CFLAGS, compile both
+# If we are targeting a processor that has Intel SSE 4.2 instructions, or we're
+# not targeting such a processor, but we can nevertheless produce code that
+# uses the SSE intrinsics, perhaps with some extra CFLAGS, compile both
# implementations and select which one to use at runtime, depending on whether
# SSE 4.2 is supported by the processor we're running on.
#
# Similarly, if we are targeting an ARM processor that has the CRC
-# instructions that are part of the ARMv8 CRC Extension, use them. And if
-# we're not targeting such a processor, but can nevertheless produce code that
-# uses the CRC instructions, compile both, and select at runtime.
-#
-# You can skip the runtime check by setting the appropriate USE_*_CRC32 flag to 1
-# in the template or configure command line.
+# instructions that are part of the ARMv8 CRC Extension, or if we're not
+# targeting such a processor, but can nevertheless produce code that uses the
+# CRC instructions, compile both, and select at runtime.
#
# If we are targeting a LoongArch processor, CRC instructions are
# always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
- # Use Intel SSE 4.2 if available.
- if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
- USE_SSE42_CRC32C=1
- else
- # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
- # the runtime check.
- if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+ # Use Intel SSE 4.2 if available, with runtime check. The CPUID instruction
+ # is needed for the runtime check.
+ if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+ else
+ # Use ARM CRC Extension if available, with runtime check.
+ if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+ USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
else
- # Use ARM CRC Extension if available.
- if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
- USE_ARMV8_CRC32C=1
+ # LoongArch CRCC instructions.
+ if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+ USE_LOONGARCH_CRC32C=1
else
- # ARM CRC Extension, with runtime check?
- if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
- USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
- else
- # LoongArch CRCC instructions.
- if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
- USE_LOONGARCH_CRC32C=1
- else
- # fall back to slicing-by-8 algorithm, which doesn't require any
- # special CPU support.
- USE_SLICING_BY_8_CRC32C=1
- fi
- fi
+ # fall back to slicing-by-8 algorithm, which doesn't require any
+ # special CPU support.
+ USE_SLICING_BY_8_CRC32C=1
fi
fi
fi
@@ -18092,54 +18056,36 @@ fi
# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking which CRC-32C implementation to use" >&5
$as_echo_n "checking which CRC-32C implementation to use... " >&6; }
-if test x"$USE_SSE42_CRC32C" = x"1"; then
-
-$as_echo "#define USE_SSE42_CRC32C 1" >>confdefs.h
-
- PG_CRC32C_OBJS="pg_crc32c_sse42.o"
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
-$as_echo "SSE 4.2" >&6; }
-else
- if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
$as_echo "#define USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
- PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
+ PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5
$as_echo "SSE 4.2 with runtime check" >&6; }
- else
- if test x"$USE_ARMV8_CRC32C" = x"1"; then
-
-$as_echo "#define USE_ARMV8_CRC32C 1" >>confdefs.h
-
- PG_CRC32C_OBJS="pg_crc32c_armv8.o"
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions" >&5
-$as_echo "ARMv8 CRC instructions" >&6; }
- else
- if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+else
+ if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
$as_echo "#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK 1" >>confdefs.h
- PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
+ PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: ARMv8 CRC instructions with runtime check" >&5
$as_echo "ARMv8 CRC instructions with runtime check" >&6; }
- else
- if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+ else
+ if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
$as_echo "#define USE_LOONGARCH_CRC32C 1" >>confdefs.h
- PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
+ PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: LoongArch CRCC instructions" >&5
$as_echo "LoongArch CRCC instructions" >&6; }
- else
+ else
$as_echo "#define USE_SLICING_BY_8_CRC32C 1" >>confdefs.h
- PG_CRC32C_OBJS="pg_crc32c_sb8.o"
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
+ PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: slicing-by-8" >&5
$as_echo "slicing-by-8" >&6; }
- fi
- fi
fi
fi
fi
diff --git a/configure.ac b/configure.ac
index f220b379b3..10286e415b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2081,14 +2081,6 @@ if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then
PGAC_SSE42_CRC32_INTRINSICS([-msse4.2])
fi
-# Are we targeting a processor that supports SSE 4.2? gcc, clang and icc all
-# define __SSE4_2__ in that case.
-AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [
-#ifndef __SSE4_2__
-#error __SSE4_2__ not defined
-#endif
-])], [SSE4_2_TARGETED=1])
-
# Check for ARMv8 CRC Extension intrinsics to do CRC calculations.
#
# First check if __crc32c* intrinsics can be used with the default compiler
@@ -2109,50 +2101,36 @@ AC_SUBST(CFLAGS_CRC)
# Select CRC-32C implementation.
#
-# If we are targeting a processor that has Intel SSE 4.2 instructions, we can
-# use the special CRC instructions for calculating CRC-32C. If we're not
-# targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, perhaps with some extra CFLAGS, compile both
+# If we are targeting a processor that has Intel SSE 4.2 instructions, or we're
+# not targeting such a processor, but we can nevertheless produce code that
+# uses the SSE intrinsics, perhaps with some extra CFLAGS, compile both
# implementations and select which one to use at runtime, depending on whether
# SSE 4.2 is supported by the processor we're running on.
#
# Similarly, if we are targeting an ARM processor that has the CRC
-# instructions that are part of the ARMv8 CRC Extension, use them. And if
-# we're not targeting such a processor, but can nevertheless produce code that
-# uses the CRC instructions, compile both, and select at runtime.
-#
-# You can skip the runtime check by setting the appropriate USE_*_CRC32 flag to 1
-# in the template or configure command line.
+# instructions that are part of the ARMv8 CRC Extension, or if we're not
+# targeting such a processor, but can nevertheless produce code that uses the
+# CRC instructions, compile both, and select at runtime.
#
# If we are targeting a LoongArch processor, CRC instructions are
# always available (at least on 64 bit), so no runtime check is needed.
-if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
- # Use Intel SSE 4.2 if available.
- if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then
- USE_SSE42_CRC32C=1
- else
- # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for
- # the runtime check.
- if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
+if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then
+ # Use Intel SSE 4.2 if available, with runtime check. The CPUID instruction
+ # is needed for the runtime check.
+ if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1
+ else
+ # Use ARM CRC Extension if available, with runtime check.
+ if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
+ USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
else
- # Use ARM CRC Extension if available.
- if test x"$pgac_armv8_crc32c_intrinsics" = x"yes" && test x"$CFLAGS_CRC" = x""; then
- USE_ARMV8_CRC32C=1
+ # LoongArch CRCC instructions.
+ if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
+ USE_LOONGARCH_CRC32C=1
else
- # ARM CRC Extension, with runtime check?
- if test x"$pgac_armv8_crc32c_intrinsics" = x"yes"; then
- USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK=1
- else
- # LoongArch CRCC instructions.
- if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then
- USE_LOONGARCH_CRC32C=1
- else
- # fall back to slicing-by-8 algorithm, which doesn't require any
- # special CPU support.
- USE_SLICING_BY_8_CRC32C=1
- fi
- fi
+ # fall back to slicing-by-8 algorithm, which doesn't require any
+ # special CPU support.
+ USE_SLICING_BY_8_CRC32C=1
fi
fi
fi
@@ -2160,36 +2138,24 @@ fi
# Set PG_CRC32C_OBJS appropriately depending on the selected implementation.
AC_MSG_CHECKING([which CRC-32C implementation to use])
-if test x"$USE_SSE42_CRC32C" = x"1"; then
- AC_DEFINE(USE_SSE42_CRC32C, 1, [Define to 1 use Intel SSE 4.2 CRC instructions.])
- PG_CRC32C_OBJS="pg_crc32c_sse42.o"
- AC_MSG_RESULT(SSE 4.2)
+if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+ AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
+ PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
+ AC_MSG_RESULT(SSE 4.2 with runtime check)
else
- if test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
- AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check.])
- PG_CRC32C_OBJS="pg_crc32c_sse42.o pg_crc32c_sb8.o pg_crc32c_sse42_choose.o"
- AC_MSG_RESULT(SSE 4.2 with runtime check)
+ if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
+ AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
+ PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
+ AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
else
- if test x"$USE_ARMV8_CRC32C" = x"1"; then
- AC_DEFINE(USE_ARMV8_CRC32C, 1, [Define to 1 to use ARMv8 CRC Extension.])
- PG_CRC32C_OBJS="pg_crc32c_armv8.o"
- AC_MSG_RESULT(ARMv8 CRC instructions)
+ if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
+ AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
+ PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
+ AC_MSG_RESULT(LoongArch CRCC instructions)
else
- if test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then
- AC_DEFINE(USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARMv8 CRC Extension with a runtime check.])
- PG_CRC32C_OBJS="pg_crc32c_armv8.o pg_crc32c_sb8.o pg_crc32c_armv8_choose.o"
- AC_MSG_RESULT(ARMv8 CRC instructions with runtime check)
- else
- if test x"$USE_LOONGARCH_CRC32C" = x"1"; then
- AC_DEFINE(USE_LOONGARCH_CRC32C, 1, [Define to 1 to use LoongArch CRCC instructions.])
- PG_CRC32C_OBJS="pg_crc32c_loongarch.o"
- AC_MSG_RESULT(LoongArch CRCC instructions)
- else
- AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
- PG_CRC32C_OBJS="pg_crc32c_sb8.o"
- AC_MSG_RESULT(slicing-by-8)
- fi
- fi
+ AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).])
+ PG_CRC32C_OBJS="pg_crc32c_sb8.o"
+ AC_MSG_RESULT(slicing-by-8)
fi
fi
fi
diff --git a/meson.build b/meson.build
index 2d516c8f37..65e00e1538 100644
--- a/meson.build
+++ b/meson.build
@@ -1983,17 +1983,16 @@ endif
###############################################################
# Select CRC-32C implementation.
#
-# If we are targeting a processor that has Intel SSE 4.2 instructions, we can
-# use the special CRC instructions for calculating CRC-32C. If we're not
-# targeting such a processor, but we can nevertheless produce code that uses
-# the SSE intrinsics, perhaps with some extra CFLAGS, compile both
+# If we are targeting a processor that has Intel SSE 4.2 instructions, or we're
+# not targeting such a processor, but we can nevertheless produce code that
+# uses the SSE intrinsics, perhaps with some extra CFLAGS, compile both
# implementations and select which one to use at runtime, depending on whether
# SSE 4.2 is supported by the processor we're running on.
#
# Similarly, if we are targeting an ARM processor that has the CRC
-# instructions that are part of the ARMv8 CRC Extension, use them. And if
-# we're not targeting such a processor, but can nevertheless produce code that
-# uses the CRC instructions, compile both, and select at runtime.
+# instructions that are part of the ARMv8 CRC Extension, or if we're not
+# targeting such a processor, but can nevertheless produce code that uses the
+# CRC instructions, compile both, and select at runtime.
###############################################################
have_optimized_crc = false
@@ -2001,7 +2000,6 @@ cflags_crc = []
if host_cpu == 'x86' or host_cpu == 'x86_64'
if cc.get_id() == 'msvc'
- cdata.set('USE_SSE42_CRC32C', false)
cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
have_optimized_crc = true
else
@@ -2020,16 +2018,12 @@ int main(void)
'''
if cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2',
- args: test_c_args)
- # Use Intel SSE 4.2 unconditionally.
- cdata.set('USE_SSE42_CRC32C', 1)
- have_optimized_crc = true
- elif cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
+ args: test_c_args) or \
+ cc.links(prog, name: '_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2',
args: test_c_args + ['-msse4.2'])
# Use Intel SSE 4.2, with runtime check. The CPUID instruction is needed for
# the runtime check.
cflags_crc += '-msse4.2'
- cdata.set('USE_SSE42_CRC32C', false)
cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1)
have_optimized_crc = true
endif
@@ -2055,15 +2049,10 @@ int main(void)
'''
if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
- args: test_c_args)
- # Use ARM CRC Extension unconditionally
- cdata.set('USE_ARMV8_CRC32C', 1)
- have_optimized_crc = true
- elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
+ args: test_c_args) or \
+ cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
args: test_c_args + ['-march=armv8-a+crc'])
# Use ARM CRC Extension, with runtime check
- cflags_crc += '-march=armv8-a+crc'
- cdata.set('USE_ARMV8_CRC32C', false)
cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
have_optimized_crc = true
endif
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index d8a2985567..dd537c3169 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -683,9 +683,6 @@
/* Define to 1 if strerror_r() returns int. */
#undef STRERROR_R_INT
-/* Define to 1 to use ARMv8 CRC Extension. */
-#undef USE_ARMV8_CRC32C
-
/* Define to 1 to use ARMv8 CRC Extension with a runtime check. */
#undef USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
@@ -732,9 +729,6 @@
/* Define to 1 to use software CRC-32C implementation (slicing-by-8). */
#undef USE_SLICING_BY_8_CRC32C
-/* Define to 1 use Intel SSE 4.2 CRC instructions. */
-#undef USE_SSE42_CRC32C
-
/* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
#undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index d085f1dc00..cd2ccebfd6 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -41,24 +41,7 @@ typedef uint32 pg_crc32c;
#define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
#define EQ_CRC32C(c1, c2) ((c1) == (c2))
-#if defined(USE_SSE42_CRC32C)
-/* Use Intel SSE4.2 instructions. */
-#define COMP_CRC32C(crc, data, len) \
- ((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_ARMV8_CRC32C)
-/* Use ARMv8 CRC Extension instructions. */
-
-#define COMP_CRC32C(crc, data, len) \
- ((crc) = pg_comp_crc32c_armv8((crc), (data), (len)))
-#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-
-extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len);
-
-#elif defined(USE_LOONGARCH_CRC32C)
+#if defined(USE_LOONGARCH_CRC32C)
/* Use LoongArch CRCC instructions. */
#define COMP_CRC32C(crc, data, len) \
diff --git a/src/port/meson.build b/src/port/meson.build
index a0d0a9583a..6fc8ac9953 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -81,13 +81,11 @@ endif
# is true
replace_funcs_pos = [
# x86/x64
- ['pg_crc32c_sse42', 'USE_SSE42_CRC32C'],
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
# arm / aarch64
- ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_armv8_choose', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index a50f730260..496f11bf54 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -421,7 +421,6 @@ sub GenerateFiles
SIZEOF_VOID_P => $bits / 8,
STDC_HEADERS => 1,
STRERROR_R_INT => undef,
- USE_ARMV8_CRC32C => undef,
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK => undef,
USE_ASSERT_CHECKING => $self->{options}->{asserts} ? 1 : undef,
USE_BONJOUR => undef,
@@ -437,7 +436,6 @@ sub GenerateFiles
USE_OPENSSL => undef,
USE_PAM => undef,
USE_SLICING_BY_8_CRC32C => undef,
- USE_SSE42_CRC32C => undef,
USE_SSE42_CRC32C_WITH_RUNTIME_CHECK => 1,
USE_SYSTEMD => undef,
USE_SYSV_SEMAPHORES => undef,
--
2.37.1 (Apple Git-137.1)
Nathan Bossart <nathandbossart@gmail.com> writes:
The aforementioned other thread [0] aims to further optimize this code by
using another instruction that requires additional configure and/or runtime
checks. $SUBJECT has been in the back of my mind for a while, but given
proposals to add further complexity to this code, I figured it might be a
good time to propose this simplification. Specifically, I think we
shouldn't worry about trying to compile only the special instrinics
versions, and instead always try to build both and choose the appropriate
one at runtime.
On the one hand, I agree that we need to keep the complexity from
getting out of hand. On the other hand, I wonder if this approach
isn't optimizing for the wrong case. How many machines that PG 17
will ever be run on in production will lack SSE 4.2 (for Intel)
or ARMv8 instructions (on the ARM side)? It seems like a shame
to be burdening these instructions with a subroutine call for the
benefit of long-obsolete hardware versions. Maybe that overhead
is negligible, but it doesn't sound like you tried to measure it.
I'm not quite sure what to propose instead, though. I thought for
a little bit about a configure switch to select "test first" or
"pedal to the metal". But in practice, package builders would
probably have to select the conservative "test first" option; and
we know that the vast majority of modern installations use prebuilt
packages, so it's not clear that this answer would help many people.
Anyway, I agree that the cost of a one-time-per-process probe should
be negligible; it's the per-use cost that I worry about. If you can
do some measurements proving that that worry is ill-founded, then
I'm good with test-first.
regards, tom lane
On Mon, 2023-10-30 at 12:39 -0400, Tom Lane wrote:
It seems like a shame
to be burdening these instructions with a subroutine call for the
benefit of long-obsolete hardware versions.
It's already doing a call to pg_comp_crc32c_sse42() regardless, right?
I assume you are concerned about the call going through a function
pointer? If so, is it possible that setting a flag and then branching
would be better?
Also, if it's a concern, should we also consider making an inlineable
version of pg_comp_crc32c_sse42()?
Regards,
Jeff Davis
On Mon, Oct 30, 2023 at 12:39:23PM -0400, Tom Lane wrote:
On the one hand, I agree that we need to keep the complexity from
getting out of hand. On the other hand, I wonder if this approach
isn't optimizing for the wrong case. How many machines that PG 17
will ever be run on in production will lack SSE 4.2 (for Intel)
or ARMv8 instructions (on the ARM side)?
For the CRC instructions in use today, I wouldn't be surprised if that
number is pretty small, but for newer or optional instructions (like ARM's
PMULL), I don't think we'll be so lucky. Even if we do feel comfortable
assuming the presence of SSE 4.2, etc., we'll likely still need to add
runtime checks for future optimizations.
It seems like a shame
to be burdening these instructions with a subroutine call for the
benefit of long-obsolete hardware versions. Maybe that overhead
is negligible, but it doesn't sound like you tried to measure it.
When I went to measure this, I noticed that my relatively new x86 machine
with a relatively new version of gcc uses the runtime check. I then
skimmed through a few dozen buildfarm machines and found that, of all x86
and ARM machines that supported the specialized CRC instructions, only one
ARM machine did not use the runtime check. Of course, this is far from a
scientific data point, but it seems to indicate that the runtime check is
the norm.
(I still need to measure it.)
Anyway, I agree that the cost of a one-time-per-process probe should
be negligible; it's the per-use cost that I worry about. If you can
do some measurements proving that that worry is ill-founded, then
I'm good with test-first.
Will do.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Oct 30, 2023 at 01:48:29PM -0700, Jeff Davis wrote:
I assume you are concerned about the call going through a function
pointer? If so, is it possible that setting a flag and then branching
would be better?Also, if it's a concern, should we also consider making an inlineable
version of pg_comp_crc32c_sse42()?
I tested pg_waldump -z with 50M 65-byte records for the following
implementations on an ARM system:
* slicing-by-8 : ~3.08s
* proposed patches applied (runtime check) : ~2.44s
* only CRC intrinsics implementation compiled : ~2.42s
* forced inlining : ~2.38s
Avoiding the runtime check produced a 0.8% improvement, and forced inlining
produced another 1.7% improvement. In comparison, even the runtime check
implementation produced a 20.8% improvement over the slicing-by-8 one.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Oct 30, 2023 at 10:36:01PM -0500, Nathan Bossart wrote:
I tested pg_waldump -z with 50M 65-byte records for the following
implementations on an ARM system:* slicing-by-8 : ~3.08s
* proposed patches applied (runtime check) : ~2.44s
* only CRC intrinsics implementation compiled : ~2.42s
* forced inlining : ~2.38sAvoiding the runtime check produced a 0.8% improvement, and forced inlining
produced another 1.7% improvement. In comparison, even the runtime check
implementation produced a 20.8% improvement over the slicing-by-8 one.
After reflecting on these numbers for a bit, I think I'm still inclined to
do $SUBJECT. I considered the following:
* While it would be nice to gain a couple of percentage points for existing
hardware, I think we'll still end up doing runtime checks most of the
time once we add support for newer instructions.
* The performance improvements that the new instructions provide seem
likely to outweigh these small regressions, especially for workloads with
larger WAL records [0]/messages/by-id/20231025014539.GA977906@nathanxps13.
* From my quick scan of a few dozen machines on the buildfarm, it looks
like the runtime checks are already the norm, so the number of systems
that would be subject to a regression from v16 to v17 should be pretty
small, in theory. And this regression seems to be on the order of 1%
based on the numbers above.
Do folks think this is reasonable? Or should we instead try to squeeze
every last drop out of the current implementations by avoiding function
pointers, forcing inlining, etc.?
[0]: /messages/by-id/20231025014539.GA977906@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
On Mon, Oct 30, 2023 at 10:36:01PM -0500, Nathan Bossart wrote:
I tested pg_waldump -z with 50M 65-byte records for the following
implementations on an ARM system:* slicing-by-8 : ~3.08s
* proposed patches applied (runtime check) : ~2.44s
* only CRC intrinsics implementation compiled : ~2.42s
* forced inlining : ~2.38sAvoiding the runtime check produced a 0.8% improvement, and forced inlining
produced another 1.7% improvement. In comparison, even the runtime check
implementation produced a 20.8% improvement over the slicing-by-8 one.
I find these numbers fairly concerning. If you can see a
couple-of-percent slowdown on a macroscopic benchmark like pg_waldump,
that implies that the percentage slowdown considering the CRC
operation alone is much worse. So there may be other use-cases where
we would take a bigger relative hit.
* From my quick scan of a few dozen machines on the buildfarm, it looks
like the runtime checks are already the norm, so the number of systems
that would be subject to a regression from v16 to v17 should be pretty
small, in theory. And this regression seems to be on the order of 1%
based on the numbers above.
I did a more thorough scrape of the buildfarm results. Of 161 animals
currently reporting configure output on HEAD, we have
2 ARMv8 CRC instructions
36 ARMv8 CRC instructions with runtime check
2 LoongArch CRCC instructions
2 SSE 4.2
52 SSE 4.2 with runtime check
67 slicing-by-8
While that'd seem to support your conclusion, the two using ARM CRC
*without* a runtime check are my Apple M1 Mac animals (sifaka/indri);
and I see the same selection on my laptop. So one platform where
we'd clearly be taking a regression is M-series Macs; that's a pretty
popular platform. The two using SSE without a check are prion and
tayra. I notice those are using gcc 11; so perhaps the default cflags
have changed to include -msse4.2 recently? I couldn't see much other
pattern though. (Scraping results attached in case anybody wants to
look.)
Really this just reinforces my concern that doing a runtime check
all the time is on the wrong side of history. I grant that we've
got to do that for anything where the availability of the instruction
is really in serious question, but I'm not very convinced that that's
a majority situation on popular platforms.
regards, tom lane
Attachments:
I wrote:
I did a more thorough scrape of the buildfarm results. Of 161 animals
currently reporting configure output on HEAD, we have
Oh ... take "current" with a grain of salt there, because I just noticed
that I typo'd my search condition so that it collected results from all
systems that reported since 2022-Oct, rather than in the last month as
I'd intended. There are just 137 animals currently reporting.
Of those, I broke down the architectures reporting using slicing-by-8:
# select arch,count(*) from results where crc = 'slicing-by-8' group by 1 order by 1;
arch | count
--------------------+-------
aarch64 | 1
macppc | 1
mips64eb; -mabi=64 | 1
mips64el; -mabi=32 | 1
ppc64 (power7) | 4
ppc64 (power8) | 2
ppc64le | 7
ppc64le (power8) | 1
ppc64le (power9) | 15
riscv64 | 2
s390x (z15) | 14
sparc | 1
(12 rows)
The one machine using slicing-by-8 where there might be a better
alternative is arowana, which is CentOS 7 with a pretty ancient gcc
version. So I reject the idea that slicing-by-8 is an appropriate
baseline for comparisons. There isn't anybody who will see an
improvement over current behavior: in the population of interest,
just about all platforms are using CRC instructions with or without
a runtime check.
regards, tom lane
On Tue, Oct 31, 2023 at 03:16:16PM -0400, Tom Lane wrote:
Really this just reinforces my concern that doing a runtime check
all the time is on the wrong side of history. I grant that we've
got to do that for anything where the availability of the instruction
is really in serious question, but I'm not very convinced that that's
a majority situation on popular platforms.
Okay. With that in mind, I think the path forward for new instructions is
as follows:
* If the special CRC instructions can be used with the default compiler
flags, we can only use newer instructions if they can also be used with
the default compiler flags. (My M2 machine appears to add +crypto by
default, so I bet your buildfarm animals would fall into this bucket.)
* Otherwise, if the CRC instructions can be used with added flags (i.e.,
the runtime check path), we can do a runtime check for the new
instructions as well. (Most other buildfarm animals would fall into this
bucket.)
Any platform that can use the CRC instructions with default compiler flags
but not the new instructions wouldn't be able to take advantage of the
proposed optimization, but it also wouldn't be subject to the small
performance regression.
If we wanted to further eliminate runtime checks for SSE 4.2 and ARMv8,
then I think things become a little trickier, as having a compiler that
understands things like +crypto would mean that you're automatically
subject to the runtime check regression (assuming we proceed with the
proposed optimization). An alternate approach could be to only use newer
instructions if they are available with the default compiler flags, but
given the current state of the buildfarm, such optimizations might not get
much uptake for a while.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 31, 2023 at 03:42:33PM -0400, Tom Lane wrote:
The one machine using slicing-by-8 where there might be a better
alternative is arowana, which is CentOS 7 with a pretty ancient gcc
version. So I reject the idea that slicing-by-8 is an appropriate
baseline for comparisons. There isn't anybody who will see an
improvement over current behavior: in the population of interest,
just about all platforms are using CRC instructions with or without
a runtime check.
I only included the slicing-by-8 benchmark to demonstrate that 1) the CRC
computations are a big portion of that pg_waldump -z command and that 2)
the CRC instructions provide significant performance gains.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
Okay. With that in mind, I think the path forward for new instructions is
as follows:
* If the special CRC instructions can be used with the default compiler
flags, we can only use newer instructions if they can also be used with
the default compiler flags. (My M2 machine appears to add +crypto by
default, so I bet your buildfarm animals would fall into this bucket.)
* Otherwise, if the CRC instructions can be used with added flags (i.e.,
the runtime check path), we can do a runtime check for the new
instructions as well. (Most other buildfarm animals would fall into this
bucket.)
This seems like a reasonable proposal.
Any platform that can use the CRC instructions with default compiler flags
but not the new instructions wouldn't be able to take advantage of the
proposed optimization, but it also wouldn't be subject to the small
performance regression.
Check. For now I think that's fine. If we get to a place where this
policy is really leaving a lot of performance on the table, we can
revisit it ... but that situation is hypothetical and may remain so.
(It's worth noting also that a package builder can move the goalposts
at will, since our idea of "default flags" is really whatever the user
says to use.)
regards, tom lane
On Tue, Oct 31, 2023 at 04:12:40PM -0400, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
Okay. With that in mind, I think the path forward for new instructions is
as follows:* If the special CRC instructions can be used with the default compiler
flags, we can only use newer instructions if they can also be used with
the default compiler flags. (My M2 machine appears to add +crypto by
default, so I bet your buildfarm animals would fall into this bucket.)
* Otherwise, if the CRC instructions can be used with added flags (i.e.,
the runtime check path), we can do a runtime check for the new
instructions as well. (Most other buildfarm animals would fall into this
bucket.)This seems like a reasonable proposal.
Great. I think that leaves us with nothing left to do for this thread, so
I'll withdraw it from the commitfest and move the discussion back to the
original thread.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 31, 2023 at 03:38:17PM -0500, Nathan Bossart wrote:
On Tue, Oct 31, 2023 at 04:12:40PM -0400, Tom Lane wrote:
This seems like a reasonable proposal.
Great. I think that leaves us with nothing left to do for this thread, so
I'll withdraw it from the commitfest and move the discussion back to the
original thread.
(Also, thanks for the discussion.)
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com