Improve CRC32C performance on SSE4.2
This patch improves the performance of the SSE4.2 CRC32C algorithm. The current SSE4.2 implementation of CRC32C relies on the native crc32 instruction and processes 8 bytes at a time in a loop. The technique in this paper uses the pclmulqdq instruction and processes 64 bytes at a time. The algorithm is based on the SSE4.2 version of the crc32 computation from Chromium's copy of zlib, with modified constants for CRC32C computation. See:
https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
Microbenchmarks (generated with google benchmark using a standalone version of the same algorithms):
Comparing scalar_crc32c to sse42_crc32c (for various buffer sizes: 64, 128, 256, 512, 1024, 2048 bytes)
Benchmark                                  Time      CPU    Time Old  Time New  CPU Old  CPU New
------------------------------------------------------------------------------------------------
[scalar_crc32c vs. sse42_crc32c]/64     -0.8147  -0.8148         33         6       33        6
[scalar_crc32c vs. sse42_crc32c]/128    -0.8962  -0.8962         88         9       88        9
[scalar_crc32c vs. sse42_crc32c]/256    -0.9200  -0.9200        211        17      211       17
[scalar_crc32c vs. sse42_crc32c]/512    -0.9389  -0.9389        486        30      486       30
[scalar_crc32c vs. sse42_crc32c]/1024   -0.9452  -0.9452       1037        57     1037       57
[scalar_crc32c vs. sse42_crc32c]/2048   -0.9456  -0.9456       2140       116     2140      116
Raghuveer
On Thu, Feb 6, 2025 at 3:49 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
This patch improves the performance of the SSE4.2 CRC32C algorithm. The current SSE4.2 implementation of CRC32C relies on the native crc32 instruction and processes 8 bytes at a time in a loop. The technique in this paper uses the pclmulqdq instruction and processes 64 bytes at a time. The algorithm is based on the SSE4.2 version of the crc32 computation from Chromium's copy of zlib, with modified constants for CRC32C computation. See:
https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
Thanks for the patch!
Microbenchmarks (generated with google benchmark using a standalone version of the same algorithms):
|----------------------------------+---------------------+---------------+---------------|
| Benchmark                        | Buffer size (bytes) | Time Old (ns) | Time New (ns) |
|----------------------------------+---------------------+---------------+---------------|
| [scalar_crc32c vs. sse42_crc32c] |                  64 |            33 |             6 |
| [scalar_crc32c vs. sse42_crc32c] |                 128 |            88 |             9 |
| [scalar_crc32c vs. sse42_crc32c] |                 256 |           211 |            17 |
| [scalar_crc32c vs. sse42_crc32c] |                 512 |           486 |            30 |
| [scalar_crc32c vs. sse42_crc32c] |                1024 |          1037 |            57 |
| [scalar_crc32c vs. sse42_crc32c] |                2048 |          2140 |           116 |
|----------------------------------+---------------------+---------------+---------------|
I'm highly suspicious of these numbers because they show this version
is about 20x faster than "scalar", so relatively speaking 3x faster
than the AVX-512 proposal? I ran my own benchmarks with the attached
script using your test function from the other thread, and found this
patch is actually slower than master on anything smaller than 256
bytes. Luckily, that's easily fixable: It turns out the implementation
in the paper (and chromium) has a very inefficient finalization step,
using more carryless multiplications and plenty of other operations.
After the main loop, and at the end, it's much more efficient to
convert the 128-bit intermediate result directly into a CRC in the
usual way. See here for details:
https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
The author of this article also has an MIT-licensed program to
generate CRC implementations with specified requirements (including on
ARM):
https://github.com/corsix/fast-crc32
I generated a similar function for v2-0004 and this benchmark shows
it's faster than master on 128 bytes and above, which is encouraging
(see attached graph). As I mentioned in the other thread, we need to
be mindful of the fact that the latency of carryless multiplication
varies wildly on different microarchitectures. I did the benchmarks on
my older machine, which I believe has a latency of 7 cycles for this
instruction. Looking around *briefly*, the most recent x86 chips with
worse latency seem to be from around 2012-13 (e.g. Intel Silvermont
and AMD Piledriver). Chips newer than mine have a latency of 4-7
cycles. It seems okay to assume those who care the most about
performance will be on hardware less than 12 years old, but I'm open
to arguments to be more conservative here.
Large inputs would make the graph hard to read, so I'll just put the
results for 8kB here:
master: 1504ms
v1: 543ms
v2: 533ms
Some thoughts on building (not a complete review):
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -557,14 +557,19 @@ AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS],
[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics])])dnl
AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32], [Ac_cachevar],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>
+ #include <wmmintrin.h>
#if defined(__has_attribute) && __has_attribute (target)
- __attribute__((target("sse4.2")))
+ __attribute__((target("sse4.2,pclmul")))
It's probably okay to fold these together in the same compile-time
check, since both are fairly old by now, but for those following
along, pclmul is not in SSE 4.2 and is a bit newer. So this would
cause machines building on Nehalem (2008) to fail the check and go
back to slicing-by-8 with it written this way. That's a long time ago,
so I'm not sure if anyone would notice, and I think we could fix it
for people using a packaged binary by having a fallback wrapper
function that just calls the SSE 4.2 "tail", as 0002 calls it.
--
John Naylor
Amazon Web Services
Attachments:
- crc_results_20250209.png
- v2-0001-Add-more-test-coverage-for-crc32c.patch
- v2-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
- v2-0003-Improve-CRC32C-performance-on-SSE4.2.patch
- v2-0004-Shorter-version-from-corsix.patch
Hi John,
I'm highly suspicious of these numbers because they show this version
is about 20x faster than "scalar", so relatively speaking 3x faster
than the AVX-512 proposal?
Apologies for this. I was suspicious of this too, and it looks like I had unintentionally used the scalar version I wrote for testing CRC32C correctness (which computes the CRC one byte at a time) as the baseline. The code for microbenchmarking is all here: https://github.com/r-devulap/crc32c and this commit https://github.com/r-devulap/crc32c/commit/ca4d1b24fd8af87aab544fb1634523b6657325a0 fixes that. Rerunning the benchmarks gives me more sensible numbers:
Comparing scalar_crc32c to sse42_crc32c (from ./bench)
Benchmark                                  Time      CPU    Time Old  Time New  CPU Old  CPU New
------------------------------------------------------------------------------------------------
[scalar_crc32c vs. sse42_crc32c]/64     -0.0972  -0.0971          5         4        5        4
[scalar_crc32c vs. sse42_crc32c]/128    -0.3048  -0.3048          8         6        8        6
[scalar_crc32c vs. sse42_crc32c]/256    -0.4610  -0.4610         19        10       19       10
[scalar_crc32c vs. sse42_crc32c]/512    -0.6432  -0.6432         50        18       50       18
[scalar_crc32c vs. sse42_crc32c]/1024   -0.7192  -0.7192        121        34      121       34
[scalar_crc32c vs. sse42_crc32c]/2048   -0.7275  -0.7276        259        70      259       70
Luckily, that's easily fixable: It turns out the implementation
in the paper (and chromium) has a very inefficient finalization step,
using more carryless multiplications and plenty of other operations.
After the main loop, and at the end, it's much more efficient to
convert the 128-bit intermediate result directly into a CRC in the
usual way.
Thank you for pointing this out and also fixing it! This improves over the Chromium version by 10-25%, especially for smaller buffer sizes (64-512 bytes):
Comparing sse42_crc32c to corsix_crc32c (from ./bench)
Benchmark                                   Time      CPU    Time Old  Time New  CPU Old  CPU New
-------------------------------------------------------------------------------------------------
[sse42_crc32c vs. corsix_crc32c]/64      -0.2696  -0.2696          4         3        4        3
[sse42_crc32c vs. corsix_crc32c]/128     -0.1551  -0.1552          6         5        6        5
[sse42_crc32c vs. corsix_crc32c]/256     -0.1787  -0.1787         10         8       10        8
[sse42_crc32c vs. corsix_crc32c]/512     -0.1351  -0.1351         18        15       18       15
[sse42_crc32c vs. corsix_crc32c]/1024    -0.0972  -0.0972         34        31       34       31
[sse42_crc32c vs. corsix_crc32c]/2048    -0.0763  -0.0763         69        64       69       64
OVERALL_GEOMEAN                          -0.1544  -0.1544          0         0        0        0
I generated a similar function for v2-0004 and this benchmark shows it's faster than master on 128 bytes and above.
I ran the same drive_crc32c benchmark with the postgres infrastructure and found that your v2 SSE4.2 version from corsix is slower than pg_comp_crc32c_sse42 in the master branch when the buffer is < 128 bytes. I think the reason is that postgres is not using the -O3 flag to build the crc32c source files, so the compiler generates less than optimal code. Adding that flag fixes the regression for buffers of 64-128 bytes. Could you confirm that behavior on your end too?
---
src/port/pg_crc32c_sse42.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index a8c1e5609b..a350b1b93a 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -81,6 +81,7 @@ pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)
#define clmul_lo(a, b) (_mm_clmulepi64_si128((a), (b), 0))
#define clmul_hi(a, b) (_mm_clmulepi64_si128((a), (b), 17))
pg_attribute_no_sanitize_alignment()
+__attribute__((optimize("-O3")))
pg_attribute_target("sse4.2,pclmul")
pg_crc32c
pg_comp_crc32c_sse42(pg_crc32c crc0, const void *data, size_t len)
--
You could also just build with export CFLAGS="-O3" instead of adding the function attribute.
I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
May I ask which processor your older machine has? I am benchmarking on a Tiger Lake processor.
It's probably okay to fold these together in the same compile-time
check, since both are fairly old by now, but for those following
along, pclmul is not in SSE 4.2 and is a bit newer. So this would
cause machines building on Nehalem (2008) to fail the check and go
back to slicing-by-8 with it written this way.
Technically, the current version of the patch does not have a runtime cpuid check for pclmul, so it would crash with SIGILL on Nehalem (currently we only check for SSE4.2). This needs to be fixed by adding an additional cpuid check for pclmul; the code would then fall back to slicing-by-8 on Nehalem and use the new version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which one you would prefer.
Raghuveer
On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.
That matches my findings as well.
I think the reason is that postgres is not using the -O3 flag to build the crc32c source files, so the compiler generates less than optimal code. Adding that flag fixes the regression for buffers of 64-128 bytes. Could you confirm that behavior on your end too?
On my machine that still regresses compared to master in that range
(although by not as much) so I still think 128 bytes is the right
threshold.
The effect of -O3 with gcc14.2 is that the single-block loop (after
the 4-block loop) is unrolled. Unrolling adds branches and binary
space, so it'd be nice to avoid that even for systems that build with
-O3. I tried leaving out the single-block loop, so that the tail is
called for the remaining (up to 63) bytes, and it's actually better:
Latency averages:

| bytes | v2 (ms) | without single-block loop (ms) |
| 128   | 10.256  | 10.211                         |
| 144   | 11.393  | 10.987                         |
| 160   | 12.977  | 12.094                         |
| 176   | 14.364  | 12.927                         |
| 192   | 12.627  | 12.466                         |
Keeping the extra loop is probably better at this benchmark on newer
machines, but I don't think it's worth it. Microbenchmarks with fixed
sized input don't take branch mispredicts into account.
I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
May I ask which processor your older machine has? I am benchmarking on a Tiger Lake processor.
It's an i7-7500U / Kaby Lake.
It's probably okay to fold these together in the same compile-time
check, since both are fairly old by now, but for those following
along, pclmul is not in SSE 4.2 and is a bit newer. So this would
cause machines building on Nehalem (2008) to fail the check and go
back to slicing-by-8 with it written this way.

Technically, the current version of the patch does not have a runtime cpuid check for pclmul, so it would crash with SIGILL on Nehalem (currently we only check for SSE4.2). This needs to be fixed by adding an additional cpuid check for pclmul; the code would then fall back to slicing-by-8 on Nehalem and use the new version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which one you would prefer.
Okay, Nehalem is 17 years old, and the additional cpuid check would
still work on hardware 14-15 years old, so I think it's fine to bump
the requirement for runtime hardware support.
--
John Naylor
Amazon Web Services
Hello,
Attached is v3, which is the same as v2 with the added PCLMULQDQ runtime CPUID check.
I ran the same benchmark drive_crc32c with the postgres infrastructure and
found that your v2 sse42 version from corsix is slower than
pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.

That matches my findings as well.
Never mind, I was building using the Makefile, which doesn't seem to add any optimization flag by default. I switched to using meson, which uses -O2, and benchmarked using pgbench (with your script); this behavior goes away on my TGL. Here is what I measure with your v2 (and v3):
| bytes | master (ms) | sse4.2-v2 (ms) | ratio |
| 64 | 9.627 | 6.306 | 1.52 |
| 80 | 10.976 | 6.662 | 1.64 |
| 96 | 12.411 | 8.212 | 1.51 |
| 112 | 13.871 | 9.403 | 1.47 |
| 128 | 15.283 | 7.724 | 1.97 |
| 144 | 16.715 | 9.173 | 1.82 |
| 160 | 18.18 | 11.292 | 1.60 |
| 176 | 19.847 | 12.606 | 1.57 |
| 192 | 22.043 | 10.16 | 2.16 |
| 208 | 24.261 | 11.699 | 2.07 |
| 224 | 26.63 | 13.607 | 1.95 |
| 240 | 28.994 | 14.721 | 1.96 |
| 256 | 31.418 | 13.132 | 2.39 |
On my machine that still regresses compared to master in that range (although by
not as much) so I still think 128 bytes is the right threshold.
On my TGL, buffer sizes as small as 64 bytes see performance benefits.
The effect of -O3 with gcc14.2 is that the single-block loop (after the 4-block loop)
is unrolled. Unrolling adds branches and binary space, so it'd be nice to avoid that
even for systems that build with -O3.
Agreed. My perf data shows -O2 is just as good.
Okay, Nehalem is 17 years old, and the additional cpuid check would still work on
hardware 14-15 years old, so I think it's fine to bump the requirement for runtime
hardware support.
Sounds good. I updated the runtime check to include PCLMULQDQ. The new algorithm will run only on Westmere and newer CPUs.
Raghuveer
Attachments:
- v3-0001-Add-more-test-coverage-for-crc32c.patch
- v3-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
- v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch
- v3-0004-Shorter-version-from-corsix.patch
On Wed, Feb 12, 2025 at 4:34 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
On my machine that still regresses compared to master in that range (although by
not as much) so I still think 128 bytes is the right threshold.

On my TGL, buffer sizes as small as 64 bytes see performance benefits.
Yeah, I'm guessing it's because newer chips will have better IPC. We
need to take care not to regress for hardware that's only 5-10 years
old.
Attached is v4:
0001: Same perf test, just in case
0002-4: I redid adding the implementation (without the single-block
loop and with 128-byte threshold), but split out the steps for
reference, as a model for possible ARM support in the future. These
will all be squashed on commit. The upstream code has very long lines,
even after running pgindent, so some may find that objectionable. We
could easily turn some commas into semicolons, and then it'll wrap
more nicely. I just wanted to change as little as possible for now. (I
also need to check if I need to put more license text here..)
0005: This has a fleshed-out draft commit message, but otherwise is
just the same configure/choose support as v3.
Some review comments:
1. Some of the comments that only mention SSE 4.2 in the compile- and
run-time checks need to be updated.
Okay, Nehalem is 17 years old, and the additional cpuid check would still work on
hardware 14-15 years old, so I think it's fine to bump the requirement for runtime
hardware support.

Sounds good. I updated the runtime check to include PCLMULQDQ. The new algorithm will run only on Westmere and newer CPUs.
2. Unfortunately, there is another wrinkle that I failed to consider: If you
search the web for "VirtualBox pclmulqdq" you can see a few reports from not
very long ago that some hypervisors don't enable the CPUID for pclmul. I
don't know how big a problem that is in practice today, but it seems we
should actually have separate checks, with fallback. Sorry I didn't
think of this earlier.
3. Note: I left out the new test file from v3-0001. We should have
tests, but note we already have some CRC tests in
src/test/regress/sql/strings.sql -- let's put new ones there. Also,
for the longer strings we want to test, it's easier to read/verify to
use something like
SELECT crc32c(repeat('A', 128)::bytea);
Maybe it's sufficient to have 127, 128, 129 for lengths, and maybe a
couple more.
--
John Naylor
Amazon Web Services
Attachments:
- v4-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
- v4-0002-Vendor-SSE-implementation-from-https-github.com-c.patch
- v4-0003-Adjust-previous-commit-to-match-our-style-add-128.patch
- v4-0004-Run-pgindent-XXX-Some-lines-are-still-really-long.patch
- v4-0005-Improve-CRC32C-performance-on-x86_64.patch
Hi,
2. Unfortunately, there is another wrinkle that I failed to consider: If you search
the web for "VirtualBox pclmulqdq" you can see a few reports from not very long
ago that some hypervisors don't enable the CPUID for pclmul. I don't know how
big a problem that is in practice today, but it seems we should actually have
separate checks, with fallback. Sorry I didn't think of this earlier.
If someone is using a VM that doesn't support a 15-year-old feature, then I would argue performance is not
their top priority. But that's up to you. I will work on adding it unless you change your mind.
Also, do we really need to support both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK?
The former macro is used to enable running the SSE42 version without a runtime check
when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated
as we add PCLMUL and AVX-512 dispatch in the future. IMO, this additional complexity is not worth it.
The cpuid runtime dispatch runs just once when the postgres server starts up and would hardly affect performance.
Let me know what you think.
Raghuveer
On Wed, Feb 12, 2025 at 09:02:27PM +0000, Devulapalli, Raghuveer wrote:
Also, do we really need to have both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
features support? The former macro is used to enable running the SSE42 version without a runtime check
when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated
as we add the PCLMUL and AVX512 dispatch in the future. IMO, this additional complexity is not worth it.
The cpuid runtime dispatch runs just once when postgres server is first setup and would hardly affect performance.
Let me know what you think.
I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
overhead if possible. I looked at switching to always using runtime checks
for this stuff, and we concluded that we'd better not [0].
[0]: /messages/by-id/flat/20231030161706.GA3011@nathanxps13
--
nathan
I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
overhead if possible. I looked at switching to always using runtime checks for this
stuff, and we concluded that we'd better not [0].
Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that will look a little ugly; I might need to think about a cleaner way to do this.
Raghuveer
On Wed, Feb 12, 2025 at 09:48:57PM +0000, Devulapalli, Raghuveer wrote:
I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
overhead if possible. I looked at switching to always using runtime checks for this
stuff, and we concluded that we'd better not [0].

Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that will look a little ugly; I might need to think about a cleaner way to do this.
Well, I suspect the AVX-512 version will pretty much always need the
runtime check given that it's not available on a lot of newer hardware and
requires a bunch of extra runtime checks (see pg_popcount_avx512.c). But
it might be worth doing for PCLMUL. Otherwise, I think we'd have to leave
out the PCLMUL optimizations if built with -msse4.2 -mpclmul because we
don't want to regress existing -msse4.2 users with a runtime check.
--
nathan
Well, I suspect the AVX-512 version will pretty much always need the runtime
check given that it's not available on a lot of newer hardware and requires a
bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be
worth doing for PCLMUL. Otherwise, I think we'd have to leave out the PCLMUL
optimizations if built with -msse4.2 -mpclmul because we don't want to regress
existing -msse4.2 users with a runtime check.
Sounds good to me. Although, users building with just -msse4.2 will now encounter an
additional pclmul runtime check. That would be a regression unless they update to
building with both -msse4.2 and -mpclmul.
Raghuveer
On Wed, Feb 12, 2025 at 10:12:20PM +0000, Devulapalli, Raghuveer wrote:
Well, I suspect the AVX-512 version will pretty much always need the runtime
check given that it's not available on a lot of newer hardware and requires a
bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be
worth doing for PCLMUL. Otherwise, I think we'd have to leave out the PCLMUL
optimizations if built with -msse4.2 -mpclmul because we don't want to regress
existing -msse4.2 users with a runtime check.

Sounds good to me. Although, users building with just -msse4.2 will now encounter an
additional pclmul runtime check. That would be a regression unless they update to
building with both -msse4.2 and -mpclmul.
My thinking was that building with just -msse4.2 would cause the existing
SSE 4.2 implementation to be used (without the function pointer). That's
admittedly a bit goofy because they'd miss out on the PCLMUL optimization,
but things at least don't get any worse for them.
--
nathan
Sounds good to me. Although, users building with just -msse4.2 will
now encounter an additional pclmul runtime check. That would be a
regression unless they update to building with both -msse4.2 and -mpclmul.

My thinking was that building with just -msse4.2 would cause the existing SSE 4.2
implementation to be used (without the function pointer). That's admittedly a bit
goofy because they'd miss out on the PCLMUL optimization, but things at least
don't get any worse for them.
Right. We are only talking about a regression for the small portion of people who build
with -msse4.2 and run on Nehalem, or on a VM with pclmul disabled, where we will run
the cpuid check for pclmul and still pick the SSE4.2 version.
Raghuveer
On Thu, Feb 13, 2025 at 4:18 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
overhead if possible. I looked at switching to always using runtime checks
for this stuff, and we concluded that we'd better not [0].
For short lengths, I tried unrolling the loop into a switch statement,
as in the attached v5-0006 (the other new patches are fixes for CI).
That usually looks faster for me, but not on the length used under the
WAL insert lock. Usual caveat: Using small fixed-sized lengths in
benchmarks can be misleading, because branches are more easily
predicted.
It seems like for always using runtime checks we'd need to use
branching, rather than function pointers, as has been proposed
elsewhere.
Latency averages, three runs each:

| bytes | master (ms)           | v5 (ms)               |
| 20    | 3.622 / 3.573 / 3.599 | 4.550 / 4.327 / 4.320 |
| 64    | 7.791 / 7.920 / 7.888 | 5.064 / 4.934 / 5.020 |
| 80    | 8.076 / 8.140 / 8.150 | 4.904 / 4.786 / 4.942 |
| 96    | 8.853 / 8.897 / 8.914 | 5.392 / 5.376 / 5.367 |
| 112   | 9.867 / 9.825 / 9.869 | 5.730 / 5.859 / 5.734 |
--
John Naylor
Amazon Web Services
Attachments:
- v5-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
- v5-0002-Vendor-SSE-implementation-from-https-github.com-c.patch
- v5-0003-Adjust-previous-commit-to-match-our-style-add-128.patch
- v5-0004-Run-pgindent-XXX-Some-lines-are-still-really-long.patch
- v5-0005-Improve-CRC32C-performance-on-x86_64.patch
- v5-0006-Unroll-tail.patch
- v5-0007-Fix-32-bit-build.patch
- v5-0008-Allow-dev-test-to-build-on-Windows-for-CI-XXX-not.patch
On Thu, Feb 13, 2025 at 5:19 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Wed, Feb 12, 2025 at 10:12:20PM +0000, Devulapalli, Raghuveer wrote:
Well, I suspect the AVX-512 version will pretty much always need the runtime
check given that it's not available on a lot of newer hardware and requires a
bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be
worth doing for PCLMUL. Otherwise, I think we'd have to leave out the PCLMUL
optimizations if built with -msse4.2 -mpclmul because we don't want to regress
existing -msse4.2 users with a runtime check.

Sounds good to me. Although, users building with just -msse4.2 will now
encounter an additional pclmul runtime check. That would be a regression
unless they update to building with both -msse4.2 and -mpclmul.

My thinking was that building with just -msse4.2 would cause the existing
SSE 4.2 implementation to be used (without the function pointer). That's
admittedly a bit goofy because they'd miss out on the PCLMUL optimization,
but things at least don't get any worse for them.
I tried using branching for the runtime check, and this looks like the
way to go:
- Existing -msse4.2 builders will still call directly, but inside the
function there is a length check and only for long input will it do a
runtime check for pclmul.
- This smooths the way for -msse4.2 (and the equivalent on Arm) to
inline calls with short constant input (e.g. WAL insert lock),
although I've not done that here.
- This can be a simple starting point for consolidating runtime
checks, as was proposed for popcount in the AVX-512 CRC thread, but
with branching my model was Andres' sketch here:
/messages/by-id/20240731023918.ixsfbeuub6e76one@awork3.anarazel.de
--
John Naylor
Amazon Web Services
Attachments:
- v6-0001-Dispatch-CRC-computation-by-branching-rather-than.patch
- v6-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
- v6-0003-Improve-CRC32C-performance-on-x86_64.patch
On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
I tried using branching for the runtime check, and this looks like the
way to go:
- Existing -msse4.2 builders will still call directly, but inside the
function there is a length check and only for long input will it do a
runtime check for pclmul.
- This smooths the way for -msse4.2 (and the equivalent on Arm) to
inline calls with short constant input (e.g. WAL insert lock),
although I've not done that here.
- This can be a simple starting point for consolidating runtime
checks, as was proposed for popcount in the AVX-512 CRC thread, but
with branching my model was Andres' sketch here:
/messages/by-id/20240731023918.ixsfbeuub6e76one@awork3.anarazel.de
Oh, nifty. IIUC this should help avoid quite a bit of overhead even before
adding the PCLMUL stuff since it removes the function pointers for the
runtime-check builds.
While this needn't block this patch set, I do find the dispatch code to be
pretty complicated. Maybe we can improve that in the future by using
macros to auto-generate much of it. My concern here is less about this
particular patch set and more about the long term maintainability as we add
more and more stuff like it, each with its own tangled web of build and
dispatch rules.
--
nathan
On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
I tried using branching for the runtime check, and this looks like the
way to go:
- Existing -msse4.2 builders will still call directly, but inside the
function there is a length check and only for long input will it do a
runtime check for pclmul.
- This smooths the way for -msse4.2 (and the equivalent on Arm) to
inline calls with short constant input (e.g. WAL insert lock),
although I've not done that here.
- This can be a simple starting point for consolidating runtime
checks, as was proposed for popcount in the AVX-512 CRC thread, but
with branching my model was Andres' sketch here:
/messages/by-id/20240731023918.ixsfbeuub6e76one@awork3.anarazel.de
Oh, nifty. IIUC this should help avoid quite a bit of overhead even before
adding the PCLMUL stuff since it removes the function pointers for the
runtime-check builds.
I figured the same, but it doesn't seem to help on the microbenchmark.
(I also tried the pg_waldump benchmark you used, but I couldn't get it
working so I'm probably missing a step.)
master:
20
latency average = 3.958 ms
latency average = 3.860 ms
latency average = 3.857 ms
32
latency average = 5.921 ms
latency average = 5.166 ms
latency average = 5.128 ms
48
latency average = 6.384 ms
latency average = 6.022 ms
latency average = 5.991 ms
v6:
20
latency average = 4.084 ms
latency average = 3.879 ms
latency average = 3.896 ms
32
latency average = 5.533 ms
latency average = 5.536 ms
latency average = 5.573 ms
48
latency average = 6.201 ms
latency average = 6.205 ms
latency average = 6.111 ms
While this needn't block this patch set, I do find the dispatch code to be
pretty complicated. Maybe we can improve that in the future by using
macros to auto-generate much of it. My concern here is less about this
particular patch set and more about the long term maintainability as we add
more and more stuff like it, each with its own tangled web of build and
dispatch rules.
I think the runtime dispatch is fairly simple for this case. To my
taste, the worse maintainability hazard is the building. To make that
better, I'd do this:
- Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build
them unconditionally for each platform
- Initialize the runtime info by CPU platform and not other symbols
where possible (I guess anything needing AVX-512 will still be a mess)
- Put all hardware CRC .c files into a single file pg_crc32c_hw.c.
Define PG_CRC_HW8/4/2/1 macros in pg_crc32c.h that tell which
intrinsic to use by platform and have a separate pg_crc32c_hw_impl.h
header that has the simple loop with these macros. That header would
for now only be included into pg_crc32c_hw.c.
The first two could be done as part of this patch or soon after. The
third is a bit more invasive, but seems like a necessary prerequisite
for inlining on small constant input, to keep that from being a mess.
There's another potential irritation for maintenance too: I spent too
much time only adding pg_cpucap_initialize() to frontend main()
functions that need it. I should have just added them to every binary.
We don't add new programs often, but it's still less automatic than
the function pointer way, so I'm open to other ideas.
Attached v7 to fix CI failures.
--
John Naylor
Amazon Web Services
Attachments:
v7-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
v7-0003-Improve-CRC32C-performance-on-x86_64.patch
v7-0001-Dispatch-CRC-computation-by-branching-rather-than.patch
Hi John,
I added dispatch logic for the new pclmul version on top of your v5 (which seems outdated now and so I will refrain from posting a patch here). Could you take a look at the approach here to see if this makes sense? The logic in the pg_comp_crc32c_choose function is the main change and seems a lot cleaner to me.
See https://github.com/r-devulap/postgressql/pull/5/commits/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/
Raghuveer
Hi John,
Just a few comments on v7:
pg_cpucap_crc32c
I like the idea of moving all CPU runtime detection to a single module outside of implementations. This makes it easy to extend in the future and keeps the functionalities separate.
- Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
unconditionally for each platform
+1. Sounds perfect. We should also move the AVX-512 runtime detection for popcount here. The runtime detection function could also be marked with __attribute__((constructor)) so that it gets executed before main.
I think the runtime dispatch is fairly simple for this case.
+ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
+ {
+ ....
+ #ifdef HAVE_PCLMUL_RUNTIME
+ if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))
IMO, the pclmul and sse4.2 versions should be dispatched independently, and the dispatch logic should be moved out of pg_crc32c.h to its own pg_crc32c_dispatch.c file.
Also, thank you for taking the lead on developing this patch. Since the latest patch seems to be in good shape, I'm happy to provide feedback and review your work, or I can continue development based on any feedback. Please let me know which you'd prefer.
Raghuveer
On Wed, Feb 19, 2025 at 1:28 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
The runtime detection function could also be marked with __attribute__((constructor)) so that it gets executed before main.
Hmm! I didn't know about that, thanks! It works on old gcc/clang, so
that's good. I can't verify online if Arm etc. executes properly since
the execute button is greyed out, but I don't see why it wouldn't:
https://godbolt.org/z/3as9MGr3G
Then we could have:
#ifdef FRONTEND
pg_attribute_constructor()
#endif
void
pg_cpucap_initialize(void)
{
...
}
Now, there is no equivalent on MSVC, and workarounds are fragile [1].
Maybe we could only assert that initialization happened for the backend, and
for frontend either
- add a couple of manual initializations to the frontend programs
where we don't want to lose performance for non-gcc/clang.
- require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by
Thomas M.'s earlier findings on popcount (also SSE4.2) [2].
The first is less risky but less tidy.
I think the runtime dispatch is fairly simple for this case.
+ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
+ {
+ ....
+ #ifdef HAVE_PCLMUL_RUNTIME
+ if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))
IMO, the pclmul and sse4.2 versions should be dispatched independently, and the dispatch logic should be moved out of pg_crc32c.h to its own pg_crc32c_dispatch.c file.
That makes sense, but it does close the door on future inlining.
Also, thank you for taking the lead on developing this patch. Since the latest patch seems to be in good shape, I'm happy to provide feedback and review your work, or I can continue development based on any feedback. Please let me know which you'd prefer.
Sure. Depending on any feedback on the above constructor technique,
I'll work on the following, since I've prototyped most of the first
and the second is mostly copy-and-paste from your earlier work:
pg_cpucap_crc32c
I like the idea of moving all CPU runtime detection to a single module outside of implementations. This makes it easy to extend in the future and keeps the functionalities separate.
- Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
unconditionally for each platform
+1. Sounds perfect. We should also move the avx512 runtime detection of popcount here.
[1]: https://stackoverflow.com/questions/1113409/attribute-constructor-equivalent-in-vc
[2]: /messages/by-id/CA+hUKGKS64zJezV9y9mPcB-J0i+fLGiv3FAdwSH_3SCaVdrjyQ@mail.gmail.com
--
John Naylor
Amazon Web Services