CRC32C Parallel Computation Optimization on ARM
Hi all
This patch uses a parallel computation optimization to improve CRC32C performance on ARM. The algorithm comes from the Intel whitepaper "crc-iscsi-polynomial-crc32-instruction-paper". Input data is divided into three equal-sized blocks: three parallel blocks (crc0, crc1, crc2) per 1024 bytes, where one block is 42 (BLK_LENGTH) * 8 (the step length of crc32c_u64) bytes.
Crc32c unit test: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
Crc32c benchmark: https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
It gets a ~2x speedup compared to serial use of the ARM CRC32C instructions.
I'll create a Commitfest ticket for this submission.
Any comments or feedback are welcome.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Attachments:
0001-crc32c-parallel-computation-optimization-on-arm.patch (application/octet-stream, +259 −13)
On Fri, Oct 20, 2023 at 07:08:58AM +0000, Xiang Gao wrote:
This patch uses a parallel computing optimization algorithm to
improve crc32c computing performance on ARM. The algorithm comes
from Intel whitepaper: crc-iscsi-polynomial-crc32-instruction-paper.
Input data is divided into three equal-sized blocks. Three parallel
blocks (crc0, crc1, crc2) for 1024 bytes. One block: 42 (BLK_LENGTH) *
8 (step length: crc32c_u64) bytes.
Crc32c unit test: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
Crc32c benchmark: https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
It gets ~2x speedup compared to linear Arm crc32c instructions.
Interesting. Could you attach to this thread the test files you
used and the results obtained, please? If this data gets deleted from
GitHub, it would not be possible to refer back to what you did and
the related benchmark results.
Note that your patch is forgetting about meson; it just patches
./configure.
--
Michael
On Fri, Oct 20, 2023 at 05:18:56PM +0900, Michael Paquier wrote:
On Fri, Oct 20, 2023 at 07:08:58AM +0000, Xiang Gao wrote:
This patch uses a parallel computing optimization algorithm to
improve crc32c computing performance on ARM. The algorithm comes
from Intel whitepaper: crc-iscsi-polynomial-crc32-instruction-paper.
Input data is divided into three equal-sized blocks. Three parallel
blocks (crc0, crc1, crc2) for 1024 bytes. One block: 42 (BLK_LENGTH) *
8 (step length: crc32c_u64) bytes.
Crc32c unit test: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
Crc32c benchmark: https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
It gets ~2x speedup compared to linear Arm crc32c instructions.
Interesting. Could you attach to this thread the test files you
used and the results obtained, please? If this data gets deleted from
GitHub, it would not be possible to refer back to what you did and
the related benchmark results.
Note that your patch is forgetting about meson; it just patches
./configure.
I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions). There was almost
no change for the 512-byte case, but there was a ~60% speedup for the
4096-byte case.
However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
benchmark [0]. I haven't had a chance to dig further, unfortunately.
Assuming I'm not doing something wrong, I don't think such a result should
necessarily disqualify this optimization, though.
[0]: /messages/by-id/ec487192-f6aa-509a-cacb-6642dad14209@iki.fi
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:
I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions). There was almost
no change for the 512-byte case, but there was a ~60% speedup for the
4096-byte case.
However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
benchmark [0]. I haven't had a chance to dig further, unfortunately.
Assuming I'm not doing something wrong, I don't think such a result should
necessarily disqualify this optimization, though.
Actually, since the pg_waldump benchmark likely only involves very small
WAL records, it would make sense that there isn't much difference.
*facepalm*
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 25/10/2023 00:18, Nathan Bossart wrote:
On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:
I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions). There was almost
no change for the 512-byte case, but there was a ~60% speedup for the
4096-byte case.
However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
benchmark [0]. I haven't had a chance to dig further, unfortunately.
Assuming I'm not doing something wrong, I don't think such a result should
necessarily disqualify this optimization, though.
Actually, since the pg_waldump benchmark likely only involves very small
WAL records, it would make sense that there isn't much difference.
*facepalm*
No need to guess, pg_waldump -z will tell you what the record size is.
And you can vary it by changing the checkpoint interval and/or pgbench
scale factor: if you checkpoint frequently or if the database is larger,
you get more full-page images which makes the records larger on average,
and vice versa.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Wed, Oct 25, 2023 at 12:37:45AM +0300, Heikki Linnakangas wrote:
On 25/10/2023 00:18, Nathan Bossart wrote:
Actually, since the pg_waldump benchmark likely only involves very small
WAL records, it would make sense that there isn't much difference.
*facepalm*
No need to guess, pg_waldump -z will tell you what the record size is. And
you can vary it by changing the checkpoint interval and/or pgbench scale
factor: if you checkpoint frequently or if the database is larger, you get
more full-page images which makes the records larger on average, and vice
versa.
If you are looking at computing the CRC of records with arbitrary
sizes, why not just generate a series with
pg_logical_emit_message() before doing a comparison with pg_waldump or
a custom replay loop to go through the records? At least that would
make the results more predictable.
--
Michael
On Wed, Oct 25, 2023 at 07:17:55AM +0900, Michael Paquier wrote:
If you are looking at computing the CRC of records with arbitrary
sizes, why not just generate a series with
pg_logical_emit_message() before doing a comparison with pg_waldump or
a custom replay loop to go through the records? At least that would
make the results more predictable.
I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).
Given the performance characteristics and relative simplicity of the patch,
I think this could be worth doing. I suspect we'll want to do something
similar for x86, too.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Thanks for your suggestion, this is the modified patch and two test files.
-----Original Message-----
From: Michael Paquier <michael@paquier.xyz>
Sent: Friday, October 20, 2023 4:19 PM
To: Xiang Gao <Xiang.Gao@arm.com>
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: CRC32C Parallel Computation Optimization on ARM
On Fri, Oct 20, 2023 at 07:08:58AM +0000, Xiang Gao wrote:
This patch uses a parallel computing optimization algorithm to improve
crc32c computing performance on ARM. The algorithm comes from Intel
whitepaper:
crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided
into three equal-sized blocks. Three parallel blocks (crc0, crc1,
crc2) for 1024 bytes. One block: 42 (BLK_LENGTH) * 8 (step length:
crc32c_u64) bytes.
Crc32c unit test:
https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
Crc32c benchmark:
https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
It gets ~2x speedup compared to linear Arm crc32c instructions.
Interesting. Could you attach to this thread the test files you used and the results obtained, please? If this data gets deleted from GitHub, it would not be possible to refer back to what you did and the related benchmark results.
Note that your patch is forgetting about meson; it just patches ./configure.
--
Michael
+pg_crc32c
+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)
It looks like most of this function is duplicated from
pg_comp_crc32c_armv8(). I understand that we probably need a separate
function because of the runtime check, but perhaps we could create a common
static inline helper function with a branch for when vmull_p64() can be
used. Its callers would then just provide a boolean to indicate which
branch to take.
+# Use ARM VMULL if available and ARM CRC32C intrinsic is available too.
+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+ if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+ USE_ARMV8_VMULL=1
+ fi
+fi
Hm. I wonder if we need to switch to a runtime check in some cases. For
example, what happens if the ARMv8 intrinsics used today are found with the
default compiler flags, but vmull_p64() is only available if
-march=armv8-a+crypto is added? It looks like the precedent is to use a
runtime check if we need extra CFLAGS to produce code that uses the
intrinsics.
Separately, I wonder if we should just always do runtime checks for the CRC
stuff whenever we can produce code with the intrinsics, regardless of
whether we need extra CFLAGS. The check doesn't look terribly expensive,
and it might allow us to simplify the code a bit (especially now that we
support a few different architectures).
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, 25 Oct, 2023 at 10:43:25 -0500, Nathan Bossart wrote:
+pg_crc32c
+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)
It looks like most of this function is duplicated from
pg_comp_crc32c_armv8(). I understand that we probably need a separate
function because of the runtime check, but perhaps we could create a common
static inline helper function with a branch for when vmull_p64() can be
used. Its callers would then just provide a boolean to indicate which
branch to take.
I have modified and remade the patch.
+# Use ARM VMULL if available and ARM CRC32C intrinsic is available too.
+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+    USE_ARMV8_VMULL=1
+  fi
+fi
Hm. I wonder if we need to switch to a runtime check in some cases. For
example, what happens if the ARMv8 intrinsics used today are found with the
default compiler flags, but vmull_p64() is only available if
-march=armv8-a+crypto is added? It looks like the precedent is to use a
runtime check if we need extra CFLAGS to produce code that uses the
intrinsics.
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can succeed.
A runtime check will be done when choosing which algorithm to use.
You can think of it as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Separately, I wonder if we should just always do runtime checks for the CRC
stuff whenever we can produce code with the intrinsics, regardless of
whether we need extra CFLAGS. The check doesn't look terribly expensive,
and it might allow us to simplify the code a bit (especially now that we
support a few different architectures).
Yes, I think so. USE_ARMV8_CRC32C only means that the compilation succeeded;
it does not guarantee that the code can run correctly on the local machine.
Therefore, a runtime check is required during actual operation.
Based on the principle of minimal changes, we plan to address that in the next patch.
If the community agrees, we will continue to improve it later, for example by merging the x86 and ARM code.
Attachments:
0003-crc32c-parallel-computation-optimization-on-arm.patch (application/octet-stream, +246 −15)
On Tue, 24 Oct 2023 20:45:39 -0500, Nathan Bossart wrote:
I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).
Could you please provide details on how to generate the ~8kB and ~16kB records? Thanks!
On Thu, Oct 26, 2023 at 2:23 PM Xiang Gao <Xiang.Gao@arm.com> wrote:
On Tue, 24 Oct 2023 20:45:39 -0500, Nathan Bossart wrote:
I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).
Could you please provide details on how to generate the ~8kB and ~16kB records? Thanks!
Here's a script that I use to generate WAL records of various sizes,
change it to taste if useful:
for m in 16 64 256 1024 4096 8192 16384
do
    echo "Start of run with WAL size $m bytes at:"
    date
    echo "SELECT pg_logical_emit_message(true, 'mymessage', repeat('d', $m));" \
        > "$JUMBO/scripts/dumbo$m.sql"
    for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096
    do
        "$PGWORKSPACE/pgbench" -n postgres -c"$c" -j"$c" -T60 \
            -f "$JUMBO/scripts/dumbo$m.sql" > "$JUMBO/results/dumbo$m:$c.out"
    done
    echo "End of run with WAL size $m bytes at:"
    date
    echo
done
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 26, 2023 at 07:28:35AM +0000, Xiang Gao wrote:
On Wed, 25 Oct, 2023 at 10:43:25 -0500, Nathan Bossart wrote:
+# Use ARM VMULL if available and ARM CRC32C intrinsic is available too.
+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+    USE_ARMV8_VMULL=1
+  fi
+fi
Hm. I wonder if we need to switch to a runtime check in some cases. For
example, what happens if the ARMv8 intrinsics used today are found with the
default compiler flags, but vmull_p64() is only available if
-march=armv8-a+crypto is added? It looks like the precedent is to use a
runtime check if we need extra CFLAGS to produce code that uses the
intrinsics.
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can succeed.
A runtime check will be done when choosing which algorithm to use.
You can think of it as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Oh. Looking again, I see that we are using a runtime check for ARM in all
cases with this patch. If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check). I suspect there are other
opportunities to simplify things, too.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 26, 2023 at 08:53:31AM +0000, Xiang Gao wrote:
On Tue, 24 Oct 2023 20:45:39 -0500, Nathan Bossart wrote:
I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).
Could you please provide details on how to generate the ~8kB and ~16kB records? Thanks!
I did something like
do $$
begin
for i in 1..1000000
loop
perform pg_logical_emit_message(false, 'test', repeat('0123456789', 800));
end loop;
end;
$$;
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, 26 Oct 2023 11:37:52 -0500, Nathan Bossart wrote:
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can be successful.
A runtime check will be done when choosing which algorithm.
You can think of us as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Oh. Looking again, I see that we are using a runtime check for ARM in all
cases with this patch. If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check). I suspect there are other
opportunities to simplify things, too.
Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.
Attachments:
0004-crc32c-parallel-computation-optimization-on-arm.patch (application/octet-stream, +288 −148)
On Fri, Oct 27, 2023 at 07:01:10AM +0000, Xiang Gao wrote:
On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can be successful.
A runtime check will be done when choosing which algorithm.
You can think of it as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Oh. Looking again, I see that we are using a runtime check for ARM in all
cases with this patch. If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check). I suspect there are other
opportunities to simplify things, too.
Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.
Thanks. I went ahead and split this prerequisite part out to a separate
thread [0] since it's sort-of unrelated to your proposal here. It's not
really a prerequisite, but I do think it will simplify things a bit.
[0]: /messages/by-id/20231030161706.GA3011@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Oct 30, 2023 at 11:21:43AM -0500, Nathan Bossart wrote:
On Fri, Oct 27, 2023 at 07:01:10AM +0000, Xiang Gao wrote:
On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can be successful.
A runtime check will be done when choosing which algorithm.
You can think of it as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
Oh. Looking again, I see that we are using a runtime check for ARM in all
cases with this patch. If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check). I suspect there are other
opportunities to simplify things, too.
Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.
Thanks. I went ahead and split this prerequisite part out to a separate
thread [0] since it's sort-of unrelated to your proposal here. It's not
really a prerequisite, but I do think it will simplify things a bit.
Per the other thread [0], we should try to avoid the runtime check when
possible, as it seems to produce a small regression. This means that if
the ARMv8 CRC instructions are found with the default compiler flags, we
can only use vmull_p64() if it can also be used with the default flags.
Otherwise, we can just do the runtime check.
[0]: /messages/by-id/2620794.1698783160@sss.pgh.pa.us
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, 31 Oct 2023 15:48:21 -0500, Nathan Bossart wrote:
Thanks. I went ahead and split this prerequisite part out to a separate
thread [0] since it's sort-of unrelated to your proposal here. It's not
really a prerequisite, but I do think it will simplify things a bit.
Per the other thread [0], we should try to avoid the runtime check when
possible, as it seems to produce a small regression. This means that if
the ARMv8 CRC instructions are found with the default compiler flags, we
can only use vmull_p64() if it can also be used with the default flags.
Otherwise, we can just do the runtime check.
After reading the discussion, I understand that in order to avoid performance
regressions in some instances, we need to try our best to avoid runtime checks.
I don't know if I understand it correctly.
If so, we need to check whether to use the ARM CRC32C and VMULL instructions
directly or with a runtime check. There will be many scenarios here and the code
will be more complicated.
Could you please give me some suggestions about how to refine this patch?
Thanks very much!
On Thu, Nov 02, 2023 at 06:17:20AM +0000, Xiang Gao wrote:
After reading the discussion, I understand that in order to avoid performance
regression in some instances, we need to try our best to avoid runtime checks.
I don't know if I understand it correctly.
The idea is that we don't want to start forcing runtime checks on builds
where we aren't already doing runtime checks. IOW if the compiler can use
the ARMv8 CRC instructions with the default compiler flags, we should only
use vmull_p64() if it can also be used with the default compiler flags. I
suspect this limitation sounds worse than it actually is in practice. The
vast majority of the buildfarm uses runtime checks, and at least some of
the platforms that don't, such as the Apple M-series machines, seem to
include +crypto by default.
Of course, if a compiler picks up +crc but not +crypto in its defaults, we
could lose the vmull_p64() optimization on that platform. But as noted in
the other thread, we can revisit if these kinds of hypothetical situations
become reality.
Could you please give me some suggestions about how to refine this patch?
Of course. I think we'll ultimately want to independently check for the
availability of the new instruction like we do for the other sets of
intrinsics:
PGAC_ARMV8_VMULL_INTRINSICS([])
if test x"$pgac_armv8_vmull_intrinsics" != x"yes"; then
PGAC_ARMV8_VMULL_INTRINSICS([-march=armv8-a+crypto])
fi
My current thinking is that we'll want to add USE_ARMV8_VMULL and
USE_ARMV8_VMULL_WITH_RUNTIME_CHECK and use those to decide exactly what to
compile. I'll admit I haven't fully thought through every detail yet, but
I'm cautiously optimistic that we can avoid too much complexity in the
autoconf/meson scripts.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Nov 02, 2023 at 09:35:50AM -0500, Nathan Bossart wrote:
On Thu, Nov 02, 2023 at 06:17:20AM +0000, Xiang Gao wrote:
After reading the discussion, I understand that in order to avoid performance
regression in some instances, we need to try our best to avoid runtime checks.
I don't know if I understand it correctly.
The idea is that we don't want to start forcing runtime checks on builds
where we aren't already doing runtime checks. IOW if the compiler can use
the ARMv8 CRC instructions with the default compiler flags, we should only
use vmull_p64() if it can also be used with the default compiler flags. I
suspect this limitation sounds worse than it actually is in practice. The
vast majority of the buildfarm uses runtime checks, and at least some of
the platforms that don't, such as the Apple M-series machines, seem to
include +crypto by default.
Of course, if a compiler picks up +crc but not +crypto in its defaults, we
could lose the vmull_p64() optimization on that platform. But as noted in
the other thread, we can revisit if these kinds of hypothetical situations
become reality.
Could you please give me some suggestions about how to refine this patch?
Of course. I think we'll ultimately want to independently check for the
availability of the new instruction like we do for the other sets of
intrinsics:
PGAC_ARMV8_VMULL_INTRINSICS([])
if test x"$pgac_armv8_vmull_intrinsics" != x"yes"; then
  PGAC_ARMV8_VMULL_INTRINSICS([-march=armv8-a+crypto])
fi
My current thinking is that we'll want to add USE_ARMV8_VMULL and
USE_ARMV8_VMULL_WITH_RUNTIME_CHECK and use those to decide exactly what to
compile. I'll admit I haven't fully thought through every detail yet, but
I'm cautiously optimistic that we can avoid too much complexity in the
autoconf/meson scripts.
Thank you so much!
This is the newest patch. I think the code for choosing which CRC algorithm to use is a bit complicated. Maybe we could use only USE_ARMV8_VMULL and always do a runtime check for the vmull_p64 instruction. This would not affect existing builds, because this is a new instruction and new logic. In addition, it would also reduce the complexity of the code.
Very much looking forward to receiving your suggestions, thank you!