always use runtime checks for CRC-32C instructions

Started by Nathan Bossartover 2 years ago13 messageshackers

nathandbossart@gmail.com

over 2 years ago

This is an offshoot of the "CRC32C Parallel Computation Optimization on
ARM" thread [0]/messages/by-id/DB9PR08MB6991329A73923BF8ED4B3422F5DBA@DB9PR08MB6991.eurprd08.prod.outlook.com. I intend for this to be a prerequisite patch set.

Presently, for the SSE 4.2 and ARMv8 CRC instructions used in the CRC32C
code for WAL records, etc., we first check if the intrinsics are available
with the default compiler flags. If so, we only bother compiling the
implementation that uses those intrinsics. If not, we also check whether
the intrinsics are available with some extra CFLAGS, and if they are, we
compile both the implementation that uses the intrinsics as well as a
fallback implementation that doesn't require any special instructions.
Then, at runtime, we check what's available in the hardware and choose the
appropriate CRC32C implementation.

The aforementioned other thread [0]/messages/by-id/DB9PR08MB6991329A73923BF8ED4B3422F5DBA@DB9PR08MB6991.eurprd08.prod.outlook.com aims to further optimize this code by
using another instruction that requires additional configure and/or runtime
checks. $SUBJECT has been in the back of my mind for a while, but given
proposals to add further complexity to this code, I figured it might be a
good time to propose this simplification. Specifically, I think we
shouldn't worry about trying to compile only the special instrinics
versions, and instead always try to build both and choose the appropriate
one at runtime.

AFAICT the trade-offs aren't too bad. With some simple testing, I see that
the runtime check occurs once at startup, so I don't anticipate any
noticeable performance impact. I suppose each process might need to do the
check in EXEC_BACKEND builds, but even so, I suspect the difference is
negligible.

I also see that the SSE 4.2 runtime check requires the CPUID instruction,
so we wouldn't use the instrinsics for hardware that supports SSE 4.2 but
not CPUID. However, I'm not sure such hardware even exists. Wikipedia
says that CPUID was introduced in 1993 [1]https://en.wikipedia.org/wiki/CPUID, and meson.build appears to omit
the CPUID check when determining which CRC32C implementation to use.
Furthermore, meson.build alludes to problems with some of the CPUID-related
checks:

# XXX: The configure.ac check for __cpuid() is broken, we don't copy that
# here. To prevent problems due to two detection methods working, stop
# checking after one.

Are there any other reasons that we should try to avoid the runtime check
when possible?

I've attached two patches. 0001 adds a debug message to the SSE 4.2
runtime check that matches the one already present for the ARMv8 check.
This message just notes whether the runtime check found that the special
CRC instructions are available. 0002 is a first attempt at $SUBJECT. I've
tested it on both x86 and ARM, and it seems to work as intended. You'll
notice that I'm still checking for the intrinsics with the default compiler
flags first. I didn't see any strong reason to change this, and doing so
allows us to avoid sending extra CFLAGS when possible.

Thoughts?

[0]: /messages/by-id/DB9PR08MB6991329A73923BF8ED4B3422F5DBA@DB9PR08MB6991.eurprd08.prod.outlook.com
[1]: https://en.wikipedia.org/wiki/CPUID

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Tom Lane

tgl@sss.pgh.pa.us

over 2 years ago

In reply to: Nathan Bossart (#1)

Re: always use runtime checks for CRC-32C instructions

Nathan Bossart <nathandbossart@gmail.com> writes:

The aforementioned other thread [0] aims to further optimize this code by
using another instruction that requires additional configure and/or runtime
checks. $SUBJECT has been in the back of my mind for a while, but given
proposals to add further complexity to this code, I figured it might be a
good time to propose this simplification. Specifically, I think we
shouldn't worry about trying to compile only the special instrinics
versions, and instead always try to build both and choose the appropriate
one at runtime.

On the one hand, I agree that we need to keep the complexity from
getting out of hand. On the other hand, I wonder if this approach
isn't optimizing for the wrong case. How many machines that PG 17
will ever be run on in production will lack SSE 4.2 (for Intel)
or ARMv8 instructions (on the ARM side)? It seems like a shame
to be burdening these instructions with a subroutine call for the
benefit of long-obsolete hardware versions. Maybe that overhead
is negligible, but it doesn't sound like you tried to measure it.

I'm not quite sure what to propose instead, though. I thought for
a little bit about a configure switch to select "test first" or
"pedal to the metal". But in practice, package builders would
probably have to select the conservative "test first" option; and
we know that the vast majority of modern installations use prebuilt
packages, so it's not clear that this answer would help many people.

Anyway, I agree that the cost of a one-time-per-process probe should
be negligible; it's the per-use cost that I worry about. If you can
do some measurements proving that that worry is ill-founded, then
I'm good with test-first.

regards, tom lane

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Tom Lane (#2)

Re: always use runtime checks for CRC-32C instructions

On Mon, 2023-10-30 at 12:39 -0400, Tom Lane wrote:

It seems like a shame
to be burdening these instructions with a subroutine call for the
benefit of long-obsolete hardware versions.

It's already doing a call to pg_comp_crc32c_sse42() regardless, right?

I assume you are concerned about the call going through a function
pointer? If so, is it possible that setting a flag and then branching
would be better?

Also, if it's a concern, should we also consider making an inlineable
version of pg_comp_crc32c_sse42()?

Regards,
Jeff Davis

Nathan Bossart

nathandbossart@gmail.com

over 2 years ago

In reply to: Tom Lane (#2)

Re: always use runtime checks for CRC-32C instructions

On Mon, Oct 30, 2023 at 12:39:23PM -0400, Tom Lane wrote:

On the one hand, I agree that we need to keep the complexity from
getting out of hand. On the other hand, I wonder if this approach
isn't optimizing for the wrong case. How many machines that PG 17
will ever be run on in production will lack SSE 4.2 (for Intel)
or ARMv8 instructions (on the ARM side)?

For the CRC instructions in use today, I wouldn't be surprised if that
number is pretty small, but for newer or optional instructions (like ARM's
PMULL), I don't think we'll be so lucky. Even if we do feel comfortable
assuming the presence of SSE 4.2, etc., we'll likely still need to add
runtime checks for future optimizations.

It seems like a shame
to be burdening these instructions with a subroutine call for the
benefit of long-obsolete hardware versions. Maybe that overhead
is negligible, but it doesn't sound like you tried to measure it.

When I went to measure this, I noticed that my relatively new x86 machine
with a relatively new version of gcc uses the runtime check. I then
skimmed through a few dozen buildfarm machines and found that, of all x86
and ARM machines that supported the specialized CRC instructions, only one
ARM machine did not use the runtime check. Of course, this is far from a
scientific data point, but it seems to indicate that the runtime check is
the norm.

(I still need to measure it.)

Anyway, I agree that the cost of a one-time-per-process probe should
be negligible; it's the per-use cost that I worry about. If you can
do some measurements proving that that worry is ill-founded, then
I'm good with test-first.

Will do.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Nathan Bossart

nathandbossart@gmail.com

over 2 years ago

In reply to: Jeff Davis (#3)

Re: always use runtime checks for CRC-32C instructions

On Mon, Oct 30, 2023 at 01:48:29PM -0700, Jeff Davis wrote:

I assume you are concerned about the call going through a function
pointer? If so, is it possible that setting a flag and then branching
would be better?

Also, if it's a concern, should we also consider making an inlineable
version of pg_comp_crc32c_sse42()?

I tested pg_waldump -z with 50M 65-byte records for the following
implementations on an ARM system:

* slicing-by-8 : ~3.08s
* proposed patches applied (runtime check) : ~2.44s
* only CRC intrinsics implementation compiled : ~2.42s
* forced inlining : ~2.38s

Avoiding the runtime check produced a 0.8% improvement, and forced inlining
produced another 1.7% improvement. In comparison, even the runtime check
implementation produced a 20.8% improvement over the slicing-by-8 one.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Nathan Bossart

nathandbossart@gmail.com

over 2 years ago

In reply to: Nathan Bossart (#5)

Re: always use runtime checks for CRC-32C instructions

On Mon, Oct 30, 2023 at 10:36:01PM -0500, Nathan Bossart wrote:

I tested pg_waldump -z with 50M 65-byte records for the following
implementations on an ARM system:

* slicing-by-8 : ~3.08s
* proposed patches applied (runtime check) : ~2.44s
* only CRC intrinsics implementation compiled : ~2.42s
* forced inlining : ~2.38s

Avoiding the runtime check produced a 0.8% improvement, and forced inlining
produced another 1.7% improvement. In comparison, even the runtime check
implementation produced a 20.8% improvement over the slicing-by-8 one.

After reflecting on these numbers for a bit, I think I'm still inclined to
do $SUBJECT. I considered the following:

* While it would be nice to gain a couple of percentage points for existing
hardware, I think we'll still end up doing runtime checks most of the
time once we add support for newer instructions.

* The performance improvements that the new instructions provide seem
likely to outweigh these small regressions, especially for workloads with
larger WAL records [0]/messages/by-id/20231025014539.GA977906@nathanxps13.

* From my quick scan of a few dozen machines on the buildfarm, it looks
like the runtime checks are already the norm, so the number of systems
that would be subject to a regression from v16 to v17 should be pretty
small, in theory. And this regression seems to be on the order of 1%
based on the numbers above.

Do folks think this is reasonable? Or should we instead try to squeeze
every last drop out of the current implementations by avoiding function
pointers, forcing inlining, etc.?

[0]: /messages/by-id/20231025014539.GA977906@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Tom Lane

tgl@sss.pgh.pa.us

over 2 years ago

In reply to: Nathan Bossart (#6)

Re: always use runtime checks for CRC-32C instructions

Nathan Bossart <nathandbossart@gmail.com> writes:

On Mon, Oct 30, 2023 at 10:36:01PM -0500, Nathan Bossart wrote:

I tested pg_waldump -z with 50M 65-byte records for the following
implementations on an ARM system:

* slicing-by-8 : ~3.08s
* proposed patches applied (runtime check) : ~2.44s
* only CRC intrinsics implementation compiled : ~2.42s
* forced inlining : ~2.38s

Avoiding the runtime check produced a 0.8% improvement, and forced inlining
produced another 1.7% improvement. In comparison, even the runtime check
implementation produced a 20.8% improvement over the slicing-by-8 one.

I find these numbers fairly concerning. If you can see a
couple-of-percent slowdown on a macroscopic benchmark like pg_waldump,
that implies that the percentage slowdown considering the CRC
operation alone is much worse. So there may be other use-cases where
we would take a bigger relative hit.

* From my quick scan of a few dozen machines on the buildfarm, it looks
like the runtime checks are already the norm, so the number of systems
that would be subject to a regression from v16 to v17 should be pretty
small, in theory. And this regression seems to be on the order of 1%
based on the numbers above.

I did a more thorough scrape of the buildfarm results. Of 161 animals
currently reporting configure output on HEAD, we have

2 ARMv8 CRC instructions
36 ARMv8 CRC instructions with runtime check
2 LoongArch CRCC instructions
2 SSE 4.2
52 SSE 4.2 with runtime check
67 slicing-by-8

While that'd seem to support your conclusion, the two using ARM CRC
*without* a runtime check are my Apple M1 Mac animals (sifaka/indri);
and I see the same selection on my laptop. So one platform where
we'd clearly be taking a regression is M-series Macs; that's a pretty
popular platform. The two using SSE without a check are prion and
tayra. I notice those are using gcc 11; so perhaps the default cflags
have changed to include -msse4.2 recently? I couldn't see much other
pattern though. (Scraping results attached in case anybody wants to
look.)

Really this just reinforces my concern that doing a runtime check
all the time is on the wrong side of history. I grant that we've
got to do that for anything where the availability of the instruction
is really in serious question, but I'm not very convinced that that's
a majority situation on popular platforms.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 2 years ago

In reply to: Tom Lane (#7)

Re: always use runtime checks for CRC-32C instructions

I wrote:

I did a more thorough scrape of the buildfarm results. Of 161 animals
currently reporting configure output on HEAD, we have

Oh ... take "current" with a grain of salt there, because I just noticed
that I typo'd my search condition so that it collected results from all
systems that reported since 2022-Oct, rather than in the last month as
I'd intended. There are just 137 animals currently reporting.

Of those, I broke down the architectures reporting using slicing-by-8:

The one machine using slicing-by-8 where there might be a better
alternative is arowana, which is CentOS 7 with a pretty ancient gcc
version. So I reject the idea that slicing-by-8 is an appropriate
baseline for comparisons. There isn't anybody who will see an
improvement over current behavior: in the population of interest,
just about all platforms are using CRC instructions with or without
a runtime check.

regards, tom lane

Nathan Bossart

nathandbossart@gmail.com

over 2 years ago

In reply to: Tom Lane (#7)

Re: always use runtime checks for CRC-32C instructions

On Tue, Oct 31, 2023 at 03:16:16PM -0400, Tom Lane wrote:

Really this just reinforces my concern that doing a runtime check
all the time is on the wrong side of history. I grant that we've
got to do that for anything where the availability of the instruction
is really in serious question, but I'm not very convinced that that's
a majority situation on popular platforms.

Okay. With that in mind, I think the path forward for new instructions is
as follows:

* If the special CRC instructions can be used with the default compiler
flags, we can only use newer instructions if they can also be used with
the default compiler flags. (My M2 machine appears to add +crypto by
default, so I bet your buildfarm animals would fall into this bucket.)
* Otherwise, if the CRC instructions can be used with added flags (i.e.,
the runtime check path), we can do a runtime check for the new
instructions as well. (Most other buildfarm animals would fall into this
bucket.)

Any platform that can use the CRC instructions with default compiler flags
but not the new instructions wouldn't be able to take advantage of the
proposed optimization, but it also wouldn't be subject to the small
performance regression.

If we wanted to further eliminate runtime checks for SSE 4.2 and ARMv8,
then I think things become a little trickier, as having a compiler that
understands things like +crypto would mean that you're automatically
subject to the runtime check regression (assuming we proceed with the
proposed optimization). An alternate approach could be to only use newer
instructions if they are available with the default compiler flags, but
given the current state of the buildfarm, such optimizations might not get
much uptake for a while.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#10

Nathan Bossart

nathandbossart@gmail.com

over 2 years ago

In reply to: Tom Lane (#8)

Re: always use runtime checks for CRC-32C instructions

On Tue, Oct 31, 2023 at 03:42:33PM -0400, Tom Lane wrote:

The one machine using slicing-by-8 where there might be a better
alternative is arowana, which is CentOS 7 with a pretty ancient gcc
version. So I reject the idea that slicing-by-8 is an appropriate
baseline for comparisons. There isn't anybody who will see an
improvement over current behavior: in the population of interest,
just about all platforms are using CRC instructions with or without
a runtime check.

I only included the slicing-by-8 benchmark to demonstrate that 1) the CRC
computations are a big portion of that pg_waldump -z command and that 2)
the CRC instructions provide significant performance gains.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#11

Tom Lane

tgl@sss.pgh.pa.us

over 2 years ago

In reply to: Nathan Bossart (#9)

Re: always use runtime checks for CRC-32C instructions

Nathan Bossart <nathandbossart@gmail.com> writes:

Okay. With that in mind, I think the path forward for new instructions is
as follows:

* If the special CRC instructions can be used with the default compiler
flags, we can only use newer instructions if they can also be used with
the default compiler flags. (My M2 machine appears to add +crypto by
default, so I bet your buildfarm animals would fall into this bucket.)
* Otherwise, if the CRC instructions can be used with added flags (i.e.,
the runtime check path), we can do a runtime check for the new
instructions as well. (Most other buildfarm animals would fall into this
bucket.)

This seems like a reasonable proposal.

Any platform that can use the CRC instructions with default compiler flags
but not the new instructions wouldn't be able to take advantage of the
proposed optimization, but it also wouldn't be subject to the small
performance regression.

Check. For now I think that's fine. If we get to a place where this
policy is really leaving a lot of performance on the table, we can
revisit it ... but that situation is hypothetical and may remain so.

(It's worth noting also that a package builder can move the goalposts
at will, since our idea of "default flags" is really whatever the user
says to use.)

regards, tom lane

#12

Nathan Bossart

nathandbossart@gmail.com

over 2 years ago

In reply to: Tom Lane (#11)

Re: always use runtime checks for CRC-32C instructions

On Tue, Oct 31, 2023 at 04:12:40PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Okay. With that in mind, I think the path forward for new instructions is
as follows:

* If the special CRC instructions can be used with the default compiler
flags, we can only use newer instructions if they can also be used with
the default compiler flags. (My M2 machine appears to add +crypto by
default, so I bet your buildfarm animals would fall into this bucket.)
* Otherwise, if the CRC instructions can be used with added flags (i.e.,
the runtime check path), we can do a runtime check for the new
instructions as well. (Most other buildfarm animals would fall into this
bucket.)

This seems like a reasonable proposal.

Great. I think that leaves us with nothing left to do for this thread, so
I'll withdraw it from the commitfest and move the discussion back to the
original thread.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#13

Nathan Bossart

nathandbossart@gmail.com

over 2 years ago

In reply to: Nathan Bossart (#12)

Re: always use runtime checks for CRC-32C instructions

On Tue, Oct 31, 2023 at 03:38:17PM -0500, Nathan Bossart wrote:

On Tue, Oct 31, 2023 at 04:12:40PM -0400, Tom Lane wrote:

This seems like a reasonable proposal.

Great. I think that leaves us with nothing left to do for this thread, so
I'll withdraw it from the commitfest and move the discussion back to the
original thread.

(Also, thanks for the discussion.)

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

always use runtime checks for CRC-32C instructions

Attachments:

Attachments: