use ARM intrinsics in pg_lfind32() where available

Started by Nathan Bossartalmost 4 years ago29 messageshackers

nathandbossart@gmail.com

almost 4 years ago

Hi hackers,

This is a follow-up for recent changes that optimized [sub]xip lookups in
XidInMVCCSnapshot() on Intel hardware [0]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b6ef167 [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=37a6e5d. I've attached a patch that
uses ARM Advanced SIMD (Neon) intrinsic functions where available to speed
up the search. The approach is nearly identical to the SSE2 version, and
the usual benchmark [2]/messages/by-id/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com shows similar improvements.

writers head simd
8 866 836
16 849 833
32 782 822
64 846 833
128 805 821
256 722 739
512 529 674
768 374 608
1024 268 522

I've tested the patch on a recent macOS (M1 Pro) and Amazon Linux
(Graviton2), and I've confirmed that the instructions aren't used on a
Linux/Intel machine. I did add a new configure check to see if the
relevant intrinsics are available, but I didn't add a runtime check like
there is for the CRC instructions since the compilers I used support these
intrinsics by default. (I don't think a runtime check would work very well
with the inline function, anyway.) AFAICT these intrinsics are pretty
standard on aarch64, although IIUC the spec indicates that they are
technically optional. I suspect that a simple check for "aarch64" would be
sufficient, but I haven't investigated the level of compiler support yet.

Thoughts?

[0]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b6ef167
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=37a6e5d
[2]: /messages/by-id/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Nathan Bossart (#1)

Re: use ARM intrinsics in pg_lfind32() where available

Hi,

On 2022-08-19 13:08:29 -0700, Nathan Bossart wrote:

I've tested the patch on a recent macOS (M1 Pro) and Amazon Linux
(Graviton2), and I've confirmed that the instructions aren't used on a
Linux/Intel machine. I did add a new configure check to see if the
relevant intrinsics are available, but I didn't add a runtime check like
there is for the CRC instructions since the compilers I used support these
intrinsics by default. (I don't think a runtime check would work very well
with the inline function, anyway.) AFAICT these intrinsics are pretty
standard on aarch64, although IIUC the spec indicates that they are
technically optional. I suspect that a simple check for "aarch64" would be
sufficient, but I haven't investigated the level of compiler support yet.

Are you sure there's not an appropriate define for us to use here instead of a
configure test? E.g.

echo|cc -dM -P -E -|grep -iE 'arm|aarch'
...
#define __AARCH64_SIMD__ 1
...
#define __ARM_NEON 1
#define __ARM_NEON_FP 0xE
#define __ARM_NEON__ 1
..

I strikes me as non-scalable to explicitly test all the simd instructions we'd
use.

The story for the CRC checks is different because those instructions often
aren't available with the default compilation flags and aren't guaranteed to
be available at runtime.

Regards,

Andres

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: Andres Freund (#2)

Re: use ARM intrinsics in pg_lfind32() where available

On Fri, Aug 19, 2022 at 02:26:02PM -0700, Andres Freund wrote:

Are you sure there's not an appropriate define for us to use here instead of a
configure test? E.g.

echo|cc -dM -P -E -|grep -iE 'arm|aarch'
...
#define __AARCH64_SIMD__ 1
...
#define __ARM_NEON 1
#define __ARM_NEON_FP 0xE
#define __ARM_NEON__ 1
..

I strikes me as non-scalable to explicitly test all the simd instructions we'd
use.

Thanks for the pointer. GCC, Clang, and the Arm compiler all seem to
define __ARM_NEON, so here is a patch that uses that instead.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

John Naylor

john.naylor@enterprisedb.com

almost 4 years ago

In reply to: Nathan Bossart (#3)

Re: use ARM intrinsics in pg_lfind32() where available

On Sat, Aug 20, 2022 at 5:28 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Fri, Aug 19, 2022 at 02:26:02PM -0700, Andres Freund wrote:

Are you sure there's not an appropriate define for us to use here instead of a
configure test? E.g.

echo|cc -dM -P -E -|grep -iE 'arm|aarch'
...
#define __AARCH64_SIMD__ 1
...
#define __ARM_NEON 1
#define __ARM_NEON_FP 0xE
#define __ARM_NEON__ 1
..

I strikes me as non-scalable to explicitly test all the simd instructions we'd
use.

Thanks for the pointer. GCC, Clang, and the Arm compiler all seem to
define __ARM_NEON, so here is a patch that uses that instead.

Is this also ever defined on 32-bit? If so, is it safe, meaning the
compiler will not emit these instructions without additional flags?
I'm wondering if __aarch64__ would be clearer on that, and if we get
windows-on-arm support as has been proposed, could also add _M_ARM64.

I also see #if defined(__aarch64__) || defined(__aarch64) in our
codebase already, but I'm not sure what recognizes the latter.

--
John Naylor
EDB: http://www.enterprisedb.com

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: John Naylor (#4)

Re: use ARM intrinsics in pg_lfind32() where available

On Mon, Aug 22, 2022 at 11:50:35AM +0700, John Naylor wrote:

On Sat, Aug 20, 2022 at 5:28 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Thanks for the pointer. GCC, Clang, and the Arm compiler all seem to
define __ARM_NEON, so here is a patch that uses that instead.

Is this also ever defined on 32-bit? If so, is it safe, meaning the
compiler will not emit these instructions without additional flags?
I'm wondering if __aarch64__ would be clearer on that, and if we get
windows-on-arm support as has been proposed, could also add _M_ARM64.

I haven't been able to enable __ARM_NEON on 32-bit, but if it is somehow
possible, we should probably add an __aarch64__ check since functions like
vmaxvq_u32() do not appear to be available on 32-bit. I have been able to
compile for __aarch64__ without __ARM_NEON, so it might still be a good
idea to check for __ARM_NEON. So, to be safe, perhaps we should use
something like the following:

#if (defined(__aarch64__) || defined(__aarch64)) && defined(__ARM_NEON)

I also see #if defined(__aarch64__) || defined(__aarch64) in our
codebase already, but I'm not sure what recognizes the latter.

I'm not sure what uses the latter, either.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

John Naylor

john.naylor@enterprisedb.com

almost 4 years ago

In reply to: Nathan Bossart (#5)

Re: use ARM intrinsics in pg_lfind32() where available

On Tue, Aug 23, 2022 at 4:15 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Aug 22, 2022 at 11:50:35AM +0700, John Naylor wrote:

Is this also ever defined on 32-bit? If so, is it safe, meaning the
compiler will not emit these instructions without additional flags?
I'm wondering if __aarch64__ would be clearer on that, and if we get
windows-on-arm support as has been proposed, could also add _M_ARM64.

I haven't been able to enable __ARM_NEON on 32-bit, but if it is somehow
possible, we should probably add an __aarch64__ check since functions like
vmaxvq_u32() do not appear to be available on 32-bit. I have been able to
compile for __aarch64__ without __ARM_NEON, so it might still be a good
idea to check for __ARM_NEON.

The important thing is: if we compile with __aarch64__ as a target:
- Will the compiler emit the intended instructions from the intrinsics
without extra flags?
- Can a user on ARM64 ever get a runtime fault if the machine attempts
to execute NEON instructions? "I have been able to compile for
__aarch64__ without __ARM_NEON" doesn't really answer that question --
what exactly did this entail?

I also see #if defined(__aarch64__) || defined(__aarch64) in our
codebase already, but I'm not sure what recognizes the latter.

I'm not sure what uses the latter, either.

I took a quick look around at Debian code search, *BSD, Apple, and a
few other places, and I can't find it. Then, I looked at the
discussions around commit 5c7603c318872a42e "Add ARM64 (aarch64)
support to s_lock.h", and the proposed patch [1]/messages/by-id/1368448758.23422.12.camel@t520.redhat.com only had __aarch64__
. When it was committed, the platform was vaporware and I suppose we
included "__aarch64" as a prophylactic measure because no other reason
was given. It doesn't seem to exist anywhere, so unless someone can
demonstrate otherwise, I'm going to rip it out soon.

[1]: /messages/by-id/1368448758.23422.12.camel@t520.redhat.com

--
John Naylor
EDB: http://www.enterprisedb.com

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: John Naylor (#6)

Re: use ARM intrinsics in pg_lfind32() where available

On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:

The important thing is: if we compile with __aarch64__ as a target:
- Will the compiler emit the intended instructions from the intrinsics
without extra flags?

My testing with GCC and Clang did not require any extra flags. GCC appears
to enable it by default for aarch64 [0]https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html. AFAICT this is the case for Clang
as well, but that is based on the code and my testing (I couldn't find any
documentation for this).

- Can a user on ARM64 ever get a runtime fault if the machine attempts
to execute NEON instructions?

IIUC yes, although I'm not sure how likely it is in practice.

"I have been able to compile for
__aarch64__ without __ARM_NEON" doesn't really answer that question --
what exactly did this entail?

Compiling with something like -march=armv8-a+nosimd prevents defining
__ARM_NEON. Interestingly, Clang still defines __ARM_NEON__ even when
+nosimd is specified.

I took a quick look around at Debian code search, *BSD, Apple, and a
few other places, and I can't find it. Then, I looked at the
discussions around commit 5c7603c318872a42e "Add ARM64 (aarch64)
support to s_lock.h", and the proposed patch [1] only had __aarch64__
. When it was committed, the platform was vaporware and I suppose we
included "__aarch64" as a prophylactic measure because no other reason
was given. It doesn't seem to exist anywhere, so unless someone can
demonstrate otherwise, I'm going to rip it out soon.

This is what I found, too, so +1. I've attached a patch for this.

[0]: https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

John Naylor

john.naylor@enterprisedb.com

almost 4 years ago

In reply to: Nathan Bossart (#7)

Re: use ARM intrinsics in pg_lfind32() where available

On Thu, Aug 25, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:

The important thing is: if we compile with __aarch64__ as a target:
- Will the compiler emit the intended instructions from the intrinsics
without extra flags?

My testing with GCC and Clang did not require any extra flags. GCC appears
to enable it by default for aarch64 [0]. AFAICT this is the case for Clang
as well, but that is based on the code and my testing (I couldn't find any
documentation for this).

I guess you meant this part: "‘simd’ Enable Advanced SIMD
instructions. This also enables floating-point instructions. This is
on by default for all possible values for options -march and -mcpu."

- Can a user on ARM64 ever get a runtime fault if the machine attempts
to execute NEON instructions?

IIUC yes, although I'm not sure how likely it is in practice.

Given the quoted part above, it doesn't seem likely, but we should try
to find out for sure, because a runtime fault is surely not acceptable
even on a toy system.

"I have been able to compile for
__aarch64__ without __ARM_NEON" doesn't really answer that question --
what exactly did this entail?

Compiling with something like -march=armv8-a+nosimd prevents defining
__ARM_NEON.

Okay, that's unsurprising.

Interestingly, Clang still defines __ARM_NEON__ even when
+nosimd is specified.

POLA violation, but if no one has complained to them, it's a good bet
the instructions are always available.

I took a quick look around at Debian code search, *BSD, Apple, and a
few other places, and I can't find it. Then, I looked at the
discussions around commit 5c7603c318872a42e "Add ARM64 (aarch64)
support to s_lock.h", and the proposed patch [1] only had __aarch64__
. When it was committed, the platform was vaporware and I suppose we
included "__aarch64" as a prophylactic measure because no other reason
was given. It doesn't seem to exist anywhere, so unless someone can
demonstrate otherwise, I'm going to rip it out soon.

This is what I found, too, so +1. I've attached a patch for this.

Thanks, I'll push this soon. I wondered if the same reasoning applies
to __arm__ / __arm nowadays, but a quick search does indicate that
__arm exists (existed?), at least.

--
John Naylor
EDB: http://www.enterprisedb.com

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: John Naylor (#8)

Re: use ARM intrinsics in pg_lfind32() where available

On Thu, Aug 25, 2022 at 10:38:34AM +0700, John Naylor wrote:

On Thu, Aug 25, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:

- Can a user on ARM64 ever get a runtime fault if the machine attempts
to execute NEON instructions?

IIUC yes, although I'm not sure how likely it is in practice.

Given the quoted part above, it doesn't seem likely, but we should try
to find out for sure, because a runtime fault is surely not acceptable
even on a toy system.

The ARM literature appears to indicate that Neon support is pretty standard
on aarch64, and AFAICT it's pretty common to just assume it's available.
As originally suspected, I believe that simply checking for __aarch64__
would be sufficient, but I don't think it would be unreasonable to also
check for __ARM_NEON to be safe.

Interestingly, Clang still defines __ARM_NEON__ even when
+nosimd is specified.

POLA violation, but if no one has complained to them, it's a good bet
the instructions are always available.

Sorry, I should've been more specific. In my testing, I could include or
omit __ARM_NEON using +[no]simd, but __ARM_NEON__ (with two underscores at
the end) was always there. My brief research seems to indicate this might
be unique to Darwin, but in the end, it looks like __ARM_NEON (without the
trailing underscores) is the most widely used.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#10

John Naylor

john.naylor@enterprisedb.com

almost 4 years ago

In reply to: Nathan Bossart (#9)

Re: use ARM intrinsics in pg_lfind32() where available

On Thu, Aug 25, 2022 at 11:57 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Thu, Aug 25, 2022 at 10:38:34AM +0700, John Naylor wrote:

On Thu, Aug 25, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:

- Can a user on ARM64 ever get a runtime fault if the machine attempts
to execute NEON instructions?

IIUC yes, although I'm not sure how likely it is in practice.

Given the quoted part above, it doesn't seem likely, but we should try
to find out for sure, because a runtime fault is surely not acceptable
even on a toy system.

The ARM literature appears to indicate that Neon support is pretty standard
on aarch64, and AFAICT it's pretty common to just assume it's available.

This doesn't exactly rise to the level of "find out for sure", so I
went looking myself. This is the language I found [1]https://developer.arm.com/documentation/den0024/a/AArch64-Floating-point-and-NEON:

"Both floating-point and NEON are required in all standard ARMv8
implementations. However, implementations targeting specialized
markets may support the following combinations:

No NEON or floating-point.
Full floating-point and SIMD support with exception trapping.
Full floating-point and SIMD support without exception trapping."

Since we assume floating-point, I see no reason not to assume NEON,
but a case could be made for documenting that we require NEON on
aarch64, in addition to exception trapping (for CRC runtime check) and
floating point on any Arm. Or even just say "standard". I don't
believe anyone will want to run Postgres on specialized hardware
lacking these features, so maybe it's a moot point.

[1]: https://developer.arm.com/documentation/den0024/a/AArch64-Floating-point-and-NEON

--
John Naylor
EDB: http://www.enterprisedb.com

#11

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: John Naylor (#10)

Re: use ARM intrinsics in pg_lfind32() where available

On Fri, Aug 26, 2022 at 10:45:10AM +0700, John Naylor wrote:

On Thu, Aug 25, 2022 at 11:57 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

The ARM literature appears to indicate that Neon support is pretty standard
on aarch64, and AFAICT it's pretty common to just assume it's available.

This doesn't exactly rise to the level of "find out for sure", so I
went looking myself. This is the language I found [1]:

"Both floating-point and NEON are required in all standard ARMv8
implementations. However, implementations targeting specialized
markets may support the following combinations:

No NEON or floating-point.
Full floating-point and SIMD support with exception trapping.
Full floating-point and SIMD support without exception trapping."

Sorry, I should've linked to the documentation I found. I saw similar
language in a couple of manuals, which is what led me to the conclusion
that Neon support is relatively standard.

Since we assume floating-point, I see no reason not to assume NEON,
but a case could be made for documenting that we require NEON on
aarch64, in addition to exception trapping (for CRC runtime check) and
floating point on any Arm. Or even just say "standard". I don't
believe anyone will want to run Postgres on specialized hardware
lacking these features, so maybe it's a moot point.

I'm okay with assuming Neon support for now. It's probably easier to add
the __ARM_NEON check if/when someone complains than it is to justify
removing it once it's there.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#12

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: Nathan Bossart (#11)

Re: use ARM intrinsics in pg_lfind32() where available

Here is a new patch set that applies on top of v9-0001 in the
json_lex_string patch set [0]/messages/by-id/CAFBsxsFV4v802idV0-Bo=V7wLMHRbOZ4er0hgposhyGCikmVGA@mail.gmail.com and v3 of the is_valid_ascii patch [1]/messages/by-id/CAFBsxsFFAZ6acUfyUALiem4DpCW=ApXbF02zrc0G0oT9CPof0Q@mail.gmail.com.

[0]: /messages/by-id/CAFBsxsFV4v802idV0-Bo=V7wLMHRbOZ4er0hgposhyGCikmVGA@mail.gmail.com
[1]: /messages/by-id/CAFBsxsFFAZ6acUfyUALiem4DpCW=ApXbF02zrc0G0oT9CPof0Q@mail.gmail.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#13

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: Nathan Bossart (#12)

Re: use ARM intrinsics in pg_lfind32() where available

On Thu, Aug 25, 2022 at 11:13:47PM -0700, Nathan Bossart wrote:

Here is a new patch set that applies on top of v9-0001 in the
json_lex_string patch set [0] and v3 of the is_valid_ascii patch [1].

Here is a rebased patch set that applies to HEAD.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#14

John Naylor

john.naylor@enterprisedb.com

almost 4 years ago

In reply to: Nathan Bossart (#13)

Re: use ARM intrinsics in pg_lfind32() where available

On Sat, Aug 27, 2022 at 1:24 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Here is a rebased patch set that applies to HEAD.

0001:

#define USE_NO_SIMD
typedef uint64 Vector8;
+typedef uint64 Vector32;
#endif

I don't forsee any use of emulating vector registers with uint64 if
they only hold two ints. I wonder if it'd be better if all vector32
functions were guarded with #ifndef NO_USE_SIMD. (I wonder if
declarations without definitions cause warnings...)

+ * NB: This function assumes that each lane in the given vector either has all
+ * bits set or all bits zeroed, as it is mainly intended for use with
+ * operations that produce such vectors (e.g., vector32_eq()).  If this
+ * assumption is not true, this function's behavior is undefined.
+ */

Hmm?

Also, is_highbit_set() already has uses same intrinsic and has the
same intended effect, since we only care about the boolean result.

0002:

-#elif defined(USE_SSE2)
+#elif defined(USE_SSE2) || defined(USE_NEON)

I think we can just say #else.

-#if defined(USE_SSE2)
- __m128i sub;
+#ifndef USE_NO_SIMD
+ Vector8 sub;

+#elif defined(USE_NEON)
+
+ /* use the same approach as the USE_SSE2 block above */
+ sub = vqsubq_u8(v, vector8_broadcast(c));
+ result = vector8_has_zero(sub);

I think we should invent a helper that does saturating subtraction and
call that, inlining the sub var so we don't need to mess with it
further.

Otherwise seems fine.

--
John Naylor
EDB: http://www.enterprisedb.com

#15

Tom Lane

tgl@sss.pgh.pa.us

almost 4 years ago

In reply to: John Naylor (#14)

Re: use ARM intrinsics in pg_lfind32() where available

I spent a bit more time researching the portability implications of
this patch.  I think that we should check __ARM_NEON before #including
<arm_neon.h>; there is authoritative documentation out there telling
you to, eg [1], and I can see no upside at all to not checking.
We cannot check *only* __ARM_NEON, though.  I found it to get defined
by clang 8.0.0 in my Fedora 30 32-bit image, although that does not
provide all the instructions we want (I see "undefined function"
complaints for vmaxvq_u8 etc if I try to make it use the patch).
Looking into that installation's <arm_neon.h>, those functions are
defined conditionally if "__ARM_FP & 2", which is kind of interesting
--- per [1], that bit indicates support for 16-bit floating point,
which seems a mite unrelated.

It appears from the info at [2]http://micro-os-plus.github.io/develop/predefined-macros/ that there are at least some 32-bit
ARM platforms that set that bit, implying (if the clang authors are
well informed) that they have the instructions we want. But we
could not realistically make 32-bit builds that try to use those
instructions without a run-time test; such a build would fail for
too many people. I doubt that a run-time test is worth the trouble,
so I concur with the idea of selecting NEON on aarch64 only and hoping
to thereby avoid a runtime test.

In short, I think the critical part of 0002 needs to look more like
this:

+#elif defined(__aarch64__) && defined(__ARM_NEON)
+/*
+ * We use the Neon instructions if the compiler provides access to them
+ * (as indicated by __ARM_NEON) and we are on aarch64.  While Neon support is
+ * technically optional for aarch64, it appears that all available 64-bit
+ * hardware does have it.  Neon exists in some 32-bit hardware too, but
+ * we could not realistically use it there without a run-time check,
+ * which seems not worth the trouble for now.
+ */
+#include <arm_neon.h>
+#define USE_NEON
...

Coding like this appears to work on both my Apple M1 and my Raspberry
Pi, with several different OSes checked on the latter.

regards, tom lane

[1]: https://developer.arm.com/documentation/101754/0618/armclang-Reference/Other-Compiler-specific-Features/Predefined-macros
[2]: http://micro-os-plus.github.io/develop/predefined-macros/

#16

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: John Naylor (#14)

Re: use ARM intrinsics in pg_lfind32() where available

Thanks for taking a look.

On Sat, Aug 27, 2022 at 01:59:06PM +0700, John Naylor wrote:

I don't forsee any use of emulating vector registers with uint64 if
they only hold two ints. I wonder if it'd be better if all vector32
functions were guarded with #ifndef NO_USE_SIMD. (I wonder if
declarations without definitions cause warnings...)

Yeah. I was a bit worried about the readability of this file with so many
#ifndefs, but after trying it out, I suppose it doesn't look _too_ bad.

+ * NB: This function assumes that each lane in the given vector either has all
+ * bits set or all bits zeroed, as it is mainly intended for use with
+ * operations that produce such vectors (e.g., vector32_eq()).  If this
+ * assumption is not true, this function's behavior is undefined.
+ */

Hmm?

Yup. The problem is that AFAICT there's no equivalent to
_mm_movemask_epi8() on aarch64, so you end up with something like

vmaxvq_u8(vandq_u8(v, vector8_broadcast(0x80))) != 0

But for pg_lfind32(), we really just want to know if any lane is set, which
only requires a call to vmaxvq_u32(). I haven't had a chance to look too
closely, but my guess is that this ultimately results in an extra AND
operation in the aarch64 path, so maybe it doesn't impact performance too
much. The other option would be to open-code the intrinsic function calls
into pg_lfind.h. I'm trying to avoid the latter, but maybe it's the right
thing to do for now... What do you think?

-#elif defined(USE_SSE2)
+#elif defined(USE_SSE2) || defined(USE_NEON)

I think we can just say #else.

Yes.

-#if defined(USE_SSE2)
- __m128i sub;
+#ifndef USE_NO_SIMD
+ Vector8 sub;
+#elif defined(USE_NEON)
+
+ /* use the same approach as the USE_SSE2 block above */
+ sub = vqsubq_u8(v, vector8_broadcast(c));
+ result = vector8_has_zero(sub);
I think we should invent a helper that does saturating subtraction and
call that, inlining the sub var so we don't need to mess with it
further.

Good idea, will do.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#17

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: Tom Lane (#15)

Re: use ARM intrinsics in pg_lfind32() where available

On Sat, Aug 27, 2022 at 05:18:34PM -0400, Tom Lane wrote:

In short, I think the critical part of 0002 needs to look more like
this:

+#elif defined(__aarch64__) && defined(__ARM_NEON)
+/*
+ * We use the Neon instructions if the compiler provides access to them
+ * (as indicated by __ARM_NEON) and we are on aarch64.  While Neon support is
+ * technically optional for aarch64, it appears that all available 64-bit
+ * hardware does have it.  Neon exists in some 32-bit hardware too, but
+ * we could not realistically use it there without a run-time check,
+ * which seems not worth the trouble for now.
+ */
+#include <arm_neon.h>
+#define USE_NEON
...

Thank you for the analysis! I'll do it this way in the next patch set.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#18

Thomas Munro

thomas.munro@gmail.com

almost 4 years ago

In reply to: Nathan Bossart (#16)

Re: use ARM intrinsics in pg_lfind32() where available

On Sun, Aug 28, 2022 at 10:12 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

Yup. The problem is that AFAICT there's no equivalent to
_mm_movemask_epi8() on aarch64, so you end up with something like

vmaxvq_u8(vandq_u8(v, vector8_broadcast(0x80))) != 0

But for pg_lfind32(), we really just want to know if any lane is set, which
only requires a call to vmaxvq_u32(). I haven't had a chance to look too
closely, but my guess is that this ultimately results in an extra AND
operation in the aarch64 path, so maybe it doesn't impact performance too
much. The other option would be to open-code the intrinsic function calls
into pg_lfind.h. I'm trying to avoid the latter, but maybe it's the right
thing to do for now... What do you think?

Ahh, this gives me a flashback to John's UTF-8 validation thread[1]/messages/by-id/CA+hUKGJjyXvS6W05kRVpH6Kng50=uOGxyiyjgPKm707JxQYHCg@mail.gmail.com
(the beginner NEON hackery in there was just a learning exercise,
sadly not followed up with real patches...). He had
_mm_movemask_epi8(v) != 0 which I first translated to
to_bool(bitwise_and(v, vmovq_n_u8(0x80))) and he pointed out that
vmaxvq_u8(v) > 0x7F has the right effect without the and.

[1]: /messages/by-id/CA+hUKGJjyXvS6W05kRVpH6Kng50=uOGxyiyjgPKm707JxQYHCg@mail.gmail.com

#19

Nathan Bossart

nathandbossart@gmail.com

almost 4 years ago

In reply to: Thomas Munro (#18)

Re: use ARM intrinsics in pg_lfind32() where available

On Sun, Aug 28, 2022 at 10:39:09AM +1200, Thomas Munro wrote:

On Sun, Aug 28, 2022 at 10:12 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

Yup. The problem is that AFAICT there's no equivalent to
_mm_movemask_epi8() on aarch64, so you end up with something like

vmaxvq_u8(vandq_u8(v, vector8_broadcast(0x80))) != 0

But for pg_lfind32(), we really just want to know if any lane is set, which
only requires a call to vmaxvq_u32(). I haven't had a chance to look too
closely, but my guess is that this ultimately results in an extra AND
operation in the aarch64 path, so maybe it doesn't impact performance too
much. The other option would be to open-code the intrinsic function calls
into pg_lfind.h. I'm trying to avoid the latter, but maybe it's the right
thing to do for now... What do you think?

Ahh, this gives me a flashback to John's UTF-8 validation thread[1]
(the beginner NEON hackery in there was just a learning exercise,
sadly not followed up with real patches...). He had
_mm_movemask_epi8(v) != 0 which I first translated to
to_bool(bitwise_and(v, vmovq_n_u8(0x80))) and he pointed out that
vmaxvq_u8(v) > 0x7F has the right effect without the and.