Question about the Implementation of vector32_is_highbit_set on ARM

Started by Xiang Gaoabout 2 years ago4 messages
#1Xiang Gao
Xiang.Gao@arm.com

Hi all,
I have some questions about the implementation of vector32_is_highbit_set on arm.
Below is the comment and the implementation for this function.
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
*
* XXX x86 uses the same underlying type for 8-bit, 16-bit, and 32-bit
* integer elements, but Arm does not, hence the need for a separate
* function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
* check each 32-bit element, but that would require an additional mask
* operation on x86.
*/
#ifndef USE_NO_SIMD
static inline bool
vector32_is_highbit_set(const Vector32 v)
{
#if defined(USE_NEON)
return vector8_is_highbit_set((Vector8) v);
#else
return vector8_is_highbit_set(v);
#endif
}
#endif /* ! USE_NO_SIMD */

But I still don't understand why the vmaxvq_u32 intrinsic is not used on the arm platform.
We have used the macro USE_NEON to distinguish different platforms.
In addition, according to the "Arm Neoverse N1 Software Optimization Guide",
The vmaxvq_u32 intrinsic has half the latency of vmaxvq_u8 and twice the bandwidth.
So I think just use vmaxvq_u32 directly.

Any comments or feedback are welcome.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

#2John Naylor
johncnaylorls@gmail.com
In reply to: Xiang Gao (#1)
1 attachment(s)
Re: Question about the Implementation of vector32_is_highbit_set on ARM

On Wed, Nov 8, 2023 at 2:44 PM Xiang Gao <Xiang.Gao@arm.com> wrote:

* function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
* check each 32-bit element, but that would require an additional mask
* operation on x86.
*/

But I still don't understand why the vmaxvq_u32 intrinsic is not used on the arm platform.

The current use case expects all 1's or all 0's in a 32-bit lane. If
anyone tried using it for arbitrary values, vmaxvq_u32 could give a
different answer than on x86 using _mm_movemask_epi8, so I think
that's the origin of that comment. But it's still a maintenance hazard
as is, since x86 wouldn't work for arbitrary values. It seems the path
forward is to rename this function to vector32_is_any_lane_set(), as
in the attached (untested on Arm). That would allow each
implementation to use the most efficient path, whether it's by 8- or
32-bit lanes. If we someday needed to look at only the high bits, we
would need a new function that performed the necessary masking on x86.

It's possible this method could shave cycles on Arm in some 8-bit lane
cases where we don't actually care about the high bit specifically,
since the movemask equivalent is slow on that platform, but I haven't
looked yet.

Attachments:

v1-is_any_lane_set.patchtext/x-patch; charset=US-ASCII; name=v1-is_any_lane_set.patchDownload
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 59aa8245ed..f536905d4d 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -151,7 +151,7 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 		result = vector32_or(tmp1, tmp2);
 
 		/* see if there was a match */
-		if (vector32_is_highbit_set(result))
+		if (vector32_is_any_lane_set(result))
 		{
 			Assert(assert_result == true);
 			return true;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..40558bbca8 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -78,7 +78,7 @@ static inline bool vector8_has_zero(const Vector8 v);
 static inline bool vector8_has_le(const Vector8 v, const uint8 c);
 static inline bool vector8_is_highbit_set(const Vector8 v);
 #ifndef USE_NO_SIMD
-static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline bool vector32_is_any_lane_set(const Vector32 v);
 #endif
 
 /* arithmetic operations */
@@ -278,22 +278,21 @@ vector8_is_highbit_set(const Vector8 v)
 }
 
 /*
- * Exactly like vector8_is_highbit_set except for the input type, so it
- * looks at each byte separately.
+ * Return true if any 32-bit lane is set, otherwise false.
  *
- * XXX x86 uses the same underlying type for 8-bit, 16-bit, and 32-bit
- * integer elements, but Arm does not, hence the need for a separate
- * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
- * check each 32-bit element, but that would require an additional mask
- * operation on x86.
+ * XXX: We assume all lanes are either all zeros or all ones.
  */
 #ifndef USE_NO_SIMD
 static inline bool
-vector32_is_highbit_set(const Vector32 v)
+vector32_is_any_lane_set(const Vector32 v)
 {
 #if defined(USE_NEON)
-	return vector8_is_highbit_set((Vector8) v);
+	return vmaxvq_u32(v) > 0;
 #else
+	/*
+	 * There is no 32-bit version of _mm_movemask_epi8, but we can still use
+	 * the 8-bit version.
+	 */
 	return vector8_is_highbit_set(v);
 #endif
 }
#3Xiang Gao
Xiang.Gao@arm.com
In reply to: John Naylor (#2)
RE: Question about the Implementation of vector32_is_highbit_set on ARM

On Date: Mon, 20 Nov 2023 16:05:43PM +0700, John Naylor wrote:

On Wed, Nov 8, 2023 at 2:44=E2=80=AFPM Xiang Gao <Xiang.Gao@arm.com> wrote:

* function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i=

.e.

* check each 32-bit element, but that would require an additional mask
* operation on x86.
*/

But I still don't understand why the vmaxvq_u32 intrinsic is not used on=

the arm platform.

The current use case expects all 1's or all 0's in a 32-bit lane. If
anyone tried using it for arbitrary values, vmaxvq_u32 could give a
different answer than on x86 using _mm_movemask_epi8, so I think
that's the origin of that comment. But it's still a maintenance hazard
as is, since x86 wouldn't work for arbitrary values. It seems the path
forward is to rename this function to vector32_is_any_lane_set(), as
in the attached (untested on Arm). That would allow each
implementation to use the most efficient path, whether it's by 8- or
32-bit lanes. If we someday needed to look at only the high bits, we
would need a new function that performed the necessary masking on x86.

It's possible this method could shave cycles on Arm in some 8-bit lane
cases where we don't actually care about the high bit specifically,
since the movemask equivalent is slow on that platform, but I haven't
looked yet.

Thank you for your detailed explanation.
Can I do some testing and submit this patch?
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

#4John Naylor
johncnaylorls@gmail.com
In reply to: Xiang Gao (#3)
Re: Question about the Implementation of vector32_is_highbit_set on ARM

On Thu, Nov 23, 2023 at 4:29 PM Xiang Gao <Xiang.Gao@arm.com> wrote:

Thank you for your detailed explanation.
Can I do some testing and submit this patch?

Please do, thanks.