a couple of small patches for simd.h
0001 essentially reverts commit c6a43c2, and instead fixes the problem
(MSVC dislikes casts to the same type) by omitting the cast in the
problematic line in pg_lfind32(). While working on optimizing hex_encode()
and hex_decode() [0]https://commitfest.postgresql.org/patch/5538/, I noticed that implicit conversions sufficed.
0002 optimizes vector8_has_le() on AArch64 by using vminvq_u8(). I needed
vector8_has_ge() for the hex_encode()/hex_decode() work and noticed this
opportunity.
[0]: https://commitfest.postgresql.org/patch/5538/
--
nathan
Attachments:
v1-0001-Remove-vector32_is_highbit_set.patchtext/plain; charset=us-asciiDownload
From 699f0b0942c5390a46aa84ada95e5d38a9f21a10 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 22 Sep 2025 16:02:10 -0500
Subject: [PATCH v1 1/2] Remove vector32_is_highbit_set().
---
src/include/port/pg_lfind.h | 2 +-
src/include/port/simd.h | 23 -----------------------
2 files changed, 1 insertion(+), 24 deletions(-)
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 20f7497dcb7..0f460472601 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -139,7 +139,7 @@ pg_lfind32_simd_helper(const Vector32 keys, const uint32 *base)
result = vector32_or(tmp1, tmp2);
/* return whether there was a match */
- return vector32_is_highbit_set(result);
+ return vector8_is_highbit_set(result);
}
#endif /* ! USE_NO_SIMD */
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..6424e8f5a97 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -78,7 +78,6 @@ static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
-static inline bool vector32_is_highbit_set(const Vector32 v);
static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
@@ -279,28 +278,6 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
-/*
- * Exactly like vector8_is_highbit_set except for the input type, so it
- * looks at each byte separately.
- *
- * XXX x86 uses the same underlying type for 8-bit, 16-bit, and 32-bit
- * integer elements, but Arm does not, hence the need for a separate
- * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
- * check each 32-bit element, but that would require an additional mask
- * operation on x86.
- */
-#ifndef USE_NO_SIMD
-static inline bool
-vector32_is_highbit_set(const Vector32 v)
-{
-#if defined(USE_NEON)
- return vector8_is_highbit_set((Vector8) v);
-#else
- return vector8_is_highbit_set(v);
-#endif
-}
-#endif /* ! USE_NO_SIMD */
-
/*
* Return a bitmask formed from the high-bit of each element.
*/
--
2.39.5 (Apple Git-154)
v1-0002-Optimize-vector8_has_le-on-AArch64.patchtext/plain; charset=us-asciiDownload
From 43b7c5f60e309c3404771aee2ca9488790db56de Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 22 Sep 2025 16:17:09 -0500
Subject: [PATCH v1 2/2] Optimize vector8_has_le() on AArch64.
---
src/include/port/simd.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 6424e8f5a97..a594f1c1de3 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -249,6 +249,8 @@ vector8_has_le(const Vector8 v, const uint8 c)
}
}
}
+#elif defined(USE_NEON)
+ result = vminvq_u8(v) <= c;
#else
/*
--
2.39.5 (Apple Git-154)
On Tue, Sep 23, 2025 at 4:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
0001 essentially reverts commit c6a43c2, and instead fixes the problem
(MSVC dislikes casts to the same type) by omitting the cast in the
problematic line in pg_lfind32(). While working on optimizing hex_encode()
and hex_decode() [0], I noticed that implicit conversions sufficed.
I don't remember why that cast was there, but I suggest testing on gcc
/ aarch64 if you haven't already.
As for 0002, +1 for more succinct expressions when the hardware supports it.
--
John Naylor
Amazon Web Services
On Wed, Sep 24, 2025 at 08:04:59AM +0700, John Naylor wrote:
On Tue, Sep 23, 2025 at 4:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
0001 essentially reverts commit c6a43c2, and instead fixes the problem
(MSVC dislikes casts to the same type) by omitting the cast in the
problematic line in pg_lfind32(). While working on optimizing hex_encode()
and hex_decode() [0], I noticed that implicit conversions sufficed.I don't remember why that cast was there, but I suggest testing on gcc
/ aarch64 if you haven't already.
Ah, I thought CI was testing that, but apparently it is not. After some
basic testing in godbolt.org, I see that leaving out the cast makes
gcc/arm64 unhappy, and adding the cast makes msvc/x64 unhappy. gcc has a
-flax-vector-conversions option that fixes it, but the documentation for
that option [0]https://gcc.gnu.org/onlinedocs/gcc/C-Dialect-Options.html#index-flax-vector-conversions cautions against using it. So, 0001 is bogus, and I need
to figure out how to get the hex_{encode,decode} patches working for
gcc/arm64...
[0]: https://gcc.gnu.org/onlinedocs/gcc/C-Dialect-Options.html#index-flax-vector-conversions
--
nathan
Here's a new version of 0002 with a modified SSE2 implementation, as
discussed elsewhere [0]/messages/by-id/aNWO7L43UevRErw_@nathan. This allows us to remove vector8_ssub().
[0]: /messages/by-id/aNWO7L43UevRErw_@nathan
--
nathan
Attachments:
v2-0001-Optimize-vector8_has_le-on-AArch64.patchtext/plain; charset=us-asciiDownload
From a644f049add68ea78326f88d8994898c92f23c20 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 22 Sep 2025 16:17:09 -0500
Subject: [PATCH v2 1/1] Optimize vector8_has_le() on AArch64.
---
src/include/port/simd.h | 32 ++++++--------------------------
1 file changed, 6 insertions(+), 26 deletions(-)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..e317e5fdfc0 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -86,7 +86,6 @@ static inline uint32 vector8_highbit_mask(const Vector8 v);
static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
#ifndef USE_NO_SIMD
static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
-static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#endif
/*
@@ -250,14 +249,13 @@ vector8_has_le(const Vector8 v, const uint8 c)
}
}
}
-#else
+#elif defined(USE_SSE2)
+ Vector8 umin = _mm_min_epu8(v, vector8_broadcast(c));
+ Vector8 cmpe = _mm_cmpeq_epi8(umin, v);
- /*
- * Use saturating subtraction to find bytes <= c, which will present as
- * NUL bytes. This approach is a workaround for the lack of unsigned
- * comparison instructions on some architectures.
- */
- result = vector8_has_zero(vector8_ssub(v, vector8_broadcast(c)));
+ result = vector8_is_highbit_set(cmpe);
+#elif defined(USE_NEON)
+ result = vminvq_u8(v) <= c;
#endif
Assert(assert_result == result);
@@ -358,24 +356,6 @@ vector32_or(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
-/*
- * Return the result of subtracting the respective elements of the input
- * vectors using saturation (i.e., if the operation would yield a value less
- * than zero, zero is returned instead). For more information on saturation
- * arithmetic, see https://en.wikipedia.org/wiki/Saturation_arithmetic
- */
-#ifndef USE_NO_SIMD
-static inline Vector8
-vector8_ssub(const Vector8 v1, const Vector8 v2)
-{
-#ifdef USE_SSE2
- return _mm_subs_epu8(v1, v2);
-#elif defined(USE_NEON)
- return vqsubq_u8(v1, v2);
-#endif
-}
-#endif /* ! USE_NO_SIMD */
-
/*
* Return a vector with all bits set in each lane where the corresponding
* lanes in the inputs are equal.
--
2.39.5 (Apple Git-154)
On Thu, Sep 25, 2025 at 02:10:19PM -0500, Nathan Bossart wrote:
Here's a new version of 0002 with a modified SSE2 implementation, as
discussed elsewhere [0]. This allows us to remove vector8_ssub().
Sorry for the noise. v3 fixes the mixed-declarations-and-code problems.
--
nathan
Attachments:
v3-0001-Optimize-vector8_has_le-on-AArch64.patchtext/plain; charset=us-asciiDownload
From 2d355136c9b6aede2c1cb6f8f08fa7eacd032b61 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 22 Sep 2025 16:17:09 -0500
Subject: [PATCH v3 1/1] Optimize vector8_has_le() on AArch64.
---
src/include/port/simd.h | 37 ++++++++++---------------------------
1 file changed, 10 insertions(+), 27 deletions(-)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..79dd1f76962 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -86,7 +86,6 @@ static inline uint32 vector8_highbit_mask(const Vector8 v);
static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
#ifndef USE_NO_SIMD
static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
-static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#endif
/*
@@ -213,6 +212,10 @@ static inline bool
vector8_has_le(const Vector8 v, const uint8 c)
{
bool result = false;
+#ifdef USE_SSE2
+ Vector8 umin;
+ Vector8 cmpe;
+#endif
/* pre-compute the result for assert checking */
#ifdef USE_ASSERT_CHECKING
@@ -250,14 +253,12 @@ vector8_has_le(const Vector8 v, const uint8 c)
}
}
}
-#else
-
- /*
- * Use saturating subtraction to find bytes <= c, which will present as
- * NUL bytes. This approach is a workaround for the lack of unsigned
- * comparison instructions on some architectures.
- */
- result = vector8_has_zero(vector8_ssub(v, vector8_broadcast(c)));
+#elif defined(USE_SSE2)
+ umin = _mm_min_epu8(v, vector8_broadcast(c));
+ cmpe = _mm_cmpeq_epi8(umin, v);
+ result = vector8_is_highbit_set(cmpe);
+#elif defined(USE_NEON)
+ result = vminvq_u8(v) <= c;
#endif
Assert(assert_result == result);
@@ -358,24 +359,6 @@ vector32_or(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
-/*
- * Return the result of subtracting the respective elements of the input
- * vectors using saturation (i.e., if the operation would yield a value less
- * than zero, zero is returned instead). For more information on saturation
- * arithmetic, see https://en.wikipedia.org/wiki/Saturation_arithmetic
- */
-#ifndef USE_NO_SIMD
-static inline Vector8
-vector8_ssub(const Vector8 v1, const Vector8 v2)
-{
-#ifdef USE_SSE2
- return _mm_subs_epu8(v1, v2);
-#elif defined(USE_NEON)
- return vqsubq_u8(v1, v2);
-#endif
-}
-#endif /* ! USE_NO_SIMD */
-
/*
* Return a vector with all bits set in each lane where the corresponding
* lanes in the inputs are equal.
--
2.39.5 (Apple Git-154)
On Thu, Sep 25, 2025 at 04:27:35PM -0500, Nathan Bossart wrote:
On Thu, Sep 25, 2025 at 02:10:19PM -0500, Nathan Bossart wrote:
Here's a new version of 0002 with a modified SSE2 implementation, as
discussed elsewhere [0]. This allows us to remove vector8_ssub().Sorry for the noise. v3 fixes the mixed-declarations-and-code problems.
Barring objections, I plan to commit this soon. Probably tomorrow.
--
nathan
On Fri, Oct 3, 2025 at 12:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
Barring objections, I plan to commit this soon. Probably tomorrow.
LGTM.
--
John Naylor
Amazon Web Services