introduce optimized linear search functions that return index of matching element

Started by Nathan Bossartabout 3 years ago2 messages
#1Nathan Bossart
Nathan Bossart
nathandbossart@gmail.com
1 attachment(s)

On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote:

v6 demonstrates why this should have been put off towards the end. (more below)

Since the SIMD code is fresh in my mind, I wanted to offer my review for
0001 in the "Improve dead tuple storage for lazy vacuum" thread [0]/messages/by-id/CAD21AoD3w76wERs_Lq7_uA6+gTaoOERPji+Yz8Ac6aui4JwvTg@mail.gmail.com.
However, I agree with John that the SIMD part of that work should be left
for the end, and I didn't want to distract from the radix tree part too
much. So, here is a new thread for just the SIMD part.

I've updated the radix tree patch. It's now separated into two patches.

0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names) that are similar to the pg_lfind8() family but they
return the index of the key in the vector instead of true/false. The
patch includes regression tests.

I don't think it's clear that the "lfind" functions return whether there is
a match while the "lsearch" functions return the index of the first match.
It might be better to call these something like "pg_lfind8_idx" and
"pg_lfind8_ge_idx" instead.

+/*
+ * Return the index of the first element in the vector that is greater than
+ * or eual to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.

That's a bizarre API to indicate non-existence.

+1. It should probably just return -1 in that case.

+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */

That is *completely* unacceptable for a general-purpose function.

+1

+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;

I don't think we should try to force the non-simd case to adopt the
special semantics of vector comparisons. It's much easier to just use
the same logic as the assert builds.

+1

+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+        1 << 0, 1 << 1, 1 << 2, 1 << 3,
+        1 << 4, 1 << 5, 1 << 6, 1 << 7,
+        1 << 0, 1 << 1, 1 << 2, 1 << 3,
+        1 << 4, 1 << 5, 1 << 6, 1 << 7,
+      };
+
+    uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t)
vshrq_n_s8(v, 7));
+    uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+    return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));

For Arm, we need to be careful here. This article goes into a lot of
detail for this situation:

https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon

The technique demonstrated in this article seems to work nicely.

For these kinds of patches, I find the best way to review them is to try
out my proposed changes as I'm reading through the patch. I hope you don't
mind that I've done so here and attached a new version of the patch. In
addition to addressing the aforementioned feedback, I made the following
changes:

* I renamed the vector8_search_* functions to vector8_find() and
vector8_find_ge(). IMO this is more in the spirit of existing function
names like vector8_has().

* I simplified vector8_find_ge() by essentially making it do the opposite
of what vector8_has_le() does (i.e., using saturating subtraction to find
matching bytes). This removes the need for vector8_min(), and since
vector8_find_ge() can just call vector8_search() to find any 0 bytes,
vector8_highbit_mask() can be removed as well.

* I simplified the test for pg_lfind8_ge_idx() by making it look a little
more like the test for pg_lfind32(). I wasn't sure about the use of rand()
and qsort(), and overall it just felt a little too complicated to me.

I've tested all three code paths (i.e., SSE2, Neon, and USE_NO_SIMD), but I
haven't done any performance analysis yet.

[0]: /messages/by-id/CAD21AoD3w76wERs_Lq7_uA6+gTaoOERPji+Yz8Ac6aui4JwvTg@mail.gmail.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-introduce-pg_lfind8_idx-and-pg_lfind8_ge_idx.patchtext/x-diff; charset=us-ascii
#2John Naylor
John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#1)
Re: introduce optimized linear search functions that return index of matching element

On Sat, Sep 17, 2022 at 12:29 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote:

v6 demonstrates why this should have been put off towards the end.

(more below)

Since the SIMD code is fresh in my mind, I wanted to offer my review for
0001 in the "Improve dead tuple storage for lazy vacuum" thread [0].
However, I agree with John that the SIMD part of that work should be left
for the end

As I mentioned in the radix tree thread, I don't believe this level of
abstraction is appropriate for the intended use case. We'll want to
incorporate some of the low-level simd.h improvements later, so you should
get authorship credit for those. I've marked the entry "returned with
feedback".

--
John Naylor
EDB: http://www.enterprisedb.com