[PATCH] Hex-coding optimizations using SVE on ARM.

Started by Devanga.Susmitha@fujitsu.com, 48 messages
#1 Devanga.Susmitha@fujitsu.com
Devanga.Susmitha@fujitsu.com

Hello,

This email aims to discuss the contribution of optimized hex_encode and hex_decode functions for ARM (aarch64) machines. These functions are widely used for encoding and decoding binary data in the bytea data type.

The current method for hex_encode and hex_decode relies on a scalar implementation that processes data byte by byte, with no SIMD-based optimization available. With the introduction of SVE optimizations, we leverage CPU intrinsics to process larger data blocks in parallel, significantly reducing encoding and decoding latency.

We have designed this feature to ensure compatibility and robustness. It includes compile-time and runtime checks for SVE compatibility with both the compiler and hardware. If either check fails, the code falls back to the existing scalar implementation, ensuring fail-safe operation.

For the architecture-specific functions, we have used pg_attribute_target("arch=armv8-a+sve") to compile, enabling precise compiler control without using extra global CFLAGS.
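A minimal sketch of that compile-time/runtime fallback scheme (illustrative only: `pg_hex_sve_available`, `hex_encode_scalar`, and the dispatch shape are hypothetical names, not necessarily what the patch uses; on Linux/aarch64 the runtime probe can use getauxval(AT_HWCAP) & HWCAP_SVE, and everywhere else the scalar path is taken):

```c
#include <stddef.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

/* Hypothetical runtime probe: true only when the CPU reports SVE support. */
static int
pg_hex_sve_available(void)
{
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP_SVE)
	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
	return 0;
#endif
}

/* Existing scalar fallback: emits two hex digits per input byte. */
static size_t
hex_encode_scalar(const char *src, size_t len, char *dst)
{
	static const char hextbl[] = "0123456789abcdef";

	for (size_t i = 0; i < len; i++)
	{
		unsigned char c = (unsigned char) src[i];

		*dst++ = hextbl[c >> 4];
		*dst++ = hextbl[c & 0xF];
	}
	return len * 2;
}

size_t
hex_encode(const char *src, size_t len, char *dst)
{
	if (pg_hex_sve_available())
	{
		/* the SVE intrinsics version would be called here */
	}
	return hex_encode_scalar(src, len, dst);
}
```

On hardware or compilers without SVE, only the scalar branch is ever reachable, which is the fail-safe behavior described above.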

System Configurations
Machine: AWS EC2 m7g.4xlarge
OS: Ubuntu 22.04
GCC: 11.4

Benchmark and Results
Setup:
We have developed a microbenchmark based on [0] to evaluate the performance of the SVE-enabled hex_encode and hex_decode functions against the default implementation across various input sizes. The microbenchmark patch is attached.

Query:
time psql -c "select hex_encode_test(1000000, input_size);"
time psql -c "select hex_decode_test(1000000, input_size);"
The query was executed for input sizes ranging from 8 to 262144 bytes.

Results:
We observed significant query speed-ups: up to 17x for hex_encode and up to 4x for hex_decode.

Additionally, we tested the optimization with the bytea data type on a table of size 1435 MB containing two columns: the first an auto-incrementing ID and the second a bytea column holding binary data. We then ran the query "SELECT data FROM bytea_table" using a script and recorded the time taken by hex_encode using perf. The results are presented below.

Default scalar implementation:
Query exec time: 2.858 sec
hex_encode function time: 1.228 sec

SVE-enabled implementation:
Query exec time: 1.654 sec (approximately 1.7 times improvement)
hex_encode_sve function time: 0.085 sec (approximately 14.44 times improvement)

Improvements using SVE are noticeable starting from an input size of 16 bytes for hex_encode and 32 bytes for hex_decode. Hence, SVE implementations are used only when the input size surpasses these thresholds.

We would like to contribute the above work so that it is available for the community to use. To do so, we are following the procedure described in Submitting a Patch on the PostgreSQL wiki <https://wiki.postgresql.org/wiki/Submitting_a_Patch>. Please find the patches and performance results attached.

Please let us know if you have any queries or suggestions.

Thanks & Regards,
Susmitha Devanga.

[0]: /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Attachments:

hex_decode_woFlags.PNG (image/png)
hex_encode_woFlags.PNG (image/png)
v1-0001-test-module-for-hex-coding 1.patch (application/octet-stream, +113 −1)
v1-0001-SVE-support-for-hex-encode-and-hex-decode.patch (application/octet-stream, +397 −1)
#2 Nathan Bossart
nathandbossart@gmail.com
In reply to: Devanga.Susmitha@fujitsu.com (#1)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Thu, Jan 09, 2025 at 11:22:05AM +0000, Devanga.Susmitha@fujitsu.com wrote:

This email aims to discuss the contribution of optimized hex_encode and
hex_decode functions for ARM (aarch64) machines. These functions are
widely used for encoding and decoding binary data in the bytea data type.

Thank you for sharing this work! I'm not able to review this in depth at
the moment, but I am curious if you considered trying to enable
auto-vectorization on the code or using the higher-level SIMD support in
src/include/port/simd.h. Those may not show as impressive of gains as your
patch, but they would likely require much less code and apply to a wider
set of architectures.

--
nathan

#3 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#2)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Hello Nathan,

We tried auto-vectorization and observed no performance improvement.
The instructions in src/include/port/simd.h are based on older SIMD architectures like NEON, whereas the patch uses the newer SVE, so some of the instructions used in the patch may not have direct equivalents in NEON. We will check the feasibility of integrating SVE in "src/include/port/simd.h" and get back to you.
The actual encoding/decoding implementation takes less than 100 lines. The rest of the code is related to config and the "choose" logic. One option is to move the implementation to a new file, making src/backend/utils/adt/encode.c less bloated.

Thanks,
Chiranmoy

#4 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#3)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Jan 10, 2025 at 11:10:03AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

We tried auto-vectorization and observed no performance improvement.

Do you mean that the auto-vectorization worked and you observed no
performance improvement, or the auto-vectorization had no effect on the
code generated?

The instructions in src/include/port/simd.h are based on older SIMD
architectures like NEON, whereas the patch uses the newer SVE, so some of
the instructions used in the patch may not have direct equivalents in
NEON. We will check the feasibility of integrating SVE in
"src/include/port/simd.h" and get back to you.

Thanks!

--
nathan

#5 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#4)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:

On Fri, Jan 10, 2025 at 11:10:03AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

We tried auto-vectorization and observed no performance improvement.

Do you mean that the auto-vectorization worked and you observed no
performance improvement, or the auto-vectorization had no effect on the
code generated?

I was able to get auto-vectorization to take effect on Apple clang 16 with
the following addition to src/backend/utils/adt/Makefile:

encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -mllvm -force-vector-width=8

This gave the following results with your hex_encode_test() function:

buf | HEAD | patch | % diff
-------+-------+-------+--------
16 | 21 | 16 | 24
64 | 54 | 41 | 24
256 | 138 | 100 | 28
1024 | 441 | 300 | 32
4096 | 1671 | 1106 | 34
16384 | 6890 | 4570 | 34
65536 | 27393 | 18054 | 34

This doesn't compare with the gains you are claiming to see with
intrinsics, but it's not bad for a one line change. I bet there are ways
to adjust the code so that the auto-vectorization is more effective, too.

--
nathan

#6 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#5)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:

Do you mean that the auto-vectorization worked and you observed no
performance improvement, or the auto-vectorization had no effect on the
code generated?

Auto-vectorization is working now with the following addition on Graviton 3 (m7g.4xlarge) with GCC 11.4, and the results match yours. Previously, auto-vectorization had no effect because we missed the -march=native option.

      encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -march=native

There is a 30% improvement using auto-vectorization.

buf | default | auto_vec | SVE
--------+-------+--------+-------
16 | 16 | 12 | 8
64 | 58 | 40 | 9
256 | 223 | 152 | 18
1024 | 934 | 613 | 54
4096 | 3533 | 2430 | 202
16384 | 14081 | 9831 | 800
65536 | 56374 | 38702 | 3202

Auto-vectorization had no effect on hex_decode due to the presence of control flow.

-----
Here is a comment snippet from src/include/port/simd.h

"While Neon support is technically optional for aarch64, it appears that all available 64-bit hardware does have it."

Currently, it is assumed that all aarch64 machines support NEON, but for newer SIMD extensions like SVE (and AVX512 for x86) this assumption may not hold; we need a runtime check to be sure. Using src/include/port/simd.h to abstract away these advanced SIMD implementations may be difficult.

We will update the thread once a solution is found.

-----
Chiranmoy

#7 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#6)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Mon, Jan 13, 2025 at 03:48:49PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

There is a 30% improvement using auto-vectorization.

It might be worth enabling auto-vectorization independently of any patches
that use intrinsics, then.

Currently, it is assumed that all aarch64 machine support NEON, but for
newer advanced SIMD like SVE (and AVX512 for x86) this assumption may not
hold. We need a runtime check to be sure.. Using src/include/port/simd.h
to abstract away these advanced SIMD implementations may be difficult.

Yeah, moving simd.h to anything beyond Neon/SSE2 might be tricky at the
moment. Besides the need for additional runtime checks, using wider
registers can mean that you need more data before an optimization takes
effect, which is effectively a regression. I ran into this when I tried to
add AVX2 support to simd.h [0]. My question about using simd.h was
ultimately about abstracting the relevant Neon/SSE2 instructions and using
those for hex_encode/decode(). If that's possible, I think it'd be
interesting to see how that compares to the SVE version.

[0]: /messages/by-id/20231129171526.GA857928@nathanxps13

--
nathan

#8 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#5)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Sat, Jan 11, 2025 at 3:46 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

I was able to get auto-vectorization to take effect on Apple clang 16 with
the following addition to src/backend/utils/adt/Makefile:

encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -mllvm -force-vector-width=8

This gave the following results with your hex_encode_test() function:

buf | HEAD | patch | % diff
-------+-------+-------+--------
16 | 21 | 16 | 24
64 | 54 | 41 | 24
256 | 138 | 100 | 28
1024 | 441 | 300 | 32
4096 | 1671 | 1106 | 34
16384 | 6890 | 4570 | 34
65536 | 27393 | 18054 | 34

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

select hex_encode_test(1000000, 1024);
master:
Time: 1158.700 ms
v2:
Time: 777.443 ms

If we need to do much better than this, it seems better to send the
data to the client as binary, if possible.
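A sketch of the bytewise-lookup idea under discussion (illustrative only: the 512-byte table is built at startup here rather than written out as a 512-character literal, and the function names are hypothetical, not the patch's):

```c
#include <string.h>
#include <stddef.h>

/* Two hex chars per possible byte value: 256 * 2 = 512 bytes. */
static char hextbl[512];

static void
init_hextbl(void)
{
	static const char d[] = "0123456789abcdef";

	for (int i = 0; i < 256; i++)
	{
		hextbl[2 * i] = d[i >> 4];
		hextbl[2 * i + 1] = d[i & 0xF];
	}
}

static size_t
hex_encode_bytewise(const char *src, size_t len, char *dst)
{
	const unsigned char *usrc = (const unsigned char *) src;

	/* One 2-byte copy per input byte; compilers inline this memcpy. */
	for (size_t i = 0; i < len; i++, dst += 2)
		memcpy(dst, &hextbl[2 * usrc[i]], 2);
	return len * 2;
}
```

Each input byte indexes directly into a precomputed pair of output characters, replacing the two nibble lookups of the scalar loop with a single two-byte copy.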

--
John Naylor
Amazon Web Services

Attachments:

v2-byte-lookup.patch (text/x-patch, +3 −4)
#9 Michael Paquier
michael@paquier.xyz
In reply to: John Naylor (#8)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Tue, Jan 14, 2025 at 12:27:30PM +0700, John Naylor wrote:

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

select hex_encode_test(1000000, 1024);
master:
Time: 1158.700 ms
v2:
Time: 777.443 ms

If we need to do much better than this, it seems better to send the
data to the client as binary, if possible.

That's pretty cool. Complex to parse, still really cool.
--
Michael

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#8)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

John Naylor <johncnaylorls@gmail.com> writes:

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

I didn't attempt to verify your patch, but I do prefer addressing
this issue in a machine-independent fashion. I also like the brevity
of the patch (though it could do with some comments perhaps, not that
the existing code has any).

regards, tom lane

#11 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#10)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Tue, Jan 14, 2025 at 12:59:04AM -0500, Tom Lane wrote:

John Naylor <johncnaylorls@gmail.com> writes:

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

Nice. I tried enabling auto-vectorization and loop unrolling on top of
this patch, and the numbers looked the same. I think we'd need CPU
intrinsics or an even bigger lookup table to do any better.

I didn't attempt to verify your patch, but I do prefer addressing
this issue in a machine-independent fashion. I also like the brevity
of the patch (though it could do with some comments perhaps, not that
the existing code has any).

+1

--
nathan

#12 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#11)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Tue, Jan 14, 2025 at 11:57 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Tue, Jan 14, 2025 at 12:59:04AM -0500, Tom Lane wrote:

John Naylor <johncnaylorls@gmail.com> writes:

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

Nice. I tried enabling auto-vectorization and loop unrolling on top of
this patch, and the numbers looked the same. I think we'd need CPU
intrinsics or an even bigger lookup table to do any better.

Thanks for looking further! Yeah, I like that the table is still only 512 bytes.

I didn't attempt to verify your patch, but I do prefer addressing
this issue in a machine-independent fashion. I also like the brevity
of the patch (though it could do with some comments perhaps, not that
the existing code has any).

+1

Okay, I added a comment. I also agree with Michael that my quick
one-off was a bit hard to read so I've cleaned it up a bit. I plan to
commit the attached by Friday, along with any bikeshedding that
happens by then.

--
John Naylor
Amazon Web Services

Attachments:

v3-0001-Speed-up-hex_encode-with-bytewise-lookup.patch (text/x-patch, +26 −4)
#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#12)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

John Naylor <johncnaylorls@gmail.com> writes:

Okay, I added a comment. I also agree with Michael that my quick
one-off was a bit hard to read so I've cleaned it up a bit. I plan to
commit the attached by Friday, along with any bikeshedding that
happens by then.

Couple of thoughts:

1. I was actually hoping for a comment on the constant's definition,
perhaps along the lines of

/*
* The hex expansion of each possible byte value (two chars per value).
*/

2. Since "src" is defined as "const char *", I'm pretty sure that
pickier compilers will complain that

+ unsigned char usrc = *((unsigned char *) src);

results in casting away const. Recommend

+ unsigned char usrc = *((const unsigned char *) src);

3. I really wonder if

+ memcpy(dst, &hextbl[2 * usrc], 2);

is faster than copying the two bytes manually, along the lines of

+ *dst++ = hextbl[2 * usrc];
+ *dst++ = hextbl[2 * usrc + 1];

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization? If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

A variant could be

+		const char *hexptr = &hextbl[2 * usrc];
+		*dst++ = hexptr[0];
+		*dst++ = hexptr[1];

but this supposes that the compiler fails to see the common
subexpression in the other formulation, which I believe
most modern compilers will see.

regards, tom lane

#14 John Naylor
john.naylor@enterprisedb.com
In reply to: Tom Lane (#13)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, Jan 15, 2025 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Couple of thoughts:

1. I was actually hoping for a comment on the constant's definition,
perhaps along the lines of

/*
* The hex expansion of each possible byte value (two chars per value).
*/

Works for me. With that, did you mean we then wouldn't need a comment
in the code?

2. Since "src" is defined as "const char *", I'm pretty sure that
pickier compilers will complain that

+ unsigned char usrc = *((unsigned char *) src);

results in casting away const. Recommend

+ unsigned char usrc = *((const unsigned char *) src);

Thanks for the reminder!

3. I really wonder if

+ memcpy(dst, &hextbl[2 * usrc], 2);

is faster than copying the two bytes manually, along the lines of

+               *dst++ = hextbl[2 * usrc];
+               *dst++ = hextbl[2 * usrc + 1];

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization? If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

See measurements at the end. As for compilers, gcc 3.4.6 and clang
3.0.0 can inline the memcpy. The manual copy above only gets combined
to a single word starting with gcc 12 and clang 15, and latest MSVC
still can't do it (4A in the godbolt link below). Are there any
buildfarm animals around that may not inline memcpy for word-sized
input?

A variant could be

+               const char *hexptr = &hextbl[2 * usrc];
+               *dst++ = hexptr[0];
+               *dst++ = hexptr[1];

but this supposes that the compiler fails to see the common
subexpression in the other formulation, which I believe
most modern compilers will see.

This combines to a single word starting with clang 5, but does not
work on gcc 14.2 or gcc trunk (4B below). I have gcc 14.2 handy, and
on my machine bytewise load/stores are somewhere in the middle:

master 1158.969 ms
v3 776.791 ms
variant 4A 775.777 ms
variant 4B 969.945 ms

https://godbolt.org/z/ajToordKq

--
John Naylor
Amazon Web Services

#15 Ranier Vilela
ranier.vf@gmail.com
In reply to: John Naylor (#14)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Hi.

On Wed, Jan 15, 2025 at 07:57, John Naylor <johncnaylorls@gmail.com>
wrote:

On Wed, Jan 15, 2025 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Couple of thoughts:

1. I was actually hoping for a comment on the constant's definition,
perhaps along the lines of

/*
* The hex expansion of each possible byte value (two chars per value).
*/

Works for me. With that, did you mean we then wouldn't need a comment
in the code?

2. Since "src" is defined as "const char *", I'm pretty sure that
pickier compilers will complain that

+ unsigned char usrc = *((unsigned char *) src);

results in casting away const. Recommend

+ unsigned char usrc = *((const unsigned char *) src);

Thanks for the reminder!

3. I really wonder if

+ memcpy(dst, &hextbl[2 * usrc], 2);

is faster than copying the two bytes manually, along the lines of

+               *dst++ = hextbl[2 * usrc];
+               *dst++ = hextbl[2 * usrc + 1];

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization? If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

See measurements at the end. As for compilers, gcc 3.4.6 and clang
3.0.0 can inline the memcpy. The manual copy above only gets combined
to a single word starting with gcc 12 and clang 15, and latest MSVC
still can't do it (4A in the godbolt link below). Are there any
buildfarm animals around that may not inline memcpy for word-sized
input?

A variant could be

+               const char *hexptr = &hextbl[2 * usrc];
+               *dst++ = hexptr[0];
+               *dst++ = hexptr[1];

but this supposes that the compiler fails to see the common
subexpression in the other formulation, which I believe
most modern compilers will see.

This combines to a single word starting with clang 5, but does not
work on gcc 14.2 or gcc trunk (4B below). I have gcc 14.2 handy, and
on my machine bytewise load/stores are somewhere in the middle:

master 1158.969 ms
v3 776.791 ms
variant 4A 775.777 ms
variant 4B 969.945 ms

https://godbolt.org/z/ajToordKq

Your godbolt example has an important difference, which changes the generated assembly:

-static const char hextbl[] =
"000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"
;
+static const char hextbl[512] =
"000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"
;

best regards,
Ranier Vilela

#16 David Rowley
dgrowleyml@gmail.com
In reply to: John Naylor (#14)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, 15 Jan 2025 at 23:57, John Naylor <johncnaylorls@gmail.com> wrote:

On Wed, Jan 15, 2025 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization? If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

See measurements at the end. As for compilers, gcc 3.4.6 and clang
3.0.0 can inline the memcpy. The manual copy above only gets combined
to a single word starting with gcc 12 and clang 15, and latest MSVC
still can't do it (4A in the godbolt link below). Are there any
buildfarm animals around that may not inline memcpy for word-sized
input?

A variant could be

+               const char *hexptr = &hextbl[2 * usrc];
+               *dst++ = hexptr[0];
+               *dst++ = hexptr[1];

I'd personally much rather see us using memcpy() for this sort of
stuff. If the compiler is too braindead to inline tiny
constant-and-power-of-two-sized memcpys then we'd probably also have
plenty of other performance issues with that compiler already. I don't
think contorting the code into something less human-readable and
something the compiler may struggle even more to optimise is a good
idea. The naive way to implement the above requires two MOVs of
single bytes and two increments of dst. I imagine it's easier for the
compiler to inline a small constant-sized memcpy() than to figure out
that it's safe to implement the above with a single word-sized MOV
rather than two byte-sized MOVs due to the "dst++" in between the two.
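For concreteness, the two idioms being compared can be written side by side as a self-contained sketch (function names are mine; both produce identical output, and the difference is only in what the compiler can collapse into a single word-sized store):

```c
#include <string.h>
#include <stddef.h>

static char hextbl[512];		/* "000102...feff", two chars per byte value */

static void
init_hextbl(void)
{
	static const char d[] = "0123456789abcdef";

	for (int i = 0; i < 256; i++)
	{
		hextbl[2 * i] = d[i >> 4];
		hextbl[2 * i + 1] = d[i & 0xF];
	}
}

/* memcpy form: one constant-sized copy, easy for compilers to inline. */
static void
encode_memcpy(const unsigned char *src, size_t len, char *dst)
{
	for (size_t i = 0; i < len; i++, dst += 2)
		memcpy(dst, &hextbl[2 * src[i]], 2);
}

/* byte-store form: two single-byte stores with a dst increment between. */
static void
encode_bytewise(const unsigned char *src, size_t len, char *dst)
{
	for (size_t i = 0; i < len; i++)
	{
		*dst++ = hextbl[2 * src[i]];
		*dst++ = hextbl[2 * src[i] + 1];
	}
}
```

Per the godbolt results quoted above, the memcpy form is the one that old and new compilers alike turn into a single 16-bit store.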

I agree that the evidence you (John) gathered is enough reason to use memcpy().

David

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#16)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

David Rowley <dgrowleyml@gmail.com> writes:

I agree that the evidence you (John) gathered is enough reason to use memcpy().

Okay ... doesn't quite match my intuition, but intuition is a poor
guide to such things.

regards, tom lane

#18 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#17)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

With commit e24d770 in place, I took a closer look at hex_decode(), and I
concluded that doing anything better without intrinsics would likely
require either a huge lookup table or something with complexity rivalling
the intrinsics approach (while also not rivalling its performance). So, I
took a closer look at the intrinsics patches and had the following
thoughts:

* The approach looks generally reasonable to me, but IMHO the code needs
much more commentary to explain how it works.

* The functions that test the length before potentially calling a function
pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
I wouldn't be surprised if some compilers are inlining this stuff
already, but it's probably worth being explicit about it.

* Finally, I think we should ensure we've established a really strong case
for this optimization. IME these intrinsics patches require a ton of
time and energy, and the code is often extremely complex. I would be
interested to see how your bytea test compares with the improvements
added in commit e24d770 and with sending the data in binary.
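The inline length-test wrapper suggested above (patterned on pg_popcount() in pg_bitutils.h) might look roughly like this. It is a sketch under assumptions: the function names, the function-pointer dispatch, and the 16-byte threshold (taken from the numbers earlier in the thread) are illustrative, not the patch's actual code:

```c
#include <stddef.h>

#define HEX_ENCODE_SIMD_THRESHOLD 16	/* below this, SIMD doesn't pay off */

/* Scalar fallback: two hex digits per input byte. */
static size_t
hex_encode_scalar(const char *src, size_t len, char *dst)
{
	static const char d[] = "0123456789abcdef";

	for (size_t i = 0; i < len; i++)
	{
		unsigned char c = (unsigned char) src[i];

		*dst++ = d[c >> 4];
		*dst++ = d[c & 0xF];
	}
	return len * 2;
}

/* Repointed at startup to the SVE version when the CPU supports it. */
static size_t (*hex_encode_optimized) (const char *src, size_t len,
									   char *dst) = hex_encode_scalar;

/*
 * Inline wrapper: short inputs skip the function-pointer call entirely,
 * so the common small-input case pays no dispatch overhead.
 */
static inline size_t
hex_encode(const char *src, size_t len, char *dst)
{
	if (len < HEX_ENCODE_SIMD_THRESHOLD)
		return hex_encode_scalar(src, len, dst);
	return hex_encode_optimized(src, len, dst);
}
```

Because the wrapper is inline, the length test is visible to callers and the indirect call disappears entirely for inputs below the threshold.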

--
nathan

#19 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#18)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

The approach looks generally reasonable to me, but IMHO the code needs
much more commentary to explain how it works.

Added comments to explain the SVE implementation.

I would be interested to see how your bytea test compares with the
improvements added in commit e24d770 and with sending the data in binary.

The following are the bytea test results with commit e24d770.
The same query and tables were used.

With commit e24d770:
Query exec time: 2.324 sec
hex_encode function time: 0.72 sec

Pre-commit e24d770:
Query exec time: 2.858 sec
hex_encode function time: 1.228 sec

SVE patch:
Query exec time: 1.654 sec
hex_encode_sve function time: 0.085 sec

The functions that test the length before potentially calling a function
pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
I wouldn't be surprised if some compilers are inlining this stuff
already, but it's probably worth being explicit about it.

Should we implement an inline function in "utils/builtins.h", similar to
pg_popcount()? Currently, we have not modified the header file; everything
is statically implemented in encode.c.

---
Chiranmoy

#20 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#19)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

I realized I didn't attach the patch.

Attachments:

v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch (application/octet-stream, +416 −1)
#21 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#19)
#22 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#20)
#23 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#22)
#24 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#23)
#25 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#24)
#26 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#25)
#27 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#26)
#28 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#27)
#29 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#28)
#30 John Naylor
john.naylor@enterprisedb.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#28)
#31 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: John Naylor (#30)
#32 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#31)
#33 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#32)
#34 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#33)
#35 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#34)
#36 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Tom Lane (#35)
#37 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#36)
#38 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#37)
#39 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#38)
#40 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#39)
#41 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#40)
#42 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#41)
#43 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#42)
#44 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#43)
#45 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#44)
#46 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#45)
#47 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#46)
#48 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#47)