Optimize Arm64 crc32c implementation in Postgresql
Hi all
Currently PostgreSQL only implements hardware support for CRC32 checksums on the x86_64 architecture.
Some ARMv8 (AArch64) CPUs implement the CRC32 extension, which can be used via inline assembly,
so they can also benefit from hardware acceleration in I/O-intensive workloads.
I would like to propose a patch to optimize crc32c calculation with Arm64-specific instructions.
The hardware-specific code is guarded by #if defined USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK.
Performance is improved on platforms such as Cortex-A57, Cortex-A72, and Cortex-A73.
I'll create a CommitFest entry for this submission.
Any comments or feedback are welcome.
BRs,
Yuqi
Attachments:
0001-Optimize-Arm64-crc32c-implementation-in-Postgresql.patch (application/octet-stream)
On Wed, Jan 10, 2018 at 05:58:19AM +0000, Yuqi Gu wrote:
I would like to propose the patch to optimize crc32c calculation with
Arm64 specific instructions. The hardware-specific code
implementation is used under #if defined
USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK. And the performance is improved
on platforms: cortex-A57, cortex-A72, cortex-A73, etc.
I'll create a CommitFests ticket for this submission.
Any comments or feedback are welcome.
Nice! I have not looked at your patch, but +1. There are not enough
patches for ARM.
--
Michael
Hi,
On 2018-01-10 05:58:19 +0000, Yuqi Gu wrote:
Currently PostgreSQL only implements hardware support for CRC32 checksums for the x86_64 architecture.
Some ARMv8 (AArch64) CPUs implement the CRC32 extension which is implemented by inline assembly,
so they can also benefit from hardware acceleration in IO-intensive workloads.
I would like to propose the patch to optimize crc32c calculation with Arm64 specific instructions.
The hardware-specific code implementation is used under #if defined USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK.
And the performance is improved on platforms: cortex-A57, cortex-A72, cortex-A73, etc.
I'll create a CommitFests ticket for this submission.
Any comments or feedback are welcome.
Could you show whether it actually improves performance? Usually bulk
loading data with parallel COPYs is a good way to hit this codepath.
+
+AC_DEFUN([PGAC_ARM64CE_CRC32_INTRINSICS],
+[AC_CACHE_CHECK([for Arm64ce CRC32], [pgac_cv_arm64ce_crc32_intrinsics],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+  [unsigned int arm_flag = 0;
+#if defined(__ARM_ARCH) && (__ARM_ARCH > 7)
+  arm_flag = 1;
+#endif
+  return arm_flag == 1;])],
+  [pgac_cv_arm64ce_crc32_intrinsics="yes"],
+  [pgac_cv_arm64ce_crc32_intrinsics="no"])])
+if test x"$pgac_cv_arm64ce_crc32_intrinsics" = x"yes"; then
+  pgac_arm64ce_crc32_intrinsics=yes
+fi
+])# PGAC_ARM64CE_CRC32_INTRINSICS
I don't understand what this check is supposed to be doing?
AC_LINK_IFELSE doesn't run the program, so I don't see this test
achieving anything at all?
* Use slicing-by-8 algorithm.

diff --git a/src/port/pg_crc32c_choose.c b/src/port/pg_crc32c_choose.c
index 40bee67..d3682ad 100644
--- a/src/port/pg_crc32c_choose.c
+++ b/src/port/pg_crc32c_choose.c
@@ -29,6 +29,20 @@
 #include "port/pg_crc32c.h"
+#ifdef USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK
+#include <sys/auxv.h>
+#include <asm/hwcap.h>
+#ifndef HWCAP_CRC32
+#define HWCAP_CRC32 (1 << 7)
+#endif
+static bool
+pg_crc32c_arm64ce_available(void) {
+    unsigned long auxv = getauxval(AT_HWCAP);
+    return (auxv & HWCAP_CRC32) != 0;
+}
+
+#else
What's the availability of these headers and functions on non-linux platforms?
+#if defined(USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK)
+asm(".arch_extension crc");
So this is emitted globally?
+#define LDP(x,y,p) asm("ldp %x[a], %x[b], [%x[c]], #16" : [a]"=r"(x),[b]"=r"(y),[c]"+r"(p))
+/* CRC32C: Castagnoli polynomial 0x1EDC6F41 */
+#define CRC32CX(crc,value) asm("crc32cx %w[c], %w[c], %x[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
+#define CRC32CW(crc,value) asm("crc32cw %w[c], %w[c], %w[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
+#define CRC32CH(crc,value) asm("crc32ch %w[c], %w[c], %w[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
+#define CRC32CB(crc,value) asm("crc32cb %w[c], %w[c], %w[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
+
+pg_crc32c
+pg_comp_crc32c_arm64(pg_crc32c crc, const void* data, size_t len) {
+    uint64 p0, p1;
+    pg_crc32c crc32_c = crc;
+    long length = len;
+    const unsigned char *p_buf = data;
+
+    /* Allow crc instructions in asm */
+    asm(".cpu generic+crc");
Hm, this switches it for the rest of the function, program, ...?
Greetings,
Andres Freund
Hi,
On 2018-01-10 15:09:16 +0900, Michael Paquier wrote:
There are not enough patches for ARM.
Nah, not needing arch specific patches is good ;)
Greetings,
Andres Freund
On Fri, Mar 2, 2018 at 10:36 AM, Andres Freund <andres@anarazel.de> wrote:
On 2018-01-10 05:58:19 +0000, Yuqi Gu wrote:
+#ifdef USE_ARMCE_CRC32C_WITH_RUNTIME_CHECK
+#include <sys/auxv.h>
+#include <asm/hwcap.h>
+#ifndef HWCAP_CRC32
+#define HWCAP_CRC32 (1 << 7)
+#endif
+static bool
+pg_crc32c_arm64ce_available(void) {
+    unsigned long auxv = getauxval(AT_HWCAP);
+    return (auxv & HWCAP_CRC32) != 0;
+}
+
+#else
What's the availability of these headers and functions on non-linux platforms?
FWIW I don't think that'll work on FreeBSD. I don't have an arm64
system to test on right now, but I can see that there is no
getauxval() like glibc's. FreeBSD *might* provide the same sort of
information via procstat_getauxv() from libprocstat, but I think maybe
not because I don't see any trace of HWCAP_CRC32 in the tree and I see
a different approach to testing the CPU ID registers in eg
libkern/crc32.c.
So... that stuff probably needs either a configure check for the
getauxval function and/or those headers, or an OS check?
While I'm looking at this:
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
Why? Doesn't something << 62 have the same value and type as
(something << 31) << 31?
+ --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run]
What is this for?
+ if (length & sizeof(uint16)) {
+ CRC32CH(crc32_c, *(uint16*)p_buf);
+ p_buf += sizeof(uint16);
+ }
+
+ if (length & sizeof(uint8)) {
+ CRC32CB(crc32_c, *p_buf);
+ }
From the department of trivialities, our coding style has braces like this:
if (length & sizeof(uint16))
{
CRC32CH(crc32_c, *(uint16*)p_buf);
p_buf += sizeof(uint16);
}
if (length & sizeof(uint8))
CRC32CB(crc32_c, *p_buf);
--
Thomas Munro
http://www.enterprisedb.com
On 2018-03-02 11:37:52 +1300, Thomas Munro wrote:
So... that stuff probably needs either a configure check for the
getauxval function and/or those headers, or an OS check?
It'd probably be better to not rely on os specific headers, and instead
directly access the capabilities.
While I'm looking at this:
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
Why? Doesn't something << 62 have the same value and type as
(something << 31) << 31?
+ --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run]
What is this for?
Those probably are damage from using a distribution autoconf rather than
stock autoconf.
Greetings,
Andres Freund
On Thu, Mar 01, 2018 at 01:36:23PM -0800, Andres Freund wrote:
On 2018-01-10 15:09:16 +0900, Michael Paquier wrote:
There are not enough patches for ARM.
Nah, not needing arch specific patches is good ;)
You know already that I am always impressed by your skills in
normalizing things where needed.
--
Michael
Hi,
Thanks for all your comments.
Could you show whether it actually improves performance? Usually bulk loading data with parallel COPYs is a good way to hit this code path.
The mini benchmark code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>
#include "port/pg_crc32c.h"
long int GetTickCount() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000000 + tv.tv_usec;
}
int main() {
static const uint64_t kSize = 1024 * 1024 + 29;
uint8_t* buf = (uint8_t *)malloc(sizeof(uint8_t) * kSize);
uint32_t i;
srand(0);
for (i = 0; i < kSize; i++) {
buf[i] = (uint8_t)(rand() % 256u);
}
uint32_t kLoop = 1024;
long int start, end;
uint32_t crc = 0xFFFFFFFF;
start = GetTickCount();
for (i = 0; i < kLoop; i++) {
COMP_CRC32C(crc, buf, kSize);
}
end = GetTickCount();
crc ^= 0xFFFFFFFF;
if (kSize < 20) {
for (i = 0; i < kSize; i++) {
printf("%3u,", (uint32_t)buf[i]);
}
printf("\n");
}
printf("crc result = %x, time cost per loop:%f ms\n", crc, (double)(end - start) / kLoop);
free(buf);
return 0;
}
The result shows that the optimization gets a 4.5x speedup on Cortex-A72.
What's the availability of these headers and functions on non-linux platforms?
This Arm64 optimization code only supports Linux so far.
asm(".arch_extension crc");
#define CRC32CB(crc,value) asm("crc32cb %w[c], %w[c], %w[v]" : [c]"+r"(*&crc) : [v]"r"(+value))
It just enables the crc extension for the GNU assembler.
'crc32cb' might not be recognized on Windows or the BSDs.
So, is it reasonable to add an OS check for the Arm64 crc32 optimization?
That would mean the Arm64 crc32 interface is only used on Linux-based OSes.
From the department of trivialities, our coding style has braces like this:
if (length & sizeof(uint16))
{
CRC32CH(crc32_c, *(uint16*)p_buf);
p_buf += sizeof(uint16);
}
if (length & sizeof(uint8))
CRC32CB(crc32_c, *p_buf);
Got it. I'll modify it.
+ --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run]
What is this for?
Those probably are damage from using a distribution autoconf rather than stock autoconf.
What does "stock autoconf" mean?
Should the configure script be generated by a specific version of autoconf?
Thanks!
BRs,
Yuqi
Hi,
On 2018-03-02 09:42:22 +0000, Yuqi Gu wrote:
Could you show whether it actually improves performance? Usually bulk loading data with parallel COPYs is a good way to hit this code path.
The mini benchmark code:
I'd be more interested in a benchmark using postgres itself...
What's the availability of these headers and functions on non-linux platforms?
This Arm64 optimization code only supports Linux so far.
That's not ok for postgres, unfortunately...
What does it mean "stock autoconf" ?
Should the configure script be made by specific version autoconf ?
Yes, it's the version from the autoconf project, rather than with
modifications by distributions. Don't worry about it, the committer can
take care of that.
- Andres
On 02/03/18 06:42, Andres Freund wrote:
On 2018-03-02 11:37:52 +1300, Thomas Munro wrote:
So... that stuff probably needs either a configure check for the
getauxval function and/or those headers, or an OS check?
It'd probably be better to not rely on os specific headers, and instead
directly access the capabilities.
Anyone got an idea on how to do that? I googled around a bit, but
couldn't find any examples. All the examples I could find were very
Linux-specific, and used getauxval(), except for this in the FreeBSD
kernel itself:
https://github.com/freebsd/freebsd/blob/master/sys/libkern/crc32.c#L775.
I'm no expert on FreeBSD, but that doesn't seem suitable for use in a
user program.
In any case, I reworked this patch to follow the example of the existing
code more closely. Notable changes:
* Use compiler intrinsics instead of inline assembly.
* If the target architecture has them, use the CRC instructions without
a runtime check. You'll get that if you use "-march=armv8.1-a", for
example, as the CRC Extension was made mandatory in ARM v8.1. This
should work even on FreeBSD or other non-Linux systems, where
getauxval() is not available.
* I removed the loop to handle two uint64's at a time, using the LDP
instruction. I couldn't find a compiler intrinsic for that, and it was
actually slower, at least on the system I have access to, than a
straightforward loop that processes 8 bytes at a time.
* I tested this on Linux, with gcc and clang, on an ARM64 virtual
machine that I had available (not an emulator, but a VM on a shared
ARM64 server).
- Heikki
Attachments:
arm64ce-crc32c-1.patch (text/x-patch)
Hi,
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
On 02/03/18 06:42, Andres Freund wrote:
On 2018-03-02 11:37:52 +1300, Thomas Munro wrote:
So... that stuff probably needs either a configure check for the
getauxval function and/or those headers, or an OS check?
It'd probably be better to not rely on os specific headers, and instead
directly access the capabilities.
Anyone got an idea on how to do that? I googled around a bit, but couldn't
find any examples.
Similar...
* Use compiler intrinsics instead of inline assembly.
+many
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
- Andres
On 01/04/18 20:32, Andres Freund wrote:
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
I just ran a small test with pg_waldump, similar to what Abhijit
Menon-Sen ran with the Slicing-by-8 and Intel SSE patches, when we added
those
(/messages/by-id/20141119155811.GA32492@toroid.org).
I ran pgbench, with scale factor 5, until it had generated about 1 GB of
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it
took about 7 s, and with the special CPU instructions, about 5 s. 'perf'
showed that the CRC computation took about 30% of the CPU time before,
and about 12% after, which sounds about right. That's not as big a
speedup as we saw with the corresponding Intel SSE instructions back in
2014, but still quite worthwhile.
- Heikki
Hi,
On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote:
On 01/04/18 20:32, Andres Freund wrote:
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen
ran with the Slicing-by-8 and Intel SSE patches, when we added those
(/messages/by-id/20141119155811.GA32492@toroid.org).
I ran pgbench, with scale factor 5, until it had generated about 1 GB of
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took
about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed
that the CRC computation took about 30% of the CPU time before, and about
12% after, which sounds about right. That's not as big a speedup as we saw
with the corresponding Intel SSE instructions back in 2014, but still quite
worthwhile.
Cool. Based on a skim the patch looks reasonable. It's a bit sad that
it's effectively Linux-specific. But I'm not sure we can do anything
about that atm, given the state of the "discovery" APIs on various
platforms.
Greetings,
Andres Freund
On 03/04/18 19:09, Andres Freund wrote:
Hi,
On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote:
On 01/04/18 20:32, Andres Freund wrote:
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen
ran with the Slicing-by-8 and Intel SSE patches, when we added those
(/messages/by-id/20141119155811.GA32492@toroid.org).
I ran pgbench, with scale factor 5, until it had generated about 1 GB of
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took
about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed
that the CRC computation took about 30% of the CPU time before, and about
12% after, which sounds about right. That's not as big a speedup as we saw
with the corresponding Intel SSE instructions back in 2014, but still quite
worthwhile.Cool. Based on a skim the patch looks reasonable.
Thanks.
I bikeshedded with myself on the naming of things, and decided to use
"ARMv8" in the variable and file names, instead of ARM64 or ARMCE or
ARM64CE. The CRC instructions were introduced in ARM v8 (as an optional
feature); they're not really related to the 64-bitness, even though the
64-bit instruction set was also introduced in ARM v8. Other than that,
and some comment fixes, this is the same as the previous patch version.
I was just about to commit this, when I started to wonder: Do we need to
worry about alignment? As the patch stands, it will merrily do unaligned
8-byte loads. Is that OK on ARM? It seems to work on the system I've
been testing on, but I don't know. And even if it's OK, would it perform
better if we did 1-byte loads in the beginning, until we reach the
8-byte boundary?
- Heikki
Attachments:
armv8-crc32c-2.patch (text/x-patch)
Hi,
On 2018-04-03 19:38:42 +0300, Heikki Linnakangas wrote:
I was just about to commit this, when I started to wonder: Do we need to
worry about alignment? As the patch stands, it will merrily do unaligned
8-byte loads. Is that OK on ARM? It seems to work on the system I've been
testing on, but I don't know. And even if it's OK, would it perform better
if we did 1-byte loads in the beginning, until we reach the 8-byte boundary?
Architecture manual time? They're available freely IIRC and should
answer this.
- Andres
Heikki Linnakangas <hlinnaka@iki.fi> writes:
I was just about to commit this, when I started to wonder: Do we need to
worry about alignment? As the patch stands, it will merrily do unaligned
8-byte loads. Is that OK on ARM? It seems to work on the system I've
been testing on, but I don't know. And even if it's OK, would it perform
better if we did 1-byte loads in the beginning, until we reach the
8-byte boundary?
I'm pretty sure that some ARM platforms emulate unaligned access through
kernel trap handlers, which would certainly make this a lot slower than
handling the unaligned bytes manually. Maybe that doesn't apply to any
ARM CPU that has this instruction ... but as you said, it'd be better
to consider the presence of the instruction as orthogonal to other
CPU features.
regards, tom lane
On 03 Apr 2018, at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03/04/18 19:09, Andres Freund wrote:
Hi,
On 2018-04-03 19:05:19 +0300, Heikki Linnakangas wrote:
On 01/04/18 20:32, Andres Freund wrote:
On 2018-03-06 02:44:35 +0800, Heikki Linnakangas wrote:
* I tested this on Linux, with gcc and clang, on an ARM64 virtual machine
that I had available (not an emulator, but a VM on a shared ARM64 server).
Have you seen actual postgres performance benefits with the patch?
I just ran a small test with pg_waldump, similar to what Abhijit Menon-Sen
ran with the Slicing-by-8 and Intel SSE patches, when we added those
(/messages/by-id/20141119155811.GA32492@toroid.org).
I ran pgbench, with scale factor 5, until it had generated about 1 GB of
WAL, and then I ran pg_waldump -z on that WAL. With slicing-by-8, it took
about 7 s, and with the special CPU instructions, about 5 s. 'perf' showed
that the CRC computation took about 30% of the CPU time before, and about
12% after, which sounds about right. That's not as big a speedup as we saw
with the corresponding Intel SSE instructions back in 2014, but still quite
worthwhile.
Cool. Based on a skim the patch looks reasonable.
Thanks.
I bikeshedded with myself on the naming of things, and decided to use "ARMv8" in the variable and file names, instead of ARM64 or ARMCE or ARM64CE. The CRC instructions were introduced in ARM v8 (as an optional feature), it's not really related to the 64-bitness, even though the 64-bit instruction set was also introduced in ARM v8. Other than that, and some comment fixes, this is the same as the previous patch version.
Noticed two minor documentation issues in a quick skim of the attached patch:
The following line should say SSE and not SSSE (and the same typo is in
src/include/pg_config.h.in and src/include/pg_config.h.win32). While not
introduced in this patch, it’s adjacent to the patched codepath and on topic so
may well be fixed while in there.
AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSSE 4.2 CRC instructions with a runtime check.])
The documentation in configure.in uses "runtime" rather than "run time" in all
lines except this one:
+# uses the CRC instructions, compile both, and select at run time.
cheers ./daniel
On Wed, Apr 4, 2018 at 4:38 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
[armv8-crc32c-2.patch]
This tests OK on my Debian buster aarch64 system (the machine that
runs "eelpout" in the buildfarm), configure tests as expected and I
confirmed that pg_comp_crc32c_armv8 is reached at runtime.
I hope we can figure out a more portable way to detect the
instructions, or failing that a way to detect them on FreeBSD in
userspace. I'll try to send a patch for PG12 if I get a chance.
No opinion on the unaligned memory access question.
--
Thomas Munro
http://www.enterprisedb.com
On 03/04/18 21:56, Daniel Gustafsson wrote:
The following line should say SSE and not SSSE (and the same typo is in
src/include/pg_config.h.in and src/include/pg_config.h.win32). While not
introduced in this patch, it’s adjacent to the patched codepath and on topic so
may well be fixed while in there.
AC_DEFINE(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSSE 4.2 CRC instructions with a runtime check.])
I pushed that as a separate commit, as that goes back to 9.5. Also, I
noticed that the description of USE_SLICING_BY_8_CRC32C was completely
wrong, fixed that too. Thanks!
- Heikki
On 03/04/18 19:43, Andres Freund wrote:
Architecture manual time? They're available freely IIRC and should
answer this.
Yeah. The best reference I could find was "ARM Cortex-A Series
Programmer’s Guide for ARMv8-A"
(http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/ch08s01.html).
In the "Porting to A64" section, it says:
Data and code must be aligned to appropriate boundaries. The
alignment of accesses can affect performance on ARM cores and can
represent a portability problem when moving code from an earlier
architecture to ARMv8-A. It is worth being aware of alignment issues
for performance reasons, or when porting code that makes assumptions
about pointers or 32-bit and 64-bit integer variables.
I was a bit surprised by the "must be aligned to appropriate boundaries"
statement. Googling around, the strict alignment requirement was removed
in ARMv7, and since then, unaligned access works similarly to Intel. I
think there are some special instructions, like atomic ops, that require
alignment though. Perhaps that's what that sentence refers to.
On 03/04/18 20:47, Tom Lane wrote:
I'm pretty sure that some ARM platforms emulate unaligned access through
kernel trap handlers, which would certainly make this a lot slower than
handling the unaligned bytes manually. Maybe that doesn't apply to any
ARM CPU that has this instruction ... but as you said, it'd be better
to consider the presence of the instruction as orthogonal to other
CPU features.
I did some quick testing, and found that unaligned access is about 2x
slower than aligned. I don't think it's being trapped by the kernel, I
think that would be even slower, but clearly there is an effect there.
So I added code to process the first 1-7 bytes separately, so that the
main loop runs on 8-byte aligned addresses.
Pushed, thanks everyone!
- Heikki