Popcount optimization using AVX512
This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share the preliminary results with the community and get feedback for adding avx512 support for popcount.
Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the pg_popcount() in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this implementation has improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.
My setup:
Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
OS : Ubuntu 22.04
GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".
1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
a. Software only and
b. SSE 64 bit version
2. I created an implementation using the following AVX512 intrinsics:
a. _mm512_popcnt_epi64()
b. _mm512_reduce_add_epi64()
3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])
4. I tested 5 seeds for each input buffer size and averaged 100 runs each (5*5*100=2500 pg_popcount() calls on a single thread)
5. Data: <See Attached picture.>
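The benchmark procedure in steps 3 and 4 can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a small splitmix64 generator in place of std::mt19937_64, a byte-wise reference popcount as a stand-in for the function under test, and GCC/Clang builtins. All function names here are hypothetical and are not taken from the original benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* splitmix64: a small seedable PRNG, swapped in here for std::mt19937_64. */
static uint64_t
splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);

    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Fill buf with reproducible pseudo-random bytes; len must be a multiple of 8. */
static void
fill_random(unsigned char *buf, size_t len, uint64_t seed)
{
    assert(len % 8 == 0);
    for (size_t i = 0; i < len; i += 8)
    {
        uint64_t r = splitmix64(&seed);

        memcpy(buf + i, &r, sizeof(r));
    }
}

/* Stand-in for the implementation being measured. */
static uint64_t
popcount_reference(const unsigned char *buf, size_t len)
{
    uint64_t n = 0;

    for (size_t i = 0; i < len; i++)
        n += (uint64_t) __builtin_popcount(buf[i]);
    return n;
}

/* Average CPU seconds per call over `runs` calls; *bits_out gets the count. */
static double
time_popcount(const unsigned char *buf, size_t len, int runs, uint64_t *bits_out)
{
    clock_t  start = clock();
    uint64_t bits = 0;

    for (int r = 0; r < runs; r++)
        bits = popcount_reference(buf, len);
    *bits_out = bits;
    return (double) (clock() - start) / CLOCKS_PER_SEC / runs;
}
```

Re-running with the same seed and size should produce the same bit count, which gives a cheap correctness check when comparing implementations.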
The code I wrote falls back to the 64-bit or software (SW) implementation for any memory that is not aligned to a 512-bit boundary:
///////////////////////////////////////////////////////////////////////
// 512-bit intrinsic implementation (AVX512VPOPCNTDQ + AVX512F)
uint64_t popcount_512_impl(const char *bytes, int byteCount) {
#ifdef __AVX__
uint64_t result = 0;
uint64_t remainder = ((uint64_t)bytes) % 64;
result += popcount_64_impl(bytes, remainder);
byteCount -= remainder;
bytes += remainder;
uint64_t vectorCount = byteCount / 64;
remainder = byteCount % 64;
__m512i *vectors = (__m512i *)bytes;
__m512i rv;
while (vectorCount--) {
rv = _mm512_popcnt_epi64(*(vectors++));
result += _mm512_reduce_add_epi64(rv);
}
bytes = (const char *)vectors;
result += popcount_64_impl(bytes, remainder);
return result;
#else
return popcount_64_impl(bytes, byteCount);
#endif
}
There are further optimizations that can be applied here, but for demonstration I guarded the AVX512 path with the __AVX__ macro; when it is not defined, the code falls back to the original implementations in PostgreSQL.
The 46% improvement in popcount is worthy of discussion considering the previous popcount 64-bit SSE and SW implementations.
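For readers following the head/body/tail split, here is a portable sketch of the arithmetic involved. It is not the code from the proposal: the names are hypothetical, the 64-byte body is processed with scalar 64-bit popcounts where a vector kernel would go, and GCC/Clang builtins are assumed. One detail worth checking in any such split is that the number of leading bytes is the distance to the next 64-byte boundary, (64 - addr % 64) % 64, clamped to the buffer length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise popcount for short or unaligned spans. */
static uint64_t
popcount_bytes(const unsigned char *buf, size_t len)
{
    uint64_t n = 0;

    while (len--)
        n += (uint64_t) __builtin_popcount(*buf++);
    return n;
}

/* Hypothetical helper: unaligned head, 64-byte-aligned body, short tail. */
static uint64_t
popcount_split(const unsigned char *buf, size_t len)
{
    /* Bytes needed to reach the next 64-byte boundary (0 if already aligned). */
    size_t   head = (64 - ((uintptr_t) buf % 64)) % 64;
    uint64_t n;

    assert(buf != NULL || len == 0);
    if (head > len)
        head = len;
    n = popcount_bytes(buf, head);
    buf += head;
    len -= head;

    /* Body: whole 64-byte blocks; a 512-bit kernel would replace this loop. */
    while (len >= 64)
    {
        for (int k = 0; k < 8; k++)
        {
            uint64_t w;

            memcpy(&w, buf + 8 * k, sizeof(w));
            n += (uint64_t) __builtin_popcountll(w);
        }
        buf += 64;
        len -= 64;
    }

    /* Tail: fewer than 64 bytes remain. */
    return n + popcount_bytes(buf, len);
}
```

Calling popcount_split() on the same data at different starting offsets should always give the same count, which is an easy way to exercise the head and tail paths.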
Thanks,
Paul Amonson
On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share the preliminary results with the community and get feedback for adding avx512 support for popcount.
Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the pg_popcount() in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this implementation has improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.
How does this compare to older CPUs, and more mixed workloads? IIRC,
the use of AVX512 (which I believe this instruction to be included in)
has significant implications for core clock frequency when those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.
My setup:
Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
OS : Ubuntu 22.04
GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".
1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
a. Software only and
b. SSE 64 bit version
2. I created an implementation using the following AVX512 intrinsics:
a. _mm512_popcnt_epi64()
b. _mm512_reduce_add_epi64()
3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])
Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Fri, Nov 03, 2023 at 12:16:05PM +0100, Matthias van de Meent wrote:
On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
This proposal showcases the speed-up provided to popcount feature when
using AVX512 registers. The intent is to share the preliminary results
with the community and get feedback for adding avx512 support for
popcount.
Revisiting the previous discussion/improvements around this feature, I
have created a micro-benchmark based on the pg_popcount() in
PostgreSQL's current implementations for x86_64 using the newer AVX512
intrinsics. Playing with this implementation has improved performance up
to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will
benefit scenarios relying on popcount.
Nice. I've been testing out AVX2 support in src/include/port/simd.h, and
the results look promising there, too. I intend to start a new thread for
that (hopefully soon), but one open question I don't have a great answer
for yet is how to detect support for newer intrinsics. So far, we've been
able to use function pointers (e.g., popcount, crc32c) or deduce support
via common predefined compiler macros (e.g., we assume SSE2 is supported if
the compiler is targeting 64-bit x86). But the former introduces a
performance penalty, and we probably want to inline most of this stuff,
anyway. And the latter limits us to stuff that has been around for a
decade or two.
Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.
Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?
Yeah, the previous optimizations in this area appear to have used ANALYZE
as the benchmark, presumably because of visibilitymap_count(). I briefly
attempted to measure the difference with and without AVX512 support, but I
haven't noticed any difference thus far. One complication for
visibilitymap_count() is that the data passed to pg_popcount64() is masked,
which requires a couple more instructions when you're using the intrinsics.
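To illustrate the masking point, here is a minimal sketch of counting only selected bits in each word. It assumes a layout with two status bits per heap page, so that alternating-bit masks select one kind of bit; the function name and the mask values used in the usage note are illustrative rather than quoted from the PostgreSQL source, and GCC/Clang builtins are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits of each word that survive the mask. */
static uint64_t
popcount_masked(const uint64_t *words, size_t nwords, uint64_t mask)
{
    uint64_t n = 0;

    assert(words != NULL || nwords == 0);
    for (size_t i = 0; i < nwords; i++)
        n += (uint64_t) __builtin_popcountll(words[i] & mask);
    return n;
}
```

With a mask of 0x5555555555555555 this counts the even-numbered bits of each word, and with 0xAAAAAAAAAAAAAAAA the odd-numbered ones. A vectorized version needs the extra AND before each popcount, which is the additional work mentioned above.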
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.
Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization. (One
of the key things you want to be able to say, when trying to sell
a maybe-it-helps-or-maybe-it-doesnt optimization to the PG community,
is "it doesn't hurt anyone who's not able to benefit".) And you
can't argue that that overhead is negligible either, because if it
is then we're all wasting our time even discussing this. So we need
a better technology, and I fear I have no good ideas about what.
Your comment about vectorization hints at one answer: if you can
amortize the overhead across multiple applications of the operation,
then it doesn't hurt so much. But I'm not sure how often we can
make that answer work.
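A rough sketch of the amortization idea: dispatch through the function pointer once per buffer rather than once per word, so the indirect call is paid a single time for many words. The names below are hypothetical, the chooser always picks the portable routine (a real one would probe CPU features), and GCC/Clang builtins are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*popcount_buf_fn) (const unsigned char *buf, size_t len);

static uint64_t popcount_buf_choose(const unsigned char *buf, size_t len);

/* Starts out pointing at the chooser; rewritten on first use. */
static popcount_buf_fn popcount_buf = popcount_buf_choose;

/* Portable whole-buffer implementation. */
static uint64_t
popcount_buf_portable(const unsigned char *buf, size_t len)
{
    uint64_t n = 0;

    for (size_t i = 0; i < len; i++)
        n += (uint64_t) __builtin_popcount(buf[i]);
    return n;
}

/* First call lands here, installs an implementation, then forwards. */
static uint64_t
popcount_buf_choose(const unsigned char *buf, size_t len)
{
    /* Assumption: a real chooser would test CPU features at this point. */
    popcount_buf = popcount_buf_portable;
    return popcount_buf(buf, len);
}

/* One indirect call per buffer, not one per 64-bit word. */
static uint64_t
count_bits(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return popcount_buf(buf, len);
}
```

For callers that only ever pass a single word, this arrangement does not help, which matches the concern that the approach may not apply often.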
regards, tom lane
On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.
Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization.
The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:
On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.
Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization.
The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
Thanks, that seems promising for the function pointer cases. I'll plan on
trying to convert one of the existing ones to use it. BTW it looks like
LLVM has something similar [0].
IIUC this unfortunately wouldn't help for cases where we wanted to keep
stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
unless we applied it to the calling functions, but that doesn't sound
particularly maintainable.
[0]: https://llvm.org/docs/LangRef.html#ifuncs
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:
On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:
On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.
Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization.
The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
Thanks, that seems promising for the function pointer cases. I'll plan on
trying to convert one of the existing ones to use it. BTW it looks like
LLVM has something similar [0].
IIUC this unfortunately wouldn't help for cases where we wanted to keep
stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
unless we applied it to the calling functions, but that doesn't sound
particularly maintainable.
Agreed, it doesn't solve inline cases. If the gains are big enough, we should
move toward packages containing N CPU-specialized copies of the postgres
binary, with bin/postgres just exec'ing the right one.
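The launcher idea could look roughly like the sketch below. Everything here is hypothetical: the CPU levels, the binary names and the directory layout are invented for illustration, CPU detection is left to the caller, and a POSIX execv() is assumed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical capability tiers a package might ship binaries for. */
enum cpu_level { CPU_BASELINE, CPU_AVX2, CPU_AVX512 };

/* Map a detected CPU level to the name of the specialized binary. */
static const char *
binary_for_cpu(enum cpu_level level)
{
    switch (level)
    {
        case CPU_AVX512:
            return "postgres-avx512";
        case CPU_AVX2:
            return "postgres-avx2";
        default:
            return "postgres-baseline";
    }
}

/* Replace the current process with the chosen binary; returns -1 on failure. */
static int
exec_specialized(const char *bindir, enum cpu_level level, char *const argv[])
{
    char path[1024];
    int  n;

    assert(bindir != NULL && argv != NULL);
    n = snprintf(path, sizeof(path), "%s/%s", bindir, binary_for_cpu(level));
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;
    return execv(path, argv);   /* only returns if the exec failed */
}
```

A small bin/postgres stub would detect the CPU level once at startup and call exec_specialized(), so the dispatch cost is paid once per process rather than once per call.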
On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:
On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:
On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:
The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
Thanks, that seems promising for the function pointer cases. I'll plan on
trying to convert one of the existing ones to use it. BTW it looks like
LLVM has something similar [0].
IIUC this unfortunately wouldn't help for cases where we wanted to keep
stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
unless we applied it to the calling functions, but that doesn't sound
particularly maintainable.
Agreed, it doesn't solve inline cases. If the gains are big enough, we should
move toward packages containing N CPU-specialized copies of the postgres
binary, with bin/postgres just exec'ing the right one.
I performed a quick test with ifunc on my x86 machine that ordinarily uses
the runtime checks for the CRC32C code, and I actually see a consistent
3.5% regression for pg_waldump -z on 100M 65-byte records. I've attached
the patch used for testing.
The multiple-copies-of-the-postgres-binary idea seems interesting. That's
probably not something that could be enabled by default, but perhaps we
could add support for a build option.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
ifunc_test.patch (text/x-diff; charset=us-ascii)
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index d085f1dc00..6db411ee29 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -78,7 +78,7 @@ extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_
#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c pg_comp_crc32c(pg_crc32c crc, const void *data, size_t len);
#ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 41ff4a35ad..62bb981ee8 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -51,14 +51,14 @@ pg_crc32c_sse42_available(void)
* so that subsequent calls are routed directly to the chosen implementation.
*/
static pg_crc32c
-pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
+(*pg_comp_crc32c_choose (void))(pg_crc32c crc, const void *data, size_t len)
{
if (pg_crc32c_sse42_available())
- pg_comp_crc32c = pg_comp_crc32c_sse42;
+ return pg_comp_crc32c_sse42;
else
- pg_comp_crc32c = pg_comp_crc32c_sb8;
-
- return pg_comp_crc32c(crc, data, len);
+ return pg_comp_crc32c_sb8;
}
-pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c
+pg_comp_crc32c(pg_crc32c crc, const void *data, size_t len)
+ __attribute__ ((ifunc ("pg_comp_crc32c_choose")));
Sorry for the late response here. We spent some time researching and measuring the frequency impact of AVX512 instructions used here.
How does this compare to older CPUs, and more mixed workloads? IIRC,
the use of AVX512 (which I believe this instruction to be included in)
has significant implications for core clock frequency when those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.
AVX512 has light and heavy instructions. While the heavy AVX512 instructions have clock frequency implications, the light instructions largely do not. See [0] for more details. We captured EMON data for the benchmark used in this work, and it shows that the instructions run at a licensing level not meant for heavy AVX512 operations. This means the popcount instructions, _mm512_popcnt_epi64() and _mm512_reduce_add_epi64(), are not going to have any significant impact on CPU clock frequency.
Clock frequency impact aside, we measured the same benchmark on older Intel hardware and observed up to 18% better performance on Intel Icelake. On older Intel hardware, the 512-bit popcntdq instruction is not present, so this won't work. If clock frequency is not affected, the rest of the workload should not be impacted in mixed workloads.
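Since the instruction is missing on older hardware, a runtime check is one way to decide whether the AVX512 path is usable. The following is a sketch only, assuming GCC or Clang on x86 with <cpuid.h>; the function name is hypothetical. It checks that the OS saves the ZMM state (via XGETBV) and then tests the AVX512F and AVX512_VPOPCNTDQ feature bits in CPUID leaf 7. On other architectures it simply reports false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns true if the CPU and OS both support AVX512F + AVX512_VPOPCNTDQ. */
static bool
cpu_has_avx512_popcnt(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    uint32_t     xcr0_lo, xcr0_hi;

    /* Leaf 1, ECX bit 27: OSXSAVE, i.e. XGETBV is available. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return false;

    /* XCR0 bits 1,2 (SSE/AVX) and 5,6,7 (opmask and ZMM state) must be set. */
    __asm__ volatile ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));
    (void) xcr0_hi;
    if ((xcr0_lo & 0xE6u) != 0xE6u)
        return false;

    /* Leaf 7, subleaf 0: EBX bit 16 = AVX512F, ECX bit 14 = AVX512_VPOPCNTDQ. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx & (1u << 16)) != 0 && (ecx & (1u << 14)) != 0;
#else
    return false;
#endif
}
```

The result would typically be computed once and cached, for example when choosing which popcount implementation to install.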
Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?
Testing this on smaller block sizes (< 8KiB) shows slightly lower performance for AVX512 than for the current 64-bit behavior, but with a large variance, so we cannot conclude much from it. Nathan's testing with the ANALYZE benchmark also points to no visible impact from using AVX512. The gains on larger datasets are easily evident, with less variance.
What are your thoughts if we introduce AVX512 popcount for smaller sizes as an optional feature initially, and then test it more thoroughly over time on this particular use case?
Regarding enablement, following the other responses related to function inlining, using ifunc and enabling future intrinsic support, it seems a concrete solution would require further discussion. We’re attaching a patch to enable AVX512, which can use AVX512 flags during build. For example:
make -E CFLAGS_AVX512="-mavx -mavx512dq -mavx512vpopcntdq -mavx512vl -march=icelake-server -DAVX512_POPCNT=1"
Thoughts or feedback on the approach in the patch? This solution should not impact anyone who doesn’t use the feature i.e. AVX512. Open to additional ideas if this doesn’t seem like the right approach here.
[0]: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/
Attachments:
proposed_popcnt.patch (application/octet-stream)
diff --git a/src/port/Makefile b/src/port/Makefile
index 4320dee0d1..1f6cbe362f 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -93,6 +93,7 @@ pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512)
#
# Shared library versions of object files
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1f3dea2d4b..443b8b63ce 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -21,6 +21,21 @@
#include "port/pg_bitutils.h"
+#if (defined(__linux__) || defined(__linux) || defined(linux))
+#if defined(__x86_64) && defined(AVX512_POPCNT)
+/* Set macro for AVX-512 inclusion in the binary. */
+#define NEED_AVX512_POPCNTDQ 1
+
+#include <immintrin.h>
+
+/* Forward ref for AVX-512 private implementation */
+uint64 popcount_512_impl_unaligned(const char *buf, int bytes);
+#endif /* Platform and Flag for AVX-512 */
+#endif /* Linux */
+
+/* Forward refs for private refactor of 64-bit implementation */
+uint64 popcount_64_impl(const char *buf, int bytes);
+uint64 popcount_impl(const char *buf, int bytes);
/*
* Array giving the position of the left-most set bit for each possible
@@ -288,48 +303,99 @@ pg_popcount64(uint64 word)
#endif /* !TRY_POPCNT_FAST */
+inline uint64
+pg_popcnt_software(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char)*buf++];
+ return popcnt;
+}
+
/*
* pg_popcount
* Returns the number of 1-bits in buf
*/
-uint64
+inline uint64
pg_popcount(const char *buf, int bytes)
-{
- uint64 popcnt = 0;
-
+{ /* Refactored for reuse in AVX-512 implementations. */
#if SIZEOF_VOID_P >= 8
/* Process in 64-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(8, buf))
- {
- const uint64 *words = (const uint64 *) buf;
+ return popcount_impl(buf, bytes);
+ else /* If not aligned use software only */
+ return pg_popcnt_software(buf, bytes);
+#else
+ return pg_popcnt_software(buf, bytes);
+#endif
+}
- while (bytes >= 8)
- {
- popcnt += pg_popcount64(*words++);
- bytes -= 8;
- }
+/*
+ * Refactored 64-bit algorithm using the refactored software
+ * algorithm for trailing bytes.
+ */
+inline uint64
+popcount_64_impl(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
- buf = (const char *) words;
- }
-#else
- /* Process in 32-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(4, buf))
+ while (bytes >= sizeof(uint64))
{
- const uint32 *words = (const uint32 *) buf;
+ popcnt += pg_popcount64(*((const uint64 *)buf));
+ buf += sizeof(uint64);
+ bytes -= sizeof(uint64);
+ }
+
+ /* Process remaining bytes... */
+ popcnt += pg_popcnt_software(buf, bytes);
+ return popcnt;
+}
- while (bytes >= 4)
- {
- popcnt += pg_popcount32(*words++);
- bytes -= 4;
- }
+#if defined(NEED_AVX512_POPCNTDQ)
- buf = (const char *) words;
+#define LINE_SIZE_LOCAL 8192
+/*
+ * AVX-512 implementation for popcount using 64-bit algorithm
+ * for 512-bit unaligned leading and trailing portions.
+ */
+inline uint64
+popcount_512_impl_unaligned(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+ uint64 remainder = ((uint64)buf) % 64;
+ popcnt += popcount_64_impl(buf, remainder);
+ bytes -= remainder;
+ buf += remainder;
+
+ __m512i *vectors = (__m512i *)buf;
+ while (bytes >= 64) {
+ popcnt += (uint64)_mm512_reduce_add_epi64(
+ _mm512_popcnt_epi64(*(vectors++)));
+ bytes -= 64;
}
-#endif
-
- /* Process any remaining bytes */
- while (bytes--)
- popcnt += pg_number_of_ones[(unsigned char) *buf++];
+ buf = (const char *)vectors;
+ popcnt += popcount_64_impl(buf, bytes);
return popcnt;
}
+#endif
+
+/*
+ * Called by pg_popcount when architecture is 64-bit and aligned.
+ * Will default to the original 64-bit algorithm if conditions for AVX-512
+ * are not met.
+ */
+inline uint64
+popcount_impl(const char *buf, int bytes)
+{
+#if defined(NEED_AVX512_POPCNTDQ)
+ if(bytes >= 25165824) /* 24MiB */
+ /* After testing, this is the threshold where the benefit of AVX-512
+ starts. */
+ return popcount_512_impl_unaligned(buf, bytes);
+ else
+ return popcount_64_impl(buf, bytes);
+#else
+ return popcount_64_impl(buf, bytes);
+#endif
+}
On Wed, Nov 15, 2023 at 08:27:57PM +0000, Shankaran, Akash wrote:
AVX512 has light and heavy instructions. While the heavy AVX512
instructions have clock frequency implications, the light instructions
not so much. See [0] for more details. We captured EMON data for the
benchmark used in this work, and see that the instructions are using the
licensing level not meant for heavy AVX512 operations. This means the
instructions for popcount : _mm512_popcnt_epi64(),
_mm512_reduce_add_epi64() are not going to have any significant impact on
CPU clock frequency.
Clock frequency impact aside, we measured the same benchmark for gains on
older Intel hardware and observe up to 18% better performance on Intel
Icelake. On older intel hardware, the popcntdq 512 instruction is not
present so it won’t work. If clock frequency is not affected, rest of
workload should not be impacted in the case of mixed workloads.
Thanks for sharing your analysis.
Testing this on smaller block sizes < 8KiB shows that AVX512 compared to
the current 64bit behavior shows slightly lower performance, but with a
large variance. We cannot conclude much from it. The testing with ANALYZE
benchmark by Nathan also points to no visible impact as a result of using
AVX512. The gains on larger dataset is easily evident, with less
variance.
What are your thoughts if we introduce AVX512 popcount for smaller sizes
as an optional feature initially, and then test it more thoroughly over
time on this particular use case?
I don't see any need to rush this. At the very earliest, this feature
would go into v17, which doesn't enter feature freeze until April 2024.
That seems like enough time to complete any additional testing you'd like
to do. However, if you are seeing worse performance with this patch, then
it seems unlikely that we'd want to proceed.
Thoughts or feedback on the approach in the patch? This solution should
not impact anyone who doesn’t use the feature i.e. AVX512. Open to
additional ideas if this doesn’t seem like the right approach here.
It's true that it wouldn't impact anyone not using the feature, but there's
also a decent chance that this code goes virtually untested. As I've
stated elsewhere [0], I think we should ensure there's buildfarm coverage
for this kind of architecture-specific stuff.
[0]: /messages/by-id/20230726043707.GB3211130@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Sorry for the late response. We did some further testing and research on our end, and ended up modifying the AVX512-based algorithm for popcount. We removed a scalar dependency and now accumulate the results of the popcnt instruction in a zmm register, performing the reduce-add only once at the very end, similar to [0].
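The change described above can be sketched as follows. This is an illustration, not the attached patch: the function name is hypothetical, and _mm512_loadu_si512(), _mm512_add_epi64() and _mm512_setzero_si512() are intrinsics chosen here rather than named in this thread. The vector path is only compiled when the build enables AVX512F and AVX512VPOPCNTDQ; otherwise a scalar emulation with eight per-lane accumulators shows the same structure. GCC/Clang builtins are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
#include <immintrin.h>
#endif

/* Accumulate per-lane counts and reduce once, after the main loop. */
static uint64_t
popcount_deferred_reduce(const unsigned char *buf, size_t len)
{
    uint64_t total = 0;
    size_t   i = 0;

    assert(buf != NULL || len == 0);

#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
    __m512i acc = _mm512_setzero_si512();

    for (; i + 64 <= len; i += 64)
    {
        __m512i v = _mm512_loadu_si512((const void *) (buf + i));

        /* Lane-wise add keeps the loop free of a scalar dependency. */
        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(v));
    }
    total = (uint64_t) _mm512_reduce_add_epi64(acc);
#else
    /* Scalar emulation: eight 64-bit lanes, summed only at the end. */
    uint64_t lanes[8] = {0};

    for (; i + 64 <= len; i += 64)
    {
        for (int lane = 0; lane < 8; lane++)
        {
            uint64_t w;

            memcpy(&w, buf + i + 8 * lane, sizeof(w));
            lanes[lane] += (uint64_t) __builtin_popcountll(w);
        }
    }
    for (int lane = 0; lane < 8; lane++)
        total += lanes[lane];
#endif

    /* Remaining bytes that do not fill a 64-byte block. */
    for (; i < len; i++)
        total += (uint64_t) __builtin_popcount(buf[i]);
    return total;
}
```

Compared with reducing inside the loop, this moves the horizontal add out of the hot path, which is presumably where most of the reported gain for small buffers comes from.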
With the updated patch, we observed significant improvements and handily beat the previous popcount algorithm performance. No regressions in any scenario are observed:
Platform: Intel Xeon Platinum 8360Y (Icelake) for data sizes 1kb - 64kb.
Microbenchmark: 2x - 3x gains presently vs 19% previously, on the same microbenchmark described initially in this thread.
PG testing:
SQL bit_count() calls popcount. Using a Postgres benchmark calling "select bit_count(bytea(col1)) from mytable" on a table with ~2M text rows, each row 1-12kb in size, we observe (only comparing with 64bit PG implementation, which is the fastest):
1. Entire benchmark using AVX512 implementation vs PG 64-bit impl runs 6-13% faster.
2. Reduced time spent in pg_popcount() in the postgres server during the benchmark:
a. 64bit (current PG): 29.5%
b. AVX512: 3.3%
3. Reduced number of samples processed by popcount:
a. 64bit (current PG): 2.4B samples
b. AVX512: 285M samples
Compile the above patch (on a machine supporting AVX512 vpopcntdq) using: make all CFLAGS_AVX512="-DHAVE__HW_AVX512_POPCNT -mavx -mavx512vpopcntdq -mavx512f -march=native"
Attaching flamegraphs and patch for above observations.
[0]: /messages/by-id/20230726043707.GB3211130@nathanxps13
Thanks,
Akash Shankaran
Attachments:
perf-avx512-1.8mrows.svg (application/octet-stream)
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" width="1200" height="630" onload="init(evt)" viewBox="0 0 1200 630" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<!-- Flame graph stack visualization. See https://github.com/brendangregg/FlameGraph for latest version, and http://www.brendangregg.com/flamegraphs.html for examples. -->
<!-- NOTES: -->
<defs>
<linearGradient id="background" y1="0" y2="1" x1="0" x2="0" >
<stop stop-color="#eeeeee" offset="5%" />
<stop stop-color="#eeeeb0" offset="95%" />
</linearGradient>
</defs>
<style type="text/css">
text { font-family:Verdana; font-size:12px; fill:rgb(0,0,0); }
#search, #ignorecase { opacity:0.1; cursor:pointer; }
#search:hover, #search.show, #ignorecase:hover, #ignorecase.show { opacity:1; }
#subtitle { text-anchor:middle; font-color:rgb(160,160,160); }
#title { text-anchor:middle; font-size:17px}
#unzoom { cursor:pointer; }
#frames > *:hover { stroke:black; stroke-width:0.5; cursor:pointer; }
.hide { display:none; }
.parent { opacity:0.5; }
</style>
<script type="text/ecmascript">
<![CDATA[
"use strict";
var details, searchbtn, unzoombtn, matchedtxt, svg, searching, currentSearchTerm, ignorecase, ignorecaseBtn;
function init(evt) {
details = document.getElementById("details").firstChild;
searchbtn = document.getElementById("search");
ignorecaseBtn = document.getElementById("ignorecase");
unzoombtn = document.getElementById("unzoom");
matchedtxt = document.getElementById("matched");
svg = document.getElementsByTagName("svg")[0];
searching = 0;
currentSearchTerm = null;
// use GET parameters to restore a flamegraphs state.
var params = get_params();
if (params.x && params.y)
zoom(find_group(document.querySelector('[x="' + params.x + '"][y="' + params.y + '"]')));
if (params.s) search(params.s);
}
// event listeners
window.addEventListener("click", function(e) {
var target = find_group(e.target);
if (target) {
if (target.nodeName == "a") {
if (e.ctrlKey === false) return;
e.preventDefault();
}
if (target.classList.contains("parent")) unzoom(true);
zoom(target);
if (!document.querySelector('.parent')) {
// we have basically done a clearzoom so clear the url
var params = get_params();
if (params.x) delete params.x;
if (params.y) delete params.y;
history.replaceState(null, null, parse_params(params));
unzoombtn.classList.add("hide");
return;
}
// set parameters for zoom state
var el = target.querySelector("rect");
if (el && el.attributes && el.attributes.y && el.attributes._orig_x) {
var params = get_params()
params.x = el.attributes._orig_x.value;
params.y = el.attributes.y.value;
history.replaceState(null, null, parse_params(params));
}
}
else if (e.target.id == "unzoom") clearzoom();
else if (e.target.id == "search") search_prompt();
else if (e.target.id == "ignorecase") toggle_ignorecase();
}, false)
// mouse-over for info
// show
window.addEventListener("mouseover", function(e) {
var target = find_group(e.target);
if (target) details.nodeValue = "Function: " + g_to_text(target);
}, false)
// clear
window.addEventListener("mouseout", function(e) {
var target = find_group(e.target);
if (target) details.nodeValue = ' ';
}, false)
// ctrl-F for search
// ctrl-I to toggle case-sensitive search
window.addEventListener("keydown",function (e) {
if (e.keyCode === 114 || (e.ctrlKey && e.keyCode === 70)) {
e.preventDefault();
search_prompt();
}
else if (e.ctrlKey && e.keyCode === 73) {
e.preventDefault();
toggle_ignorecase();
}
}, false)
// functions
function get_params() {
var params = {};
var paramsarr = window.location.search.substr(1).split('&');
for (var i = 0; i < paramsarr.length; ++i) {
var tmp = paramsarr[i].split("=");
if (!tmp[0] || !tmp[1]) continue;
params[tmp[0]] = decodeURIComponent(tmp[1]);
}
return params;
}
function parse_params(params) {
var uri = "?";
for (var key in params) {
uri += key + '=' + encodeURIComponent(params[key]) + '&';
}
if (uri.slice(-1) == "&")
uri = uri.substring(0, uri.length - 1);
if (uri == '?')
uri = window.location.href.split('?')[0];
return uri;
}
function find_child(node, selector) {
var children = node.querySelectorAll(selector);
if (children.length) return children[0];
}
function find_group(node) {
var parent = node.parentElement;
if (!parent) return;
if (parent.id == "frames") return node;
return find_group(parent);
}
function orig_save(e, attr, val) {
if (e.attributes["_orig_" + attr] != undefined) return;
if (e.attributes[attr] == undefined) return;
if (val == undefined) val = e.attributes[attr].value;
e.setAttribute("_orig_" + attr, val);
}
function orig_load(e, attr) {
if (e.attributes["_orig_"+attr] == undefined) return;
e.attributes[attr].value = e.attributes["_orig_" + attr].value;
e.removeAttribute("_orig_"+attr);
}
function g_to_text(e) {
var text = find_child(e, "title").firstChild.nodeValue;
return (text)
}
function g_to_func(e) {
var func = g_to_text(e);
// if there's any manipulation we want to do to the function
// name before it's searched, do it here before returning.
return (func);
}
function update_text(e) {
var r = find_child(e, "rect");
var t = find_child(e, "text");
var w = parseFloat(r.attributes.width.value) -3;
var txt = find_child(e, "title").textContent.replace(/\([^(]*\)$/,"");
t.attributes.x.value = parseFloat(r.attributes.x.value) + 3;
// Smaller than this size won't fit anything
if (w < 2 * 12 * 0.59) {
t.textContent = "";
return;
}
t.textContent = txt;
var sl = t.getSubStringLength(0, txt.length);
// check if only whitespace or if we can fit the entire string into width w
if (/^ *$/.test(txt) || sl < w)
return;
// this isn't perfect, but gives a good starting point
// and avoids calling getSubStringLength too often
var start = Math.floor((w/sl) * txt.length);
for (var x = start; x > 0; x = x-2) {
if (t.getSubStringLength(0, x + 2) <= w) {
t.textContent = txt.substring(0, x) + "..";
return;
}
}
t.textContent = "";
}
// zoom
function zoom_reset(e) {
if (e.attributes != undefined) {
orig_load(e, "x");
orig_load(e, "width");
}
if (e.childNodes == undefined) return;
for (var i = 0, c = e.childNodes; i < c.length; i++) {
zoom_reset(c[i]);
}
}
function zoom_child(e, x, ratio) {
if (e.attributes != undefined) {
if (e.attributes.x != undefined) {
orig_save(e, "x");
e.attributes.x.value = (parseFloat(e.attributes.x.value) - x - 10) * ratio + 10;
if (e.tagName == "text")
e.attributes.x.value = find_child(e.parentNode, "rect[x]").attributes.x.value + 3;
}
if (e.attributes.width != undefined) {
orig_save(e, "width");
e.attributes.width.value = parseFloat(e.attributes.width.value) * ratio;
}
}
if (e.childNodes == undefined) return;
for (var i = 0, c = e.childNodes; i < c.length; i++) {
zoom_child(c[i], x - 10, ratio);
}
}
function zoom_parent(e) {
if (e.attributes) {
if (e.attributes.x != undefined) {
orig_save(e, "x");
e.attributes.x.value = 10;
}
if (e.attributes.width != undefined) {
orig_save(e, "width");
e.attributes.width.value = parseInt(svg.width.baseVal.value) - (10 * 2);
}
}
if (e.childNodes == undefined) return;
for (var i = 0, c = e.childNodes; i < c.length; i++) {
zoom_parent(c[i]);
}
}
function zoom(node) {
var attr = find_child(node, "rect").attributes;
var width = parseFloat(attr.width.value);
var xmin = parseFloat(attr.x.value);
var xmax = parseFloat(xmin + width);
var ymin = parseFloat(attr.y.value);
var ratio = (svg.width.baseVal.value - 2 * 10) / width;
// XXX: Workaround for JavaScript float issues (fix me)
var fudge = 0.0001;
unzoombtn.classList.remove("hide");
var el = document.getElementById("frames").children;
for (var i = 0; i < el.length; i++) {
var e = el[i];
var a = find_child(e, "rect").attributes;
var ex = parseFloat(a.x.value);
var ew = parseFloat(a.width.value);
var upstack;
// Is it an ancestor
if (0 == 0) {
upstack = parseFloat(a.y.value) > ymin;
} else {
upstack = parseFloat(a.y.value) < ymin;
}
if (upstack) {
// Direct ancestor
if (ex <= xmin && (ex+ew+fudge) >= xmax) {
e.classList.add("parent");
zoom_parent(e);
update_text(e);
}
// not in current path
else
e.classList.add("hide");
}
// Children maybe
else {
// no common path
if (ex < xmin || ex + fudge >= xmax) {
e.classList.add("hide");
}
else {
zoom_child(e, xmin, ratio);
update_text(e);
}
}
}
search();
}
function unzoom(dont_update_text) {
unzoombtn.classList.add("hide");
var el = document.getElementById("frames").children;
for(var i = 0; i < el.length; i++) {
el[i].classList.remove("parent");
el[i].classList.remove("hide");
zoom_reset(el[i]);
if(!dont_update_text) update_text(el[i]);
}
search();
}
function clearzoom() {
unzoom();
// remove zoom state
var params = get_params();
if (params.x) delete params.x;
if (params.y) delete params.y;
history.replaceState(null, null, parse_params(params));
}
// search
function toggle_ignorecase() {
ignorecase = !ignorecase;
if (ignorecase) {
ignorecaseBtn.classList.add("show");
} else {
ignorecaseBtn.classList.remove("show");
}
reset_search();
search();
}
function reset_search() {
var el = document.querySelectorAll("#frames rect");
for (var i = 0; i < el.length; i++) {
orig_load(el[i], "fill")
}
var params = get_params();
delete params.s;
history.replaceState(null, null, parse_params(params));
}
function search_prompt() {
if (!searching) {
var term = prompt("Enter a search term (regexp " +
"allowed, eg: ^ext4_)"
+ (ignorecase ? ", ignoring case" : "")
+ "\nPress Ctrl-i to toggle case sensitivity", "");
if (term != null) search(term);
} else {
reset_search();
searching = 0;
currentSearchTerm = null;
searchbtn.classList.remove("show");
searchbtn.firstChild.nodeValue = "Search"
matchedtxt.classList.add("hide");
matchedtxt.firstChild.nodeValue = ""
}
}
function search(term) {
if (term) currentSearchTerm = term;
var re = new RegExp(currentSearchTerm, ignorecase ? 'i' : '');
var el = document.getElementById("frames").children;
var matches = new Object();
var maxwidth = 0;
for (var i = 0; i < el.length; i++) {
var e = el[i];
var func = g_to_func(e);
var rect = find_child(e, "rect");
if (func == null || rect == null)
continue;
// Save max width. Only works as we have a root frame
var w = parseFloat(rect.attributes.width.value);
if (w > maxwidth)
maxwidth = w;
if (func.match(re)) {
// highlight
var x = parseFloat(rect.attributes.x.value);
orig_save(rect, "fill");
rect.attributes.fill.value = "rgb(230,0,230)";
// remember matches
if (matches[x] == undefined) {
matches[x] = w;
} else {
if (w > matches[x]) {
// overwrite with parent
matches[x] = w;
}
}
searching = 1;
}
}
if (!searching)
return;
var params = get_params();
params.s = currentSearchTerm;
history.replaceState(null, null, parse_params(params));
searchbtn.classList.add("show");
searchbtn.firstChild.nodeValue = "Reset Search";
// calculate percent matched, excluding vertical overlap
var count = 0;
var lastx = -1;
var lastw = 0;
var keys = Array();
for (k in matches) {
if (matches.hasOwnProperty(k))
keys.push(k);
}
// sort the matched frames by their x location
// ascending, then width descending
keys.sort(function(a, b){
return a - b;
});
// Step through frames saving only the biggest bottom-up frames
// thanks to the sort order. This relies on the tree property
// where children are always smaller than their parents.
var fudge = 0.0001; // JavaScript floating point
for (var k in keys) {
var x = parseFloat(keys[k]);
var w = matches[keys[k]];
if (x >= lastx + lastw - fudge) {
count += w;
lastx = x;
lastw = w;
}
}
// display matched percent
matchedtxt.classList.remove("hide");
var pct = 100 * count / maxwidth;
if (pct != 100) pct = pct.toFixed(1)
matchedtxt.firstChild.nodeValue = "Matched: " + pct + "%";
}
]]>
</script>
<rect x="0.0" y="0" width="1200.0" height="630.0" fill="url(#background)" />
<text id="title" x="600.00" y="24" >Flame Graph</text>
<text id="details" x="10.00" y="613" > </text>
<text id="unzoom" x="10.00" y="24" class="hide">Reset Zoom</text>
<text id="search" x="1090.00" y="24" >Search</text>
<text id="ignorecase" x="1174.00" y="24" >ic</text>
<text id="matched" x="1090.00" y="613" > </text>
<g id="frames">
<g >
<title>__schedule (9,672,160 samples, 0.08%)</title><rect x="64.4" y="277" width="0.9" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="67.37" y="287.5" ></text>
</g>
<g >
<title>GetPrivateRefCountEntry (22,101,907 samples, 0.18%)</title><rect x="108.4" y="549" width="2.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="111.40" y="559.5" ></text>
</g>
<g >
<title>charge_memcg (50,332,377 samples, 0.40%)</title><rect x="339.1" y="293" width="4.7" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="342.10" y="303.5" ></text>
</g>
<g >
<title>alloc_pages_vma (3,796,401 samples, 0.03%)</title><rect x="952.3" y="293" width="0.3" height="15.0" fill="rgb(253,224,53)" rx="2" ry="2" />
<text x="955.25" y="303.5" ></text>
</g>
<g >
<title>handle_pte_fault (10,417,895 samples, 0.08%)</title><rect x="951.8" y="325" width="1.0" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="954.83" y="335.5" ></text>
</g>
<g >
<title>check_cpu_stall (14,327,766 samples, 0.11%)</title><rect x="860.9" y="405" width="1.4" height="15.0" fill="rgb(208,16,3)" rx="2" ry="2" />
<text x="863.94" y="415.5" ></text>
</g>
<g >
<title>blk_mq_sched_dispatch_requests (68,734,431 samples, 0.55%)</title><rect x="352.9" y="213" width="6.5" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="355.89" y="223.5" ></text>
</g>
<g >
<title>AllocSetReset (100,491,196 samples, 0.80%)</title><rect x="133.2" y="533" width="9.4" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="136.16" y="543.5" ></text>
</g>
<g >
<title>_find_next_bit (2,340,299 samples, 0.02%)</title><rect x="1171.1" y="453" width="0.2" height="15.0" fill="rgb(230,116,27)" rx="2" ry="2" />
<text x="1174.06" y="463.5" ></text>
</g>
<g >
<title>rmqueue_bulk (15,939,478 samples, 0.13%)</title><rect x="284.9" y="261" width="1.5" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="287.90" y="271.5" ></text>
</g>
<g >
<title>enlargeStringInfo (16,372,638 samples, 0.13%)</title><rect x="211.9" y="517" width="1.6" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="214.91" y="527.5" ></text>
</g>
<g >
<title>enqueue_hrtimer (2,260,540 samples, 0.02%)</title><rect x="1167.6" y="405" width="0.2" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="1170.59" y="415.5" ></text>
</g>
<g >
<title>pick_next_task (4,187,609 samples, 0.03%)</title><rect x="961.5" y="277" width="0.4" height="15.0" fill="rgb(206,4,1)" rx="2" ry="2" />
<text x="964.53" y="287.5" ></text>
</g>
<g >
<title>lru_cache_add_inactive_or_unevictable (1,941,267 samples, 0.02%)</title><rect x="973.1" y="421" width="0.2" height="15.0" fill="rgb(247,196,47)" rx="2" ry="2" />
<text x="976.12" y="431.5" ></text>
</g>
<g >
<title>pg_ulltoa_n.part.0 (48,555,551 samples, 0.39%)</title><rect x="877.9" y="549" width="4.6" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="880.89" y="559.5" ></text>
</g>
<g >
<title>sbitmap_get_shallow (75,817,615 samples, 0.61%)</title><rect x="22.1" y="213" width="7.1" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="25.07" y="223.5" ></text>
</g>
<g >
<title>ext4_readahead (102,458,952 samples, 0.82%)</title><rect x="361.0" y="325" width="9.7" height="15.0" fill="rgb(252,216,51)" rx="2" ry="2" />
<text x="364.02" y="335.5" ></text>
</g>
<g >
<title>handle_irq_event (142,531,340 samples, 1.14%)</title><rect x="984.9" y="373" width="13.4" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="987.88" y="383.5" ></text>
</g>
<g >
<title>LockBufHdr (70,623,215 samples, 0.56%)</title><rect x="114.0" y="549" width="6.6" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="116.97" y="559.5" ></text>
</g>
<g >
<title>perf_event_ctx_lock_nested.constprop.0 (3,448,477 samples, 0.03%)</title><rect x="60.6" y="389" width="0.3" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="63.59" y="399.5" ></text>
</g>
<g >
<title>refill_obj_stock (1,696,849 samples, 0.01%)</title><rect x="956.4" y="197" width="0.2" height="15.0" fill="rgb(239,160,38)" rx="2" ry="2" />
<text x="959.41" y="207.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (14,327,766 samples, 0.11%)</title><rect x="860.9" y="469" width="1.4" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="863.94" y="479.5" ></text>
</g>
<g >
<title>tick_sched_timer (2,114,728 samples, 0.02%)</title><rect x="963.5" y="229" width="0.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="966.46" y="239.5" ></text>
</g>
<g >
<title>handle_mm_fault (13,620,363 samples, 0.11%)</title><rect x="951.7" y="357" width="1.3" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="954.69" y="367.5" ></text>
</g>
<g >
<title>kfree (13,737,732 samples, 0.11%)</title><rect x="955.3" y="245" width="1.3" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="958.28" y="255.5" ></text>
</g>
<g >
<title>exit_to_user_mode_loop (9,107,583 samples, 0.07%)</title><rect x="963.2" y="357" width="0.9" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="966.24" y="367.5" ></text>
</g>
<g >
<title>get_mem_cgroup_from_mm (4,504,858 samples, 0.04%)</title><rect x="971.9" y="405" width="0.5" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="974.93" y="415.5" ></text>
</g>
<g >
<title>cap_vm_enough_memory (2,761,236 samples, 0.02%)</title><rect x="940.7" y="437" width="0.2" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="943.67" y="447.5" ></text>
</g>
<g >
<title>tick_sched_handle (4,782,572 samples, 0.04%)</title><rect x="1016.1" y="341" width="0.4" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="1019.08" y="351.5" ></text>
</g>
<g >
<title>__schedule (15,695,981 samples, 0.13%)</title><rect x="265.2" y="325" width="1.5" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="268.18" y="335.5" ></text>
</g>
<g >
<title>io_schedule (15,695,981 samples, 0.13%)</title><rect x="265.2" y="357" width="1.5" height="15.0" fill="rgb(245,188,45)" rx="2" ry="2" />
<text x="268.18" y="367.5" ></text>
</g>
<g >
<title>cpuidle_governor_latency_req (1,685,403 samples, 0.01%)</title><rect x="1155.4" y="453" width="0.1" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="1158.37" y="463.5" ></text>
</g>
<g >
<title>blk_mq_run_hw_queue (68,734,431 samples, 0.55%)</title><rect x="352.9" y="261" width="6.5" height="15.0" fill="rgb(212,36,8)" rx="2" ry="2" />
<text x="355.89" y="271.5" ></text>
</g>
<g >
<title>add_wait_queue (85,633,946 samples, 0.68%)</title><rect x="82.8" y="405" width="8.1" height="15.0" fill="rgb(248,198,47)" rx="2" ry="2" />
<text x="85.85" y="415.5" ></text>
</g>
<g >
<title>schedule (15,695,981 samples, 0.13%)</title><rect x="265.2" y="341" width="1.5" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="268.18" y="351.5" ></text>
</g>
<g >
<title>do_filp_open (80,096,276 samples, 0.64%)</title><rect x="44.5" y="341" width="7.6" height="15.0" fill="rgb(211,28,6)" rx="2" ry="2" />
<text x="47.50" y="351.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (1,740,097 samples, 0.01%)</title><rect x="952.8" y="309" width="0.2" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="955.81" y="319.5" ></text>
</g>
<g >
<title>get_cpu_device (17,425,313 samples, 0.14%)</title><rect x="1157.6" y="437" width="1.6" height="15.0" fill="rgb(219,65,15)" rx="2" ry="2" />
<text x="1160.55" y="447.5" ></text>
</g>
<g >
<title>refill_stock (1,696,849 samples, 0.01%)</title><rect x="956.4" y="165" width="0.2" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="959.41" y="175.5" ></text>
</g>
<g >
<title>__alloc_pages (6,060,947 samples, 0.05%)</title><rect x="972.4" y="405" width="0.5" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="975.36" y="415.5" ></text>
</g>
<g >
<title>acpi_ps_parse_loop (2,175,907 samples, 0.02%)</title><rect x="17.1" y="405" width="0.2" height="15.0" fill="rgb(251,211,50)" rx="2" ry="2" />
<text x="20.06" y="415.5" ></text>
</g>
<g >
<title>socket_putmessage (21,088,342 samples, 0.17%)</title><rect x="216.9" y="517" width="2.0" height="15.0" fill="rgb(241,169,40)" rx="2" ry="2" />
<text x="219.87" y="527.5" ></text>
</g>
<g >
<title>vfs_read (1,624,972,568 samples, 12.99%)</title><rect x="225.7" y="485" width="153.3" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="228.75" y="495.5" >vfs_read</text>
</g>
<g >
<title>__sysvec_apic_timer_interrupt (29,021,961 samples, 0.23%)</title><rect x="1013.8" y="405" width="2.7" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="1016.80" y="415.5" ></text>
</g>
<g >
<title>__x64_sys_brk (2,761,236 samples, 0.02%)</title><rect x="940.7" y="485" width="0.2" height="15.0" fill="rgb(220,72,17)" rx="2" ry="2" />
<text x="943.67" y="495.5" ></text>
</g>
<g >
<title>kworker/17:0-ev (50,929,006 samples, 0.41%)</title><rect x="17.3" y="565" width="4.8" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="20.27" y="575.5" ></text>
</g>
<g >
<title>igb_update_stats (50,929,006 samples, 0.41%)</title><rect x="17.3" y="469" width="4.8" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="20.27" y="479.5" ></text>
</g>
<g >
<title>__brk (2,761,236 samples, 0.02%)</title><rect x="940.7" y="533" width="0.2" height="15.0" fill="rgb(225,94,22)" rx="2" ry="2" />
<text x="943.67" y="543.5" ></text>
</g>
<g >
<title>intel_pmu_enable_all (1,516,675 samples, 0.01%)</title><rect x="59.1" y="229" width="0.1" height="15.0" fill="rgb(205,4,1)" rx="2" ry="2" />
<text x="62.07" y="239.5" ></text>
</g>
<g >
<title>sched_clock (2,035,496 samples, 0.02%)</title><rect x="1170.6" y="405" width="0.2" height="15.0" fill="rgb(212,33,7)" rx="2" ry="2" />
<text x="1173.60" y="415.5" ></text>
</g>
<g >
<title>update_rq_clock (2,035,496 samples, 0.02%)</title><rect x="1170.6" y="437" width="0.2" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="1173.60" y="447.5" ></text>
</g>
<g >
<title>get_page_from_freelist (6,060,947 samples, 0.05%)</title><rect x="972.4" y="389" width="0.5" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="975.36" y="399.5" ></text>
</g>
<g >
<title>mark_page_accessed (50,502,001 samples, 0.40%)</title><rect x="374.3" y="405" width="4.7" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="377.26" y="415.5" ></text>
</g>
<g >
<title>rb_insert_color (73,555,326 samples, 0.59%)</title><rect x="1176.1" y="405" width="6.9" height="15.0" fill="rgb(238,156,37)" rx="2" ry="2" />
<text x="1179.05" y="415.5" ></text>
</g>
<g >
<title>clear_page_erms (1,936,357 samples, 0.02%)</title><rect x="952.3" y="245" width="0.1" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="955.25" y="255.5" ></text>
</g>
<g >
<title>irq_enter_rcu (2,848,025 samples, 0.02%)</title><rect x="1016.5" y="405" width="0.3" height="15.0" fill="rgb(208,15,3)" rx="2" ry="2" />
<text x="1019.54" y="415.5" ></text>
</g>
<g >
<title>__handle_mm_fault (88,590,590 samples, 0.71%)</title><rect x="964.9" y="469" width="8.4" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="967.94" y="479.5" ></text>
</g>
<g >
<title>kthread (149,182,557 samples, 1.19%)</title><rect x="29.2" y="533" width="14.1" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="32.22" y="543.5" ></text>
</g>
<g >
<title>unix_stream_recvmsg (40,619,376 samples, 0.32%)</title><rect x="954.4" y="325" width="3.8" height="15.0" fill="rgb(251,213,51)" rx="2" ry="2" />
<text x="957.38" y="335.5" ></text>
</g>
<g >
<title>acpi_ps_get_arguments.constprop.0 (2,175,907 samples, 0.02%)</title><rect x="17.1" y="389" width="0.2" height="15.0" fill="rgb(213,37,8)" rx="2" ry="2" />
<text x="20.06" y="399.5" ></text>
</g>
<g >
<title>menu_reflect (70,582,791 samples, 0.56%)</title><rect x="1160.9" y="469" width="6.7" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="1163.93" y="479.5" ></text>
</g>
<g >
<title>hrtimer_try_to_cancel.part.0 (4,846,991 samples, 0.04%)</title><rect x="1175.6" y="437" width="0.5" height="15.0" fill="rgb(253,221,52)" rx="2" ry="2" />
<text x="1178.59" y="447.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (90,071,905 samples, 0.72%)</title><rect x="1020.3" y="293" width="8.5" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="1023.28" y="303.5" ></text>
</g>
<g >
<title>printtup (20,605,975 samples, 0.16%)</title><rect x="214.9" y="517" width="2.0" height="15.0" fill="rgb(236,145,34)" rx="2" ry="2" />
<text x="217.93" y="527.5" ></text>
</g>
<g >
<title>__writeback_single_inode (75,817,615 samples, 0.61%)</title><rect x="22.1" y="405" width="7.1" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="25.07" y="415.5" ></text>
</g>
<g >
<title>__libc_start_call_main (247,517,398 samples, 1.98%)</title><rect x="940.9" y="549" width="23.4" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="943.93" y="559.5" >_..</text>
</g>
<g >
<title>do_sys_poll (180,905,682 samples, 1.45%)</title><rect x="73.9" y="469" width="17.0" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="76.86" y="479.5" ></text>
</g>
<g >
<title>ret_from_fork (75,817,615 samples, 0.61%)</title><rect x="22.1" y="549" width="7.1" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="25.07" y="559.5" ></text>
</g>
<g >
<title>page_mkclean (73,212,495 samples, 0.59%)</title><rect x="36.4" y="293" width="6.9" height="15.0" fill="rgb(209,21,5)" rx="2" ry="2" />
<text x="39.39" y="303.5" ></text>
</g>
<g >
<title>acpi_ev_asynch_execute_gpe_method (2,175,907 samples, 0.02%)</title><rect x="17.1" y="469" width="0.2" height="15.0" fill="rgb(230,115,27)" rx="2" ry="2" />
<text x="20.06" y="479.5" ></text>
</g>
<g >
<title>MemoryContextReset (16,407,870 samples, 0.13%)</title><rect x="208.3" y="517" width="1.6" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="211.32" y="527.5" ></text>
</g>
<g >
<title>kthread (2,175,907 samples, 0.02%)</title><rect x="17.1" y="533" width="0.2" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="20.06" y="543.5" ></text>
</g>
<g >
<title>evlist__prepare_workload (80,096,276 samples, 0.64%)</title><rect x="44.5" y="485" width="7.6" height="15.0" fill="rgb(237,149,35)" rx="2" ry="2" />
<text x="47.50" y="495.5" ></text>
</g>
<g >
<title>__run_timers.part.0 (90,071,905 samples, 0.72%)</title><rect x="1020.3" y="357" width="8.5" height="15.0" fill="rgb(215,49,11)" rx="2" ry="2" />
<text x="1023.28" y="367.5" ></text>
</g>
<g >
<title>do_writepages (75,817,615 samples, 0.61%)</title><rect x="22.1" y="389" width="7.1" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="25.07" y="399.5" ></text>
</g>
<g >
<title>__pagevec_lru_add_fn (32,459,806 samples, 0.26%)</title><rect x="346.9" y="293" width="3.1" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="349.94" y="303.5" ></text>
</g>
<g >
<title>mem_cgroup_update_lru_size (14,675,485 samples, 0.12%)</title><rect x="346.9" y="277" width="1.4" height="15.0" fill="rgb(205,0,0)" rx="2" ry="2" />
<text x="349.94" y="287.5" ></text>
</g>
<g >
<title>xas_find_conflict (16,991,406 samples, 0.14%)</title><rect x="345.3" y="309" width="1.6" height="15.0" fill="rgb(206,8,1)" rx="2" ry="2" />
<text x="348.34" y="319.5" ></text>
</g>
<g >
<title>_find_next_bit (74,888,722 samples, 0.60%)</title><rect x="10.0" y="373" width="7.1" height="15.0" fill="rgb(230,116,27)" rx="2" ry="2" />
<text x="13.00" y="383.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (1,624,972,568 samples, 12.99%)</title><rect x="225.7" y="533" width="153.3" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="228.75" y="543.5" >entry_SYSCALL_64_af..</text>
</g>
<g >
<title>tick_sched_timer (4,782,572 samples, 0.04%)</title><rect x="1016.1" y="357" width="0.4" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1019.08" y="367.5" ></text>
</g>
<g >
<title>alloc_cpumask_var (1,129,027 samples, 0.01%)</title><rect x="65.3" y="405" width="0.1" height="15.0" fill="rgb(248,200,48)" rx="2" ry="2" />
<text x="68.28" y="415.5" ></text>
</g>
<g >
<title>get_next_timer_interrupt (18,391,962 samples, 0.15%)</title><rect x="1159.2" y="405" width="1.7" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="1162.20" y="415.5" ></text>
</g>
<g >
<title>kworker/0:1-eve (2,175,907 samples, 0.02%)</title><rect x="17.1" y="565" width="0.2" height="15.0" fill="rgb(207,10,2)" rx="2" ry="2" />
<text x="20.06" y="575.5" ></text>
</g>
<g >
<title>show_interrupts (74,888,722 samples, 0.60%)</title><rect x="10.0" y="389" width="7.1" height="15.0" fill="rgb(214,42,10)" rx="2" ry="2" />
<text x="13.00" y="399.5" ></text>
</g>
<g >
<title>__remove_hrtimer (4,846,991 samples, 0.04%)</title><rect x="1175.6" y="421" width="0.5" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="1178.59" y="431.5" ></text>
</g>
<g >
<title>consume_skb (29,760,980 samples, 0.24%)</title><rect x="954.4" y="293" width="2.8" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="957.38" y="303.5" ></text>
</g>
<g >
<title>do_idle (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="437" width="6.8" height="15.0" fill="rgb(212,34,8)" rx="2" ry="2" />
<text x="1186.23" y="447.5" ></text>
</g>
<g >
<title>sockfd_lookup_light (1,971,543 samples, 0.02%)</title><rect x="958.2" y="341" width="0.2" height="15.0" fill="rgb(225,95,22)" rx="2" ry="2" />
<text x="961.22" y="351.5" ></text>
</g>
<g >
<title>blk_flush_plug_list (68,734,431 samples, 0.55%)</title><rect x="352.9" y="309" width="6.5" height="15.0" fill="rgb(212,36,8)" rx="2" ry="2" />
<text x="355.89" y="319.5" ></text>
</g>
<g >
<title>[unknown] (954,625,795 samples, 7.63%)</title><rect x="128.8" y="549" width="90.1" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="131.81" y="559.5" >[unknown]</text>
</g>
<g >
<title>ext4_mpage_readpages (102,458,952 samples, 0.82%)</title><rect x="361.0" y="309" width="9.7" height="15.0" fill="rgb(228,110,26)" rx="2" ry="2" />
<text x="364.02" y="319.5" ></text>
</g>
<g >
<title>__fget_files (1,079,001 samples, 0.01%)</title><rect x="52.2" y="373" width="0.1" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="55.24" y="383.5" ></text>
</g>
<g >
<title>exit_to_user_mode_prepare (2,028,853 samples, 0.02%)</title><rect x="973.3" y="469" width="0.2" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="976.30" y="479.5" ></text>
</g>
<g >
<title>__x64_sys_pread64 (1,624,972,568 samples, 12.99%)</title><rect x="225.7" y="501" width="153.3" height="15.0" fill="rgb(211,30,7)" rx="2" ry="2" />
<text x="228.75" y="511.5" >__x64_sys_pread64</text>
</g>
<g >
<title>blk_mq_dispatch_rq_list (68,734,431 samples, 0.55%)</title><rect x="352.9" y="149" width="6.5" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="355.89" y="159.5" ></text>
</g>
<g >
<title>get_page_from_freelist (86,401,947 samples, 0.69%)</title><rect x="278.3" y="293" width="8.1" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="281.25" y="303.5" ></text>
</g>
<g >
<title>__skb_datagram_iter (10,858,396 samples, 0.09%)</title><rect x="957.2" y="261" width="1.0" height="15.0" fill="rgb(244,179,42)" rx="2" ry="2" />
<text x="960.19" y="271.5" ></text>
</g>
<g >
<title>ctx_resched (18,803,291 samples, 0.15%)</title><rect x="58.5" y="245" width="1.8" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="61.53" y="255.5" ></text>
</g>
<g >
<title>enqueue_entity (6,006,388 samples, 0.05%)</title><rect x="1169.8" y="389" width="0.6" height="15.0" fill="rgb(218,62,15)" rx="2" ry="2" />
<text x="1172.84" y="399.5" ></text>
</g>
<g >
<title>blk_complete_reqs (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="357" width="7.1" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="1009.64" y="367.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (50,789,831 samples, 0.41%)</title><rect x="953.9" y="405" width="4.7" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="956.85" y="415.5" ></text>
</g>
<g >
<title>ata_scsi_queuecmd (31,667,067 samples, 0.25%)</title><rect x="354.7" y="101" width="3.0" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="357.74" y="111.5" ></text>
</g>
<g >
<title>load_balance (36,794,615 samples, 0.29%)</title><rect x="1016.8" y="325" width="3.5" height="15.0" fill="rgb(226,96,23)" rx="2" ry="2" />
<text x="1019.80" y="335.5" ></text>
</g>
<g >
<title>try_charge_memcg (16,552,524 samples, 0.13%)</title><rect x="342.3" y="277" width="1.5" height="15.0" fill="rgb(210,27,6)" rx="2" ry="2" />
<text x="345.29" y="287.5" ></text>
</g>
<g >
<title>default_wake_function (20,967,575 samples, 0.17%)</title><rect x="387.8" y="357" width="2.0" height="15.0" fill="rgb(240,162,38)" rx="2" ry="2" />
<text x="390.78" y="367.5" ></text>
</g>
<g >
<title>cmd_record (80,096,276 samples, 0.64%)</title><rect x="44.5" y="501" width="7.6" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="47.50" y="511.5" ></text>
</g>
<g >
<title>blkdev_writepages (75,970,062 samples, 0.61%)</title><rect x="29.2" y="373" width="7.2" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="32.22" y="383.5" ></text>
</g>
<g >
<title>uncharge_batch (1,477,863 samples, 0.01%)</title><rect x="954.8" y="213" width="0.2" height="15.0" fill="rgb(220,72,17)" rx="2" ry="2" />
<text x="957.83" y="223.5" ></text>
</g>
<g >
<title>mem_cgroup_charge_statistics.constprop.0 (14,249,616 samples, 0.11%)</title><rect x="340.9" y="277" width="1.4" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="343.94" y="287.5" ></text>
</g>
<g >
<title>ext4_writepages (75,817,615 samples, 0.61%)</title><rect x="22.1" y="373" width="7.1" height="15.0" fill="rgb(240,165,39)" rx="2" ry="2" />
<text x="25.07" y="383.5" ></text>
</g>
<g >
<title>__libc_recv (50,789,831 samples, 0.41%)</title><rect x="953.9" y="421" width="4.7" height="15.0" fill="rgb(233,129,31)" rx="2" ry="2" />
<text x="956.85" y="431.5" ></text>
</g>
<g >
<title>__x64_sys_poll (145,676,058 samples, 1.16%)</title><rect x="90.9" y="469" width="13.8" height="15.0" fill="rgb(245,186,44)" rx="2" ry="2" />
<text x="93.92" y="479.5" ></text>
</g>
<g >
<title>__handle_irq_event_percpu (142,531,340 samples, 1.14%)</title><rect x="984.9" y="357" width="13.4" height="15.0" fill="rgb(236,147,35)" rx="2" ry="2" />
<text x="987.88" y="367.5" ></text>
</g>
<g >
<title>wake_up_q (11,868,689 samples, 0.09%)</title><rect x="63.2" y="293" width="1.2" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="66.25" y="303.5" ></text>
</g>
<g >
<title>hash_bytes (36,976,677 samples, 0.30%)</title><rect x="862.3" y="549" width="3.5" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="865.29" y="559.5" ></text>
</g>
<g >
<title>nr_iowait_cpu (2,848,025 samples, 0.02%)</title><rect x="1016.5" y="373" width="0.3" height="15.0" fill="rgb(252,216,51)" rx="2" ry="2" />
<text x="1019.54" y="383.5" ></text>
</g>
<g >
<title>pg_popcount (284,744,022 samples, 2.28%)</title><rect x="159.5" y="517" width="26.9" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="162.51" y="527.5" >p..</text>
</g>
<g >
<title>acpi_ps_get_next_arg (2,175,907 samples, 0.02%)</title><rect x="17.1" y="373" width="0.2" height="15.0" fill="rgb(241,167,39)" rx="2" ry="2" />
<text x="20.06" y="383.5" ></text>
</g>
<g >
<title>mpage_process_page_bufs (73,212,495 samples, 0.59%)</title><rect x="36.4" y="341" width="6.9" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="39.39" y="351.5" ></text>
</g>
<g >
<title>cpu_startup_entry (2,217,817,685 samples, 17.73%)</title><rect x="974.0" y="517" width="209.2" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="977.03" y="527.5" >cpu_startup_entry</text>
</g>
<g >
<title>main (231,117,119 samples, 1.85%)</title><rect x="52.1" y="533" width="21.8" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="55.06" y="543.5" >m..</text>
</g>
<g >
<title>lock_page_lruvec_irqsave (15,514,257 samples, 0.12%)</title><rect x="350.0" y="293" width="1.5" height="15.0" fill="rgb(236,142,34)" rx="2" ry="2" />
<text x="353.00" y="303.5" ></text>
</g>
<g >
<title>cmd_record (231,117,119 samples, 1.85%)</title><rect x="52.1" y="501" width="21.8" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="55.06" y="511.5" >c..</text>
</g>
<g >
<title>sg_pool_alloc (17,372,145 samples, 0.14%)</title><rect x="357.7" y="37" width="1.7" height="15.0" fill="rgb(244,179,43)" rx="2" ry="2" />
<text x="360.73" y="47.5" ></text>
</g>
<g >
<title>ondemand_readahead (1,140,737,731 samples, 9.12%)</title><rect x="266.7" y="373" width="107.6" height="15.0" fill="rgb(251,213,51)" rx="2" ry="2" />
<text x="269.66" y="383.5" >ondemand_read..</text>
</g>
<g >
<title>do_syscall_64 (26,594,942 samples, 0.21%)</title><rect x="63.0" y="421" width="2.5" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="66.03" y="431.5" ></text>
</g>
<g >
<title>alloc_cpumask_var_node (1,146,988 samples, 0.01%)</title><rect x="63.1" y="373" width="0.1" height="15.0" fill="rgb(208,17,4)" rx="2" ry="2" />
<text x="66.14" y="383.5" ></text>
</g>
<g >
<title>main (80,096,276 samples, 0.64%)</title><rect x="44.5" y="533" width="7.6" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="47.50" y="543.5" ></text>
</g>
<g >
<title>submit_bio (31,205,561 samples, 0.25%)</title><rect x="366.4" y="293" width="3.0" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="369.42" y="303.5" ></text>
</g>
<g >
<title>__fget_light (6,089,936 samples, 0.05%)</title><rect x="960.1" y="341" width="0.5" height="15.0" fill="rgb(233,132,31)" rx="2" ry="2" />
<text x="963.06" y="351.5" ></text>
</g>
<g >
<title>all (12,509,618,880 samples, 100%)</title><rect x="10.0" y="581" width="1180.0" height="15.0" fill="rgb(213,39,9)" rx="2" ry="2" />
<text x="13.00" y="591.5" ></text>
</g>
<g >
<title>irq_exit_rcu (15,968,797 samples, 0.13%)</title><rect x="932.8" y="501" width="1.5" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="935.79" y="511.5" ></text>
</g>
<g >
<title>[unknown] (427,847,396 samples, 3.42%)</title><rect x="152.2" y="533" width="40.4" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="155.22" y="543.5" >[un..</text>
</g>
<g >
<title>ata_qc_complete (65,672,873 samples, 0.52%)</title><rect x="992.1" y="277" width="6.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="995.13" y="287.5" ></text>
</g>
<g >
<title>[postgres] (14,280,423 samples, 0.11%)</title><rect x="150.9" y="533" width="1.3" height="15.0" fill="rgb(245,184,44)" rx="2" ry="2" />
<text x="153.88" y="543.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (16,883,997 samples, 0.13%)</title><rect x="356.1" y="69" width="1.6" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="359.14" y="79.5" ></text>
</g>
<g >
<title>enlargeStringInfo (46,688,075 samples, 0.37%)</title><rect x="155.1" y="517" width="4.4" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="158.11" y="527.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (14,792,943 samples, 0.12%)</title><rect x="158.1" y="453" width="1.4" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="161.12" y="463.5" ></text>
</g>
<g >
<title>__schedule (1,812,237 samples, 0.01%)</title><rect x="960.6" y="309" width="0.2" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="963.64" y="319.5" ></text>
</g>
<g >
<title>perf_event_task_tick (4,782,572 samples, 0.04%)</title><rect x="1016.1" y="293" width="0.4" height="15.0" fill="rgb(205,3,0)" rx="2" ry="2" />
<text x="1019.08" y="303.5" ></text>
</g>
<g >
<title>_nohz_idle_balance.constprop.0.isra.0 (36,794,615 samples, 0.29%)</title><rect x="1016.8" y="357" width="3.5" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1019.80" y="367.5" ></text>
</g>
<g >
<title>scheduler_tick (14,792,943 samples, 0.12%)</title><rect x="158.1" y="373" width="1.4" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="161.12" y="383.5" ></text>
</g>
<g >
<title>account_system_index_time (2,114,728 samples, 0.02%)</title><rect x="963.5" y="149" width="0.2" height="15.0" fill="rgb(213,41,9)" rx="2" ry="2" />
<text x="966.46" y="159.5" ></text>
</g>
<g >
<title>kworker/u290:0- (75,817,615 samples, 0.61%)</title><rect x="22.1" y="565" width="7.1" height="15.0" fill="rgb(233,132,31)" rx="2" ry="2" />
<text x="25.07" y="575.5" ></text>
</g>
<g >
<title>tick_sched_handle (14,327,766 samples, 0.11%)</title><rect x="860.9" y="437" width="1.4" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="863.94" y="447.5" ></text>
</g>
<g >
<title>poll_freewait (70,722,188 samples, 0.57%)</title><rect x="98.0" y="437" width="6.7" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="100.99" y="447.5" ></text>
</g>
<g >
<title>__alloc_pages (1,936,357 samples, 0.02%)</title><rect x="952.3" y="277" width="0.1" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="955.25" y="287.5" ></text>
</g>
<g >
<title>__writeback_single_inode (149,182,557 samples, 1.19%)</title><rect x="29.2" y="405" width="14.1" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="32.22" y="415.5" ></text>
</g>
<g >
<title>alloc_pages (102,668,270 samples, 0.82%)</title><rect x="276.7" y="325" width="9.7" height="15.0" fill="rgb(210,25,6)" rx="2" ry="2" />
<text x="279.72" y="335.5" ></text>
</g>
<g >
<title>perf-exec (80,096,276 samples, 0.64%)</title><rect x="44.5" y="565" width="7.6" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="47.50" y="575.5" ></text>
</g>
<g >
<title>start_secondary (2,217,817,685 samples, 17.73%)</title><rect x="974.0" y="533" width="209.2" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="977.03" y="543.5" >start_secondary</text>
</g>
<g >
<title>perf (557,698,859 samples, 4.46%)</title><rect x="52.1" y="565" width="52.6" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="55.06" y="575.5" >perf</text>
</g>
<g >
<title>lapic_next_deadline (2,639,585 samples, 0.02%)</title><rect x="1167.8" y="373" width="0.3" height="15.0" fill="rgb(222,82,19)" rx="2" ry="2" />
<text x="1170.80" y="383.5" ></text>
</g>
<g >
<title>__submit_bio (75,817,615 samples, 0.61%)</title><rect x="22.1" y="309" width="7.1" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="25.07" y="319.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (14,327,766 samples, 0.11%)</title><rect x="860.9" y="485" width="1.4" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="863.94" y="495.5" ></text>
</g>
<g >
<title>__sysvec_apic_timer_interrupt (14,327,766 samples, 0.11%)</title><rect x="860.9" y="501" width="1.4" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="863.94" y="511.5" ></text>
</g>
<g >
<title>import_single_range (2,566,281 samples, 0.02%)</title><rect x="958.4" y="357" width="0.2" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="961.40" y="367.5" ></text>
</g>
<g >
<title>__wake_up_common_lock (20,967,575 samples, 0.17%)</title><rect x="387.8" y="405" width="2.0" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="390.78" y="415.5" ></text>
</g>
<g >
<title>__blk_mq_get_tag (75,817,615 samples, 0.61%)</title><rect x="22.1" y="245" width="7.1" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="25.07" y="255.5" ></text>
</g>
<g >
<title>hrtimer_get_next_event (18,391,962 samples, 0.15%)</title><rect x="1159.2" y="389" width="1.7" height="15.0" fill="rgb(211,31,7)" rx="2" ry="2" />
<text x="1162.20" y="399.5" ></text>
</g>
<g >
<title>do_poll.constprop.0 (33,456,606 samples, 0.27%)</title><rect x="959.9" y="357" width="3.1" height="15.0" fill="rgb(236,145,34)" rx="2" ry="2" />
<text x="962.86" y="367.5" ></text>
</g>
<g >
<title>bio_put (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="213" width="7.1" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="1009.64" y="223.5" ></text>
</g>
<g >
<title>__pagevec_lru_add (47,974,063 samples, 0.38%)</title><rect x="346.9" y="309" width="4.6" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="349.94" y="319.5" ></text>
</g>
<g >
<title>schedule_hrtimeout_range (18,179,140 samples, 0.15%)</title><rect x="960.6" y="341" width="1.8" height="15.0" fill="rgb(221,75,18)" rx="2" ry="2" />
<text x="963.64" y="351.5" ></text>
</g>
<g >
<title>do_writepages (149,182,557 samples, 1.19%)</title><rect x="29.2" y="389" width="14.1" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="32.22" y="399.5" ></text>
</g>
<g >
<title>hrtimer_update_next_event (2,900,536 samples, 0.02%)</title><rect x="1175.6" y="405" width="0.3" height="15.0" fill="rgb(225,92,22)" rx="2" ry="2" />
<text x="1178.59" y="415.5" ></text>
</g>
<g >
<title>drain_stock (1,696,849 samples, 0.01%)</title><rect x="956.4" y="149" width="0.2" height="15.0" fill="rgb(211,31,7)" rx="2" ry="2" />
<text x="959.41" y="159.5" ></text>
</g>
<g >
<title>cpumask_next_wrap (1,938,542 samples, 0.02%)</title><rect x="1171.4" y="453" width="0.2" height="15.0" fill="rgb(209,21,5)" rx="2" ry="2" />
<text x="1174.45" y="463.5" ></text>
</g>
<g >
<title>task_tick_fair (14,792,943 samples, 0.12%)</title><rect x="158.1" y="357" width="1.4" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="161.12" y="367.5" ></text>
</g>
<g >
<title>ahci_handle_port_intr (65,672,873 samples, 0.52%)</title><rect x="992.1" y="325" width="6.2" height="15.0" fill="rgb(250,207,49)" rx="2" ry="2" />
<text x="995.13" y="335.5" ></text>
</g>
<g >
<title>ExecStoreBufferHeapTuple (39,610,138 samples, 0.32%)</title><rect x="104.7" y="549" width="3.7" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="107.66" y="559.5" ></text>
</g>
<g >
<title>__GI___execve (80,096,276 samples, 0.64%)</title><rect x="44.5" y="469" width="7.6" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="47.50" y="479.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath.part.0 (90,071,905 samples, 0.72%)</title><rect x="1020.3" y="277" width="8.5" height="15.0" fill="rgb(248,198,47)" rx="2" ry="2" />
<text x="1023.28" y="287.5" ></text>
</g>
<g >
<title>cpuidle_enter_state (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="389" width="6.8" height="15.0" fill="rgb(221,73,17)" rx="2" ry="2" />
<text x="1186.23" y="399.5" ></text>
</g>
<g >
<title>blk_done_softirq (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="373" width="7.1" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="1009.64" y="383.5" ></text>
</g>
<g >
<title>__softirqentry_text_start (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="325" width="6.8" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1186.23" y="335.5" ></text>
</g>
<g >
<title>ksys_read (74,888,722 samples, 0.60%)</title><rect x="10.0" y="469" width="7.1" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="13.00" y="479.5" ></text>
</g>
<g >
<title>rebalance_domains (36,794,615 samples, 0.29%)</title><rect x="1016.8" y="341" width="3.5" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="1019.80" y="351.5" ></text>
</g>
<g >
<title>run_rebalance_domains (36,794,615 samples, 0.29%)</title><rect x="1016.8" y="373" width="3.5" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="1019.80" y="383.5" ></text>
</g>
<g >
<title>__blk_mq_delay_run_hw_queue (68,734,431 samples, 0.55%)</title><rect x="352.9" y="245" width="6.5" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="355.89" y="255.5" ></text>
</g>
<g >
<title>wb_writeback (149,182,557 samples, 1.19%)</title><rect x="29.2" y="453" width="14.1" height="15.0" fill="rgb(222,80,19)" rx="2" ry="2" />
<text x="32.22" y="463.5" ></text>
</g>
<g >
<title>pick_next_task_idle (2,662,771 samples, 0.02%)</title><rect x="961.7" y="261" width="0.2" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="964.68" y="271.5" ></text>
</g>
<g >
<title>mem_cgroup_from_task (1,740,097 samples, 0.01%)</title><rect x="952.8" y="341" width="0.2" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="955.81" y="351.5" ></text>
</g>
<g >
<title>secondary_startup_64_no_verify (2,289,611,318 samples, 18.30%)</title><rect x="974.0" y="549" width="216.0" height="15.0" fill="rgb(239,157,37)" rx="2" ry="2" />
<text x="977.03" y="559.5" >secondary_startup_64_no_verify</text>
</g>
<g >
<title>prepare_task_switch (9,672,160 samples, 0.08%)</title><rect x="64.4" y="261" width="0.9" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="67.37" y="271.5" ></text>
</g>
<g >
<title>sync_regs (3,981,383 samples, 0.03%)</title><rect x="973.5" y="533" width="0.4" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="976.49" y="543.5" ></text>
</g>
<g >
<title>__x64_sys_ioctl (93,981,159 samples, 0.75%)</title><rect x="52.1" y="421" width="8.9" height="15.0" fill="rgb(249,204,49)" rx="2" ry="2" />
<text x="55.13" y="431.5" ></text>
</g>
<g >
<title>pglz_decompress (549,423,665 samples, 4.39%)</title><rect x="882.5" y="549" width="51.8" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="885.47" y="559.5" >pglz_..</text>
</g>
<g >
<title>__common_interrupt (230,629,896 samples, 1.84%)</title><rect x="984.9" y="405" width="21.7" height="15.0" fill="rgb(237,150,36)" rx="2" ry="2" />
<text x="987.88" y="415.5" >_..</text>
</g>
<g >
<title>hrtimer_cancel (4,846,991 samples, 0.04%)</title><rect x="1175.6" y="453" width="0.5" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="1178.59" y="463.5" ></text>
</g>
<g >
<title>wb_workfn (149,182,557 samples, 1.19%)</title><rect x="29.2" y="485" width="14.1" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="32.22" y="495.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (1,740,097 samples, 0.01%)</title><rect x="952.8" y="325" width="0.2" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="955.81" y="335.5" ></text>
</g>
<g >
<title>rcu_idle_exit (16,563,562 samples, 0.13%)</title><rect x="1153.8" y="453" width="1.6" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="1156.81" y="463.5" ></text>
</g>
<g >
<title>mpage_prepare_extent_to_map (73,212,495 samples, 0.59%)</title><rect x="36.4" y="357" width="6.9" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="39.39" y="367.5" ></text>
</g>
<g >
<title>__put_page (4,752,458 samples, 0.04%)</title><rect x="954.8" y="261" width="0.5" height="15.0" fill="rgb(225,96,22)" rx="2" ry="2" />
<text x="957.83" y="271.5" ></text>
</g>
<g >
<title>page_counter_cancel (1,696,849 samples, 0.01%)</title><rect x="956.4" y="117" width="0.2" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="959.41" y="127.5" ></text>
</g>
<g >
<title>perf_evlist__poll_thread (145,676,058 samples, 1.16%)</title><rect x="90.9" y="533" width="13.8" height="15.0" fill="rgb(236,142,34)" rx="2" ry="2" />
<text x="93.92" y="543.5" ></text>
</g>
<g >
<title>update_process_times (14,327,766 samples, 0.11%)</title><rect x="860.9" y="421" width="1.4" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="863.94" y="431.5" ></text>
</g>
<g >
<title>call_timer_fn (90,071,905 samples, 0.72%)</title><rect x="1020.3" y="341" width="8.5" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1023.28" y="351.5" ></text>
</g>
<g >
<title>generic_writepages (75,970,062 samples, 0.61%)</title><rect x="29.2" y="357" width="7.2" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="32.22" y="367.5" ></text>
</g>
<g >
<title>lru_cache_add (2,128,259 samples, 0.02%)</title><rect x="952.6" y="277" width="0.2" height="15.0" fill="rgb(247,196,47)" rx="2" ry="2" />
<text x="955.61" y="287.5" ></text>
</g>
<g >
<title>__mem_cgroup_uncharge (1,477,863 samples, 0.01%)</title><rect x="954.8" y="229" width="0.2" height="15.0" fill="rgb(251,215,51)" rx="2" ry="2" />
<text x="957.83" y="239.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (2,114,728 samples, 0.02%)</title><rect x="963.5" y="309" width="0.2" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="966.46" y="319.5" ></text>
</g>
<g >
<title>tick_program_event (1,110,320 samples, 0.01%)</title><rect x="1168.1" y="405" width="0.1" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="1171.05" y="415.5" ></text>
</g>
<g >
<title>irqentry_exit_to_user_mode (2,028,853 samples, 0.02%)</title><rect x="973.3" y="485" width="0.2" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="976.30" y="495.5" ></text>
</g>
<g >
<title>__perf_event_enable (25,281,303 samples, 0.20%)</title><rect x="57.9" y="261" width="2.4" height="15.0" fill="rgb(237,149,35)" rx="2" ry="2" />
<text x="60.92" y="271.5" ></text>
</g>
<g >
<title>ttwu_do_activate (10,317,345 samples, 0.08%)</title><rect x="1169.6" y="437" width="1.0" height="15.0" fill="rgb(215,48,11)" rx="2" ry="2" />
<text x="1172.63" y="447.5" ></text>
</g>
<g >
<title>lru_cache_add (1,941,267 samples, 0.02%)</title><rect x="973.1" y="405" width="0.2" height="15.0" fill="rgb(247,196,47)" rx="2" ry="2" />
<text x="976.12" y="415.5" ></text>
</g>
<g >
<title>run_rebalance_domains (15,968,797 samples, 0.13%)</title><rect x="932.8" y="469" width="1.5" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="935.79" y="479.5" ></text>
</g>
<g >
<title>igb_rd32 (50,929,006 samples, 0.41%)</title><rect x="17.3" y="453" width="4.8" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="20.27" y="463.5" ></text>
</g>
<g >
<title>seq_read_iter (74,888,722 samples, 0.60%)</title><rect x="10.0" y="405" width="7.1" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="13.00" y="415.5" ></text>
</g>
<g >
<title>mod_objcg_state (5,070,231 samples, 0.04%)</title><rect x="955.9" y="213" width="0.5" height="15.0" fill="rgb(208,16,3)" rx="2" ry="2" />
<text x="958.94" y="223.5" ></text>
</g>
<g >
<title>__switch_to (22,325,798 samples, 0.18%)</title><rect x="218.9" y="533" width="2.1" height="15.0" fill="rgb(205,2,0)" rx="2" ry="2" />
<text x="221.86" y="543.5" ></text>
</g>
<g >
<title>__cond_resched (16,763,600 samples, 0.13%)</title><rect x="225.7" y="421" width="1.6" height="15.0" fill="rgb(217,58,14)" rx="2" ry="2" />
<text x="228.75" y="431.5" ></text>
</g>
<g >
<title>cpuidle_enter (1,907,613,734 samples, 15.25%)</title><rect x="975.4" y="469" width="180.0" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="978.43" y="479.5" >cpuidle_enter</text>
</g>
<g >
<title>psi_group_change (15,695,981 samples, 0.13%)</title><rect x="265.2" y="293" width="1.5" height="15.0" fill="rgb(226,101,24)" rx="2" ry="2" />
<text x="268.18" y="303.5" ></text>
</g>
<g >
<title>__wake_up_sync_key (20,967,575 samples, 0.17%)</title><rect x="387.8" y="421" width="2.0" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="390.78" y="431.5" ></text>
</g>
<g >
<title>perf_event_for_each_child (84,441,226 samples, 0.68%)</title><rect x="52.6" y="373" width="8.0" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="55.63" y="383.5" ></text>
</g>
<g >
<title>cpu_startup_entry (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="453" width="6.8" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="1186.23" y="463.5" ></text>
</g>
<g >
<title>__kmalloc_node (1,203,109 samples, 0.01%)</title><rect x="63.0" y="373" width="0.1" height="15.0" fill="rgb(222,80,19)" rx="2" ry="2" />
<text x="66.03" y="383.5" ></text>
</g>
<g >
<title>memset (16,299,198 samples, 0.13%)</title><rect x="364.9" y="245" width="1.5" height="15.0" fill="rgb(208,14,3)" rx="2" ry="2" />
<text x="367.89" y="255.5" ></text>
</g>
<g >
<title>StartBufferIO (18,745,888 samples, 0.15%)</title><rect x="125.1" y="549" width="1.7" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="128.07" y="559.5" ></text>
</g>
<g >
<title>xa_load (13,936,765 samples, 0.11%)</title><rect x="369.4" y="293" width="1.3" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="372.37" y="303.5" ></text>
</g>
<g >
<title>perf_poll (74,611,132 samples, 0.60%)</title><rect x="90.9" y="421" width="7.1" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="93.94" y="431.5" ></text>
</g>
<g >
<title>rb_next (24,239,389 samples, 0.19%)</title><rect x="1013.8" y="341" width="2.3" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="1016.80" y="351.5" ></text>
</g>
<g >
<title>asm_common_interrupt (306,544,023 samples, 2.45%)</title><rect x="984.9" y="437" width="28.9" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="987.88" y="447.5" >as..</text>
</g>
<g >
<title>generic_exec_single (52,395,966 samples, 0.42%)</title><rect x="55.4" y="309" width="4.9" height="15.0" fill="rgb(234,134,32)" rx="2" ry="2" />
<text x="58.36" y="319.5" ></text>
</g>
<g >
<title>refresh_cpu_vm_stats (6,285,959 samples, 0.05%)</title><rect x="1168.8" y="421" width="0.6" height="15.0" fill="rgb(224,91,21)" rx="2" ry="2" />
<text x="1171.82" y="431.5" ></text>
</g>
<g >
<title>irqbalance (74,888,722 samples, 0.60%)</title><rect x="10.0" y="565" width="7.1" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="13.00" y="575.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (15,968,797 samples, 0.13%)</title><rect x="932.8" y="533" width="1.5" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="935.79" y="543.5" ></text>
</g>
<g >
<title>evlist_cpu_iterator__next (3,460,095 samples, 0.03%)</title><rect x="65.6" y="485" width="0.4" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="68.65" y="495.5" ></text>
</g>
<g >
<title>rmap_walk (73,212,495 samples, 0.59%)</title><rect x="36.4" y="277" width="6.9" height="15.0" fill="rgb(224,91,21)" rx="2" ry="2" />
<text x="39.39" y="287.5" ></text>
</g>
<g >
<title>acpi_os_execute_deferred (2,175,907 samples, 0.02%)</title><rect x="17.1" y="485" width="0.2" height="15.0" fill="rgb(236,144,34)" rx="2" ry="2" />
<text x="20.06" y="495.5" ></text>
</g>
<g >
<title>update_sg_lb_stats (36,794,615 samples, 0.29%)</title><rect x="1016.8" y="277" width="3.5" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="1019.80" y="287.5" ></text>
</g>
<g >
<title>cpuidle_enter_state (1,789,216,629 samples, 14.30%)</title><rect x="984.9" y="453" width="168.8" height="15.0" fill="rgb(221,73,17)" rx="2" ry="2" />
<text x="987.88" y="463.5" >cpuidle_enter_state</text>
</g>
<g >
<title>skb_copy_datagram_iter (10,858,396 samples, 0.09%)</title><rect x="957.2" y="277" width="1.0" height="15.0" fill="rgb(245,188,45)" rx="2" ry="2" />
<text x="960.19" y="287.5" ></text>
</g>
<g >
<title>tick_sched_handle (14,792,943 samples, 0.12%)</title><rect x="158.1" y="405" width="1.4" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="161.12" y="415.5" ></text>
</g>
<g >
<title>run_builtin (231,117,119 samples, 1.85%)</title><rect x="52.1" y="517" width="21.8" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="55.06" y="527.5" >r..</text>
</g>
<g >
<title>_int_malloc (22,348,726 samples, 0.18%)</title><rect x="389.8" y="549" width="2.1" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="392.76" y="559.5" ></text>
</g>
<g >
<title>bprm_execve.part.0 (80,096,276 samples, 0.64%)</title><rect x="44.5" y="373" width="7.6" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="47.50" y="383.5" ></text>
</g>
<g >
<title>__libc_start_call_main (231,117,119 samples, 1.85%)</title><rect x="52.1" y="549" width="21.8" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="55.06" y="559.5" >_..</text>
</g>
<g >
<title>_perf_event_enable (77,904,468 samples, 0.62%)</title><rect x="53.0" y="357" width="7.4" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="56.02" y="367.5" ></text>
</g>
<g >
<title>mpage_submit_page (73,212,495 samples, 0.59%)</title><rect x="36.4" y="325" width="6.9" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="39.39" y="335.5" ></text>
</g>
<g >
<title>psi_task_switch (17,158,135 samples, 0.14%)</title><rect x="1174.0" y="453" width="1.6" height="15.0" fill="rgb(230,118,28)" rx="2" ry="2" />
<text x="1176.98" y="463.5" ></text>
</g>
<g >
<title>blk_mq_sched_insert_requests (68,734,431 samples, 0.55%)</title><rect x="352.9" y="277" width="6.5" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="355.89" y="287.5" ></text>
</g>
<g >
<title>__slab_free (2,495,996 samples, 0.02%)</title><rect x="955.5" y="229" width="0.2" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="958.51" y="239.5" ></text>
</g>
<g >
<title>__fget_light (1,079,001 samples, 0.01%)</title><rect x="52.2" y="389" width="0.1" height="15.0" fill="rgb(233,132,31)" rx="2" ry="2" />
<text x="55.24" y="399.5" ></text>
</g>
<g >
<title>save_fpregs_to_fpstate (1,894,387 samples, 0.02%)</title><rect x="964.1" y="421" width="0.2" height="15.0" fill="rgb(239,157,37)" rx="2" ry="2" />
<text x="967.10" y="431.5" ></text>
</g>
<g >
<title>bio_free (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="197" width="7.1" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1009.64" y="207.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="357" width="6.8" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="1186.23" y="367.5" ></text>
</g>
<g >
<title>tick_nohz_get_sleep_length (18,391,962 samples, 0.15%)</title><rect x="1159.2" y="437" width="1.7" height="15.0" fill="rgb(229,112,26)" rx="2" ry="2" />
<text x="1162.20" y="447.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (26,594,942 samples, 0.21%)</title><rect x="63.0" y="437" width="2.5" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="66.03" y="447.5" ></text>
</g>
<g >
<title>calc_load_nohz_stop (14,891,452 samples, 0.12%)</title><rect x="974.0" y="485" width="1.4" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="977.03" y="495.5" ></text>
</g>
<g >
<title>psql (351,906,959 samples, 2.81%)</title><rect x="940.7" y="565" width="33.2" height="15.0" fill="rgb(219,66,15)" rx="2" ry="2" />
<text x="943.67" y="575.5" >psql</text>
</g>
<g >
<title>__errno_location (9,288,899 samples, 0.07%)</title><rect x="953.0" y="421" width="0.9" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="955.98" y="431.5" ></text>
</g>
<g >
<title>sched_clock_cpu (2,165,528 samples, 0.02%)</title><rect x="1153.5" y="437" width="0.2" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="1156.45" y="447.5" ></text>
</g>
<g >
<title>ahci_single_level_irq_intr (142,531,340 samples, 1.14%)</title><rect x="984.9" y="341" width="13.4" height="15.0" fill="rgb(213,38,9)" rx="2" ry="2" />
<text x="987.88" y="351.5" ></text>
</g>
<g >
<title>ret_from_fork (2,175,907 samples, 0.02%)</title><rect x="17.1" y="549" width="0.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="20.06" y="559.5" ></text>
</g>
<g >
<title>scsi_end_request (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="293" width="7.1" height="15.0" fill="rgb(254,226,54)" rx="2" ry="2" />
<text x="1009.64" y="303.5" ></text>
</g>
<g >
<title>acpi_ns_evaluate (2,175,907 samples, 0.02%)</title><rect x="17.1" y="453" width="0.2" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="20.06" y="463.5" ></text>
</g>
<g >
<title>memchr_inv (4,743,854 samples, 0.04%)</title><rect x="1168.2" y="421" width="0.4" height="15.0" fill="rgb(240,162,38)" rx="2" ry="2" />
<text x="1171.16" y="431.5" ></text>
</g>
<g >
<title>cpuidle_enter (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="405" width="6.8" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="1186.23" y="415.5" ></text>
</g>
<g >
<title>quiet_vmstat (13,353,046 samples, 0.11%)</title><rect x="1168.2" y="437" width="1.2" height="15.0" fill="rgb(249,204,48)" rx="2" ry="2" />
<text x="1171.16" y="447.5" ></text>
</g>
<g >
<title>sock_wfree (2,157,071 samples, 0.02%)</title><rect x="957.0" y="245" width="0.2" height="15.0" fill="rgb(240,162,38)" rx="2" ry="2" />
<text x="959.99" y="255.5" ></text>
</g>
<g >
<title>perf_mmap__push (83,576,078 samples, 0.67%)</title><rect x="66.0" y="469" width="7.9" height="15.0" fill="rgb(248,201,48)" rx="2" ry="2" />
<text x="68.97" y="479.5" ></text>
</g>
<g >
<title>dequeue_task_fair (5,781,440 samples, 0.05%)</title><rect x="961.0" y="261" width="0.5" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="963.99" y="271.5" ></text>
</g>
<g >
<title>memcg_slab_free_hook (8,776,558 samples, 0.07%)</title><rect x="955.7" y="229" width="0.9" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="958.75" y="239.5" ></text>
</g>
<g >
<title>__blk_mq_do_dispatch_sched (68,734,431 samples, 0.55%)</title><rect x="352.9" y="165" width="6.5" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="355.89" y="175.5" ></text>
</g>
<g >
<title>do_syscall_64 (180,905,682 samples, 1.45%)</title><rect x="73.9" y="501" width="17.0" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="76.86" y="511.5" ></text>
</g>
<g >
<title>get_mem_cgroup_from_mm (15,778,598 samples, 0.13%)</title><rect x="343.8" y="293" width="1.5" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="346.85" y="303.5" ></text>
</g>
<g >
<title>try_to_wake_up (20,967,575 samples, 0.17%)</title><rect x="387.8" y="341" width="2.0" height="15.0" fill="rgb(220,70,16)" rx="2" ry="2" />
<text x="390.78" y="351.5" ></text>
</g>
<g >
<title>sched_setaffinity (21,540,849 samples, 0.17%)</title><rect x="63.2" y="389" width="2.1" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="66.25" y="399.5" ></text>
</g>
<g >
<title>submit_bio_noacct (31,205,561 samples, 0.25%)</title><rect x="366.4" y="277" width="3.0" height="15.0" fill="rgb(211,30,7)" rx="2" ry="2" />
<text x="369.42" y="287.5" ></text>
</g>
<g >
<title>psi_task_change (2,100,650 samples, 0.02%)</title><rect x="1170.4" y="405" width="0.2" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="1173.41" y="415.5" ></text>
</g>
<g >
<title>__memmove_evex_unaligned_erms (6,479,341 samples, 0.05%)</title><rect x="958.9" y="437" width="0.6" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="961.87" y="447.5" ></text>
</g>
<g >
<title>__x64_sys_sched_setaffinity (23,890,946 samples, 0.19%)</title><rect x="63.0" y="405" width="2.3" height="15.0" fill="rgb(227,104,25)" rx="2" ry="2" />
<text x="66.03" y="415.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (2,114,728 samples, 0.02%)</title><rect x="963.5" y="261" width="0.2" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="966.46" y="271.5" ></text>
</g>
<g >
<title>ExecScan (34,016,958 samples, 0.27%)</title><rect x="142.6" y="533" width="3.2" height="15.0" fill="rgb(237,150,36)" rx="2" ry="2" />
<text x="145.63" y="543.5" ></text>
</g>
<g >
<title>free_pcppages_bulk (1,584,209 samples, 0.01%)</title><rect x="955.1" y="197" width="0.2" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="958.13" y="207.5" ></text>
</g>
<g >
<title>ret_from_fork (12,779,039 samples, 0.10%)</title><rect x="43.3" y="549" width="1.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="46.30" y="559.5" ></text>
</g>
<g >
<title>mempool_alloc (16,299,198 samples, 0.13%)</title><rect x="364.9" y="277" width="1.5" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="367.89" y="287.5" ></text>
</g>
<g >
<title>find_vma (1,765,488 samples, 0.01%)</title><rect x="951.5" y="357" width="0.2" height="15.0" fill="rgb(213,37,8)" rx="2" ry="2" />
<text x="954.53" y="367.5" ></text>
</g>
<g >
<title>__schedule (16,366,903 samples, 0.13%)</title><rect x="960.8" y="293" width="1.6" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="963.81" y="303.5" ></text>
</g>
<g >
<title>free_unref_page_commit.constprop.0 (1,584,209 samples, 0.01%)</title><rect x="955.1" y="213" width="0.2" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="958.13" y="223.5" ></text>
</g>
<g >
<title>proc_reg_read_iter (74,888,722 samples, 0.60%)</title><rect x="10.0" y="421" width="7.1" height="15.0" fill="rgb(223,86,20)" rx="2" ry="2" />
<text x="13.00" y="431.5" ></text>
</g>
<g >
<title>handle_edge_irq (230,629,896 samples, 1.84%)</title><rect x="984.9" y="389" width="21.7" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="987.88" y="399.5" >h..</text>
</g>
<g >
<title>scsi_dispatch_cmd (31,667,067 samples, 0.25%)</title><rect x="354.7" y="117" width="3.0" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="357.74" y="127.5" ></text>
</g>
<g >
<title>skb_release_head_state (6,556,399 samples, 0.05%)</title><rect x="956.6" y="277" width="0.6" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="959.57" y="287.5" ></text>
</g>
<g >
<title>copy_user_enhanced_fast_string (281,202,092 samples, 2.25%)</title><rect x="234.2" y="405" width="26.5" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="237.15" y="415.5" >c..</text>
</g>
<g >
<title>load_balance (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="261" width="6.8" height="15.0" fill="rgb(226,96,23)" rx="2" ry="2" />
<text x="1186.23" y="271.5" ></text>
</g>
<g >
<title>[psql] (247,517,398 samples, 1.98%)</title><rect x="940.9" y="485" width="23.4" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="943.93" y="495.5" >[..</text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (158,736,506 samples, 1.27%)</title><rect x="1013.8" y="437" width="15.0" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="1016.80" y="447.5" ></text>
</g>
<g >
<title>unix_destruct_scm (2,157,071 samples, 0.02%)</title><rect x="957.0" y="261" width="0.2" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="959.99" y="271.5" ></text>
</g>
<g >
<title>write_cache_pages (75,970,062 samples, 0.61%)</title><rect x="29.2" y="341" width="7.2" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="32.22" y="351.5" ></text>
</g>
<g >
<title>get_page_from_freelist (1,936,357 samples, 0.02%)</title><rect x="952.3" y="261" width="0.1" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="955.25" y="271.5" ></text>
</g>
<g >
<title>migration/104 (12,779,039 samples, 0.10%)</title><rect x="43.3" y="565" width="1.2" height="15.0" fill="rgb(231,121,28)" rx="2" ry="2" />
<text x="46.30" y="575.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (29,021,961 samples, 0.23%)</title><rect x="1013.8" y="389" width="2.7" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="1016.80" y="399.5" ></text>
</g>
<g >
<title>__GI___ioctl (95,475,665 samples, 0.76%)</title><rect x="52.1" y="469" width="9.0" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="55.06" y="479.5" ></text>
</g>
<g >
<title>perf_mmap__read_head (83,576,078 samples, 0.67%)</title><rect x="66.0" y="453" width="7.9" height="15.0" fill="rgb(220,72,17)" rx="2" ry="2" />
<text x="68.97" y="463.5" ></text>
</g>
<g >
<title>prepare_task_switch (1,905,408 samples, 0.02%)</title><rect x="1173.8" y="453" width="0.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1176.80" y="463.5" ></text>
</g>
<g >
<title>byteain (4,987,165,096 samples, 39.87%)</title><rect x="391.9" y="549" width="470.4" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="394.86" y="559.5" >byteain</text>
</g>
<g >
<title>sched_setaffinity (1,574,969 samples, 0.01%)</title><rect x="65.4" y="405" width="0.1" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="68.39" y="415.5" ></text>
</g>
<g >
<title>irq_exit_rcu (75,914,127 samples, 0.61%)</title><rect x="1006.6" y="405" width="7.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1009.64" y="415.5" ></text>
</g>
<g >
<title>do_dentry_open (80,096,276 samples, 0.64%)</title><rect x="44.5" y="277" width="7.6" height="15.0" fill="rgb(243,175,41)" rx="2" ry="2" />
<text x="47.50" y="287.5" ></text>
</g>
<g >
<title>unix_stream_read_actor (10,858,396 samples, 0.09%)</title><rect x="957.2" y="293" width="1.0" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="960.19" y="303.5" ></text>
</g>
<g >
<title>acpi_ps_parse_aml (2,175,907 samples, 0.02%)</title><rect x="17.1" y="421" width="0.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="20.06" y="431.5" ></text>
</g>
<g >
<title>process_one_work (2,175,907 samples, 0.02%)</title><rect x="17.1" y="501" width="0.2" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="20.06" y="511.5" ></text>
</g>
<g >
<title>dequeue_entity (2,144,979 samples, 0.02%)</title><rect x="961.0" y="245" width="0.2" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="963.99" y="255.5" ></text>
</g>
<g >
<title>__sys_sendto (20,967,575 samples, 0.17%)</title><rect x="387.8" y="485" width="2.0" height="15.0" fill="rgb(236,146,34)" rx="2" ry="2" />
<text x="390.78" y="495.5" ></text>
</g>
<g >
<title>start_thread (145,676,058 samples, 1.16%)</title><rect x="90.9" y="549" width="13.8" height="15.0" fill="rgb(212,34,8)" rx="2" ry="2" />
<text x="93.92" y="559.5" ></text>
</g>
<g >
<title>blk_mq_flush_plug_list (17,457,373 samples, 0.14%)</title><rect x="359.4" y="309" width="1.6" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="362.37" y="319.5" ></text>
</g>
<g >
<title>printtup_startup (129,072,850 samples, 1.03%)</title><rect x="206.7" y="533" width="12.2" height="15.0" fill="rgb(239,156,37)" rx="2" ry="2" />
<text x="209.68" y="543.5" ></text>
</g>
<g >
<title>page_counter_uncharge (1,696,849 samples, 0.01%)</title><rect x="956.4" y="133" width="0.2" height="15.0" fill="rgb(248,199,47)" rx="2" ry="2" />
<text x="959.41" y="143.5" ></text>
</g>
<g >
<title>evlist_cpu_iterator__next (47,408,765 samples, 0.38%)</title><rect x="61.1" y="469" width="4.4" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="64.06" y="479.5" ></text>
</g>
<g >
<title>path_openat (80,096,276 samples, 0.64%)</title><rect x="44.5" y="325" width="7.6" height="15.0" fill="rgb(249,202,48)" rx="2" ry="2" />
<text x="47.50" y="335.5" ></text>
</g>
<g >
<title>do_syscall_64 (93,981,159 samples, 0.75%)</title><rect x="52.1" y="437" width="8.9" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="55.13" y="447.5" ></text>
</g>
<g >
<title>remote_function (45,033,678 samples, 0.36%)</title><rect x="56.1" y="293" width="4.2" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="59.06" y="303.5" ></text>
</g>
<g >
<title>__handle_mm_fault (11,880,266 samples, 0.09%)</title><rect x="951.7" y="341" width="1.1" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="954.69" y="351.5" ></text>
</g>
<g >
<title>__schedule (39,589,747 samples, 0.32%)</title><rect x="1171.9" y="469" width="3.7" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="1174.86" y="479.5" ></text>
</g>
<g >
<title>__poll (50,835,272 samples, 0.41%)</title><rect x="959.5" y="437" width="4.8" height="15.0" fill="rgb(250,208,49)" rx="2" ry="2" />
<text x="962.48" y="447.5" ></text>
</g>
<g >
<title>psi_group_change (2,211,186 samples, 0.02%)</title><rect x="961.9" y="277" width="0.2" height="15.0" fill="rgb(226,101,24)" rx="2" ry="2" />
<text x="964.93" y="287.5" ></text>
</g>
<g >
<title>_raw_spin_lock (15,907,773 samples, 0.13%)</title><rect x="367.9" y="213" width="1.5" height="15.0" fill="rgb(239,160,38)" rx="2" ry="2" />
<text x="370.87" y="223.5" ></text>
</g>
<g >
<title>handle_pte_fault (88,590,590 samples, 0.71%)</title><rect x="964.9" y="453" width="8.4" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="967.94" y="463.5" ></text>
</g>
<g >
<title>ttwu_queue_wakelist (20,967,575 samples, 0.17%)</title><rect x="387.8" y="325" width="2.0" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="390.78" y="335.5" ></text>
</g>
<g >
<title>__x64_sys_poll (37,410,990 samples, 0.30%)</title><rect x="959.5" y="389" width="3.5" height="15.0" fill="rgb(245,186,44)" rx="2" ry="2" />
<text x="962.48" y="399.5" ></text>
</g>
<g >
<title>clockevents_program_event (1,110,320 samples, 0.01%)</title><rect x="1168.1" y="389" width="0.1" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="1171.05" y="399.5" ></text>
</g>
<g >
<title>vma_interval_tree_subtree_search (73,212,495 samples, 0.59%)</title><rect x="36.4" y="229" width="6.9" height="15.0" fill="rgb(219,65,15)" rx="2" ry="2" />
<text x="39.39" y="239.5" ></text>
</g>
<g >
<title>__poll (145,676,058 samples, 1.16%)</title><rect x="90.9" y="517" width="13.8" height="15.0" fill="rgb(250,208,49)" rx="2" ry="2" />
<text x="93.92" y="527.5" ></text>
</g>
<g >
<title>ret_from_fork (149,182,557 samples, 1.19%)</title><rect x="29.2" y="549" width="14.1" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="32.22" y="559.5" ></text>
</g>
<g >
<title>event_function_call (77,141,337 samples, 0.62%)</title><rect x="53.1" y="341" width="7.3" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="56.10" y="351.5" ></text>
</g>
<g >
<title>__blk_mq_alloc_request (15,297,788 samples, 0.12%)</title><rect x="366.4" y="229" width="1.5" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="369.42" y="239.5" ></text>
</g>
<g >
<title>ExecScanFetch (16,496,656 samples, 0.13%)</title><rect x="145.8" y="533" width="1.6" height="15.0" fill="rgb(222,80,19)" rx="2" ry="2" />
<text x="148.84" y="543.5" ></text>
</g>
<g >
<title>alloc_cpumask_var (2,350,097 samples, 0.02%)</title><rect x="63.0" y="389" width="0.2" height="15.0" fill="rgb(248,200,48)" rx="2" ry="2" />
<text x="66.03" y="399.5" ></text>
</g>
<g >
<title>do_open_execat (80,096,276 samples, 0.64%)</title><rect x="44.5" y="357" width="7.6" height="15.0" fill="rgb(223,87,20)" rx="2" ry="2" />
<text x="47.50" y="367.5" ></text>
</g>
<g >
<title>x86_64_start_reservations (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="517" width="6.8" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="1186.23" y="527.5" ></text>
</g>
<g >
<title>wb_workfn (75,817,615 samples, 0.61%)</title><rect x="22.1" y="485" width="7.1" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="25.07" y="495.5" ></text>
</g>
<g >
<title>hrtimer_start_range_ns (6,010,445 samples, 0.05%)</title><rect x="1167.6" y="437" width="0.6" height="15.0" fill="rgb(244,179,42)" rx="2" ry="2" />
<text x="1170.59" y="447.5" ></text>
</g>
<g >
<title>mutex_lock (74,160,183 samples, 0.59%)</title><rect x="91.0" y="405" width="7.0" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="93.98" y="415.5" ></text>
</g>
<g >
<title>nohz_run_idle_balance (11,279,403 samples, 0.09%)</title><rect x="1170.8" y="485" width="1.1" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1173.80" y="495.5" ></text>
</g>
<g >
<title>page_mapping (1,941,267 samples, 0.02%)</title><rect x="973.1" y="373" width="0.2" height="15.0" fill="rgb(249,206,49)" rx="2" ry="2" />
<text x="976.12" y="383.5" ></text>
</g>
<g >
<title>hash_search_with_hash_value (55,582,072 samples, 0.44%)</title><rect x="865.8" y="549" width="5.2" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="868.78" y="559.5" ></text>
</g>
<g >
<title>_nohz_idle_balance.constprop.0.isra.0 (11,279,403 samples, 0.09%)</title><rect x="1170.8" y="469" width="1.1" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1173.80" y="479.5" ></text>
</g>
<g >
<title>AllocSetReset (17,398,031 samples, 0.14%)</title><rect x="206.7" y="517" width="1.6" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="209.68" y="527.5" ></text>
</g>
<g >
<title>copy_user_enhanced_fast_string (10,858,396 samples, 0.09%)</title><rect x="957.2" y="229" width="1.0" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="960.19" y="239.5" ></text>
</g>
<g >
<title>irqentry_exit (2,028,853 samples, 0.02%)</title><rect x="973.3" y="501" width="0.2" height="15.0" fill="rgb(219,66,15)" rx="2" ry="2" />
<text x="976.30" y="511.5" ></text>
</g>
<g >
<title>update_load_avg (2,144,979 samples, 0.02%)</title><rect x="961.0" y="229" width="0.2" height="15.0" fill="rgb(240,165,39)" rx="2" ry="2" />
<text x="963.99" y="239.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irqsave (85,633,946 samples, 0.68%)</title><rect x="82.8" y="389" width="8.1" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="85.85" y="399.5" ></text>
</g>
<g >
<title>do_sys_poll (145,676,058 samples, 1.16%)</title><rect x="90.9" y="453" width="13.8" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="93.92" y="463.5" ></text>
</g>
<g >
<title>__add_to_page_cache_locked (605,688,650 samples, 4.84%)</title><rect x="289.8" y="325" width="57.1" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="292.81" y="335.5" >__add_..</text>
</g>
<g >
<title>ata_qc_complete_multiple (65,672,873 samples, 0.52%)</title><rect x="992.1" y="293" width="6.2" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="995.13" y="303.5" ></text>
</g>
<g >
<title>mix_interrupt_randomness (90,071,905 samples, 0.72%)</title><rect x="1020.3" y="325" width="8.5" height="15.0" fill="rgb(246,192,45)" rx="2" ry="2" />
<text x="1023.28" y="335.5" ></text>
</g>
<g >
<title>__remove_hrtimer (24,239,389 samples, 0.19%)</title><rect x="1013.8" y="357" width="2.3" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="1016.80" y="367.5" ></text>
</g>
<g >
<title>worker_thread (50,929,006 samples, 0.41%)</title><rect x="17.3" y="517" width="4.8" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="20.27" y="527.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (14,792,943 samples, 0.12%)</title><rect x="158.1" y="485" width="1.4" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="161.12" y="495.5" ></text>
</g>
<g >
<title>perf_cpu_map__idx (2,305,478 samples, 0.02%)</title><rect x="61.5" y="453" width="0.2" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="64.50" y="463.5" ></text>
</g>
<g >
<title>page_counter_cancel (1,477,863 samples, 0.01%)</title><rect x="954.8" y="181" width="0.2" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="957.83" y="191.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (2,761,236 samples, 0.02%)</title><rect x="940.7" y="517" width="0.2" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="943.67" y="527.5" ></text>
</g>
<g >
<title>tick_sched_timer (14,792,943 samples, 0.12%)</title><rect x="158.1" y="421" width="1.4" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="161.12" y="431.5" ></text>
</g>
<g >
<title>evsel__enable_cpu (1,196,516 samples, 0.01%)</title><rect x="65.5" y="469" width="0.1" height="15.0" fill="rgb(223,83,19)" rx="2" ry="2" />
<text x="68.54" y="479.5" ></text>
</g>
<g >
<title>psi_group_change (16,965,660 samples, 0.14%)</title><rect x="1174.0" y="437" width="1.6" height="15.0" fill="rgb(226,101,24)" rx="2" ry="2" />
<text x="1176.99" y="447.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (94,727,321 samples, 0.76%)</title><rect x="52.1" y="453" width="8.9" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="55.06" y="463.5" ></text>
</g>
<g >
<title>free_compound_page (4,752,458 samples, 0.04%)</title><rect x="954.8" y="245" width="0.5" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="957.83" y="255.5" ></text>
</g>
<g >
<title>update_sd_lb_stats.constprop.0 (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="229" width="6.8" height="15.0" fill="rgb(223,86,20)" rx="2" ry="2" />
<text x="1186.23" y="239.5" ></text>
</g>
<g >
<title>pgstat_count_io_op_time (15,360,430 samples, 0.12%)</title><rect x="934.3" y="549" width="1.4" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="937.29" y="559.5" ></text>
</g>
<g >
<title>blk_finish_plug (86,191,804 samples, 0.69%)</title><rect x="352.9" y="325" width="8.1" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="355.89" y="335.5" ></text>
</g>
<g >
<title>do_user_addr_fault (15,385,851 samples, 0.12%)</title><rect x="951.5" y="373" width="1.5" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="954.53" y="383.5" ></text>
</g>
<g >
<title>sg_alloc_table_chained (17,372,145 samples, 0.14%)</title><rect x="357.7" y="53" width="1.7" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="360.73" y="63.5" ></text>
</g>
<g >
<title>clockevents_program_event (2,639,585 samples, 0.02%)</title><rect x="1167.8" y="389" width="0.3" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="1170.80" y="399.5" ></text>
</g>
<g >
<title>syscall_exit_to_user_mode (9,107,583 samples, 0.07%)</title><rect x="963.2" y="389" width="0.9" height="15.0" fill="rgb(251,211,50)" rx="2" ry="2" />
<text x="966.24" y="399.5" ></text>
</g>
<g >
<title>ext4_io_submit (75,817,615 samples, 0.61%)</title><rect x="22.1" y="357" width="7.1" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="25.07" y="367.5" ></text>
</g>
<g >
<title>idle_cpu (2,442,574 samples, 0.02%)</title><rect x="1171.6" y="453" width="0.3" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="1174.63" y="463.5" ></text>
</g>
<g >
<title>sock_sendmsg (20,967,575 samples, 0.17%)</title><rect x="387.8" y="469" width="2.0" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="390.78" y="479.5" ></text>
</g>
<g >
<title>__softirqentry_text_start (1,740,097 samples, 0.01%)</title><rect x="952.8" y="277" width="0.2" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="955.81" y="287.5" ></text>
</g>
<g >
<title>__next_zones_zonelist (1,860,044 samples, 0.01%)</title><rect x="952.4" y="277" width="0.2" height="15.0" fill="rgb(223,86,20)" rx="2" ry="2" />
<text x="955.44" y="287.5" ></text>
</g>
<g >
<title>stop_one_cpu_nowait (11,868,689 samples, 0.09%)</title><rect x="63.2" y="325" width="1.2" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="66.25" y="335.5" ></text>
</g>
<g >
<title>__alloc_pages (102,668,270 samples, 0.82%)</title><rect x="276.7" y="309" width="9.7" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="279.72" y="319.5" ></text>
</g>
<g >
<title>__send (20,967,575 samples, 0.17%)</title><rect x="387.8" y="549" width="2.0" height="15.0" fill="rgb(234,134,32)" rx="2" ry="2" />
<text x="390.78" y="559.5" ></text>
</g>
<g >
<title>__libc_start_call_main (80,096,276 samples, 0.64%)</title><rect x="44.5" y="549" width="7.6" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="47.50" y="559.5" ></text>
</g>
<g >
<title>do_syscall_64 (2,761,236 samples, 0.02%)</title><rect x="940.7" y="501" width="0.2" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="943.67" y="511.5" ></text>
</g>
<g >
<title>update_process_times (2,114,728 samples, 0.02%)</title><rect x="963.5" y="197" width="0.2" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="966.46" y="207.5" ></text>
</g>
<g >
<title>[psql] (247,517,398 samples, 1.98%)</title><rect x="940.9" y="533" width="23.4" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="943.93" y="543.5" >[..</text>
</g>
<g >
<title>irq_exit_rcu (1,740,097 samples, 0.01%)</title><rect x="952.8" y="293" width="0.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="955.81" y="303.5" ></text>
</g>
<g >
<title>kmalloc_slab (1,146,988 samples, 0.01%)</title><rect x="63.1" y="341" width="0.1" height="15.0" fill="rgb(225,95,22)" rx="2" ry="2" />
<text x="66.14" y="351.5" ></text>
</g>
<g >
<title>kthread (50,929,006 samples, 0.41%)</title><rect x="17.3" y="533" width="4.8" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="20.27" y="543.5" ></text>
</g>
<g >
<title>[libpq.so.5.14] (247,517,398 samples, 1.98%)</title><rect x="940.9" y="469" width="23.4" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="943.93" y="479.5" >[..</text>
</g>
<g >
<title>__fdget (1,079,001 samples, 0.01%)</title><rect x="52.2" y="405" width="0.1" height="15.0" fill="rgb(214,41,10)" rx="2" ry="2" />
<text x="55.24" y="415.5" ></text>
</g>
<g >
<title>do_sys_poll (37,410,990 samples, 0.30%)</title><rect x="959.5" y="373" width="3.5" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="962.48" y="383.5" ></text>
</g>
<g >
<title>exit_to_user_mode_prepare (9,107,583 samples, 0.07%)</title><rect x="963.2" y="373" width="0.9" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="966.24" y="383.5" ></text>
</g>
<g >
<title>__next_zones_zonelist (1,991,086 samples, 0.02%)</title><rect x="972.9" y="405" width="0.2" height="15.0" fill="rgb(223,86,20)" rx="2" ry="2" />
<text x="975.93" y="415.5" ></text>
</g>
<g >
<title>mutex_lock (2,315,083 samples, 0.02%)</title><rect x="60.4" y="357" width="0.2" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="63.37" y="367.5" ></text>
</g>
<g >
<title>sd_init_command (17,372,145 samples, 0.14%)</title><rect x="357.7" y="101" width="1.7" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="360.73" y="111.5" ></text>
</g>
<g >
<title>rmap_walk_file (73,212,495 samples, 0.59%)</title><rect x="36.4" y="261" width="6.9" height="15.0" fill="rgb(223,87,20)" rx="2" ry="2" />
<text x="39.39" y="271.5" ></text>
</g>
<g >
<title>ext4_file_read_iter (1,624,972,568 samples, 12.99%)</title><rect x="225.7" y="453" width="153.3" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="228.75" y="463.5" >ext4_file_read_iter</text>
</g>
<g >
<title>__x64_sys_recvfrom (1,974,040 samples, 0.02%)</title><rect x="953.9" y="389" width="0.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="956.85" y="399.5" ></text>
</g>
<g >
<title>cpu_stop_queue_work (11,868,689 samples, 0.09%)</title><rect x="63.2" y="309" width="1.2" height="15.0" fill="rgb(210,27,6)" rx="2" ry="2" />
<text x="66.25" y="319.5" ></text>
</g>
<g >
<title>_int_malloc (101,628,325 samples, 0.81%)</title><rect x="964.3" y="549" width="9.6" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="967.28" y="559.5" ></text>
</g>
<g >
<title>unix_stream_read_generic (1,790,304 samples, 0.01%)</title><rect x="954.2" y="325" width="0.2" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="957.22" y="335.5" ></text>
</g>
<g >
<title>ahci_handle_port_interrupt (65,672,873 samples, 0.52%)</title><rect x="992.1" y="309" width="6.2" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="995.13" y="319.5" ></text>
</g>
<g >
<title>send_call_function_single_ipi (20,967,575 samples, 0.17%)</title><rect x="387.8" y="293" width="2.0" height="15.0" fill="rgb(247,197,47)" rx="2" ry="2" />
<text x="390.78" y="303.5" ></text>
</g>
<g >
<title>run_rebalance_domains (1,740,097 samples, 0.01%)</title><rect x="952.8" y="261" width="0.2" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="955.81" y="271.5" ></text>
</g>
<g >
<title>__mem_cgroup_charge (66,110,975 samples, 0.53%)</title><rect x="339.1" y="309" width="6.2" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="342.10" y="319.5" ></text>
</g>
<g >
<title>__irqentry_text_end (1,944,598 samples, 0.02%)</title><rect x="964.6" y="533" width="0.1" height="15.0" fill="rgb(248,199,47)" rx="2" ry="2" />
<text x="967.55" y="543.5" ></text>
</g>
<g >
<title>wait_for_completion (9,672,160 samples, 0.08%)</title><rect x="64.4" y="325" width="0.9" height="15.0" fill="rgb(211,28,6)" rx="2" ry="2" />
<text x="67.37" y="335.5" ></text>
</g>
<g >
<title>free_unref_page (3,274,595 samples, 0.03%)</title><rect x="955.0" y="229" width="0.3" height="15.0" fill="rgb(253,224,53)" rx="2" ry="2" />
<text x="957.97" y="239.5" ></text>
</g>
<g >
<title>TerminateBufferIO (20,953,788 samples, 0.17%)</title><rect x="126.8" y="549" width="2.0" height="15.0" fill="rgb(239,160,38)" rx="2" ry="2" />
<text x="129.83" y="559.5" ></text>
</g>
<g >
<title>sched_clock (2,165,528 samples, 0.02%)</title><rect x="1153.5" y="421" width="0.2" height="15.0" fill="rgb(212,33,7)" rx="2" ry="2" />
<text x="1156.45" y="431.5" ></text>
</g>
<g >
<title>__libc_pread (1,697,979,625 samples, 13.57%)</title><rect x="218.9" y="549" width="160.1" height="15.0" fill="rgb(238,156,37)" rx="2" ry="2" />
<text x="221.86" y="559.5" >__libc_pread</text>
</g>
<g >
<title>do_anonymous_page (10,417,895 samples, 0.08%)</title><rect x="951.8" y="309" width="1.0" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="954.83" y="319.5" ></text>
</g>
<g >
<title>writeback_sb_inodes (149,182,557 samples, 1.19%)</title><rect x="29.2" y="421" width="14.1" height="15.0" fill="rgb(237,148,35)" rx="2" ry="2" />
<text x="32.22" y="431.5" ></text>
</g>
<g >
<title>affine_move_task (21,540,849 samples, 0.17%)</title><rect x="63.2" y="341" width="2.1" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="66.25" y="351.5" ></text>
</g>
<g >
<title>LWLockAttemptLock (36,902,533 samples, 0.29%)</title><rect x="110.5" y="549" width="3.5" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="113.49" y="559.5" ></text>
</g>
<g >
<title>__evlist__enable.constprop.0 (144,080,946 samples, 1.15%)</title><rect x="52.1" y="485" width="13.5" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="55.06" y="495.5" ></text>
</g>
<g >
<title>pick_next_task_fair (18,232,637 samples, 0.15%)</title><rect x="1171.9" y="437" width="1.7" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="1174.86" y="447.5" ></text>
</g>
<g >
<title>__sys_recvfrom (44,381,223 samples, 0.35%)</title><rect x="954.2" y="357" width="4.2" height="15.0" fill="rgb(247,197,47)" rx="2" ry="2" />
<text x="957.22" y="367.5" ></text>
</g>
<g >
<title>vfs_read (74,888,722 samples, 0.60%)</title><rect x="10.0" y="453" width="7.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="13.00" y="463.5" ></text>
</g>
<g >
<title>obj_cgroup_uncharge (1,696,849 samples, 0.01%)</title><rect x="956.4" y="213" width="0.2" height="15.0" fill="rgb(205,0,0)" rx="2" ry="2" />
<text x="959.41" y="223.5" ></text>
</g>
<g >
<title>irq_chip_ack_parent (88,098,556 samples, 0.70%)</title><rect x="998.3" y="373" width="8.3" height="15.0" fill="rgb(208,15,3)" rx="2" ry="2" />
<text x="1001.33" y="383.5" ></text>
</g>
<g >
<title>enqueue_task (10,317,345 samples, 0.08%)</title><rect x="1169.6" y="421" width="1.0" height="15.0" fill="rgb(230,116,27)" rx="2" ry="2" />
<text x="1172.63" y="431.5" ></text>
</g>
<g >
<title>tick_sched_handle (2,114,728 samples, 0.02%)</title><rect x="963.5" y="213" width="0.2" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="966.46" y="223.5" ></text>
</g>
<g >
<title>arch_call_rest_init (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="485" width="6.8" height="15.0" fill="rgb(238,156,37)" rx="2" ry="2" />
<text x="1186.23" y="495.5" ></text>
</g>
<g >
<title>scsi_alloc_sgtables (17,372,145 samples, 0.14%)</title><rect x="357.7" y="69" width="1.7" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="360.73" y="79.5" ></text>
</g>
<g >
<title>find_vma (83,576,078 samples, 0.67%)</title><rect x="66.0" y="389" width="7.9" height="15.0" fill="rgb(213,37,8)" rx="2" ry="2" />
<text x="68.97" y="399.5" ></text>
</g>
<g >
<title>[unknown] (74,888,722 samples, 0.60%)</title><rect x="10.0" y="549" width="7.1" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="13.00" y="559.5" ></text>
</g>
<g >
<title>scsi_finish_command (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="325" width="7.1" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="1009.64" y="335.5" ></text>
</g>
<g >
<title>pick_next_task_fair (1,524,838 samples, 0.01%)</title><rect x="961.5" y="261" width="0.2" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="964.53" y="271.5" ></text>
</g>
<g >
<title>run_timer_softirq (90,071,905 samples, 0.72%)</title><rect x="1020.3" y="373" width="8.5" height="15.0" fill="rgb(249,206,49)" rx="2" ry="2" />
<text x="1023.28" y="383.5" ></text>
</g>
<g >
<title>heapgettup_pagemode (16,171,385 samples, 0.13%)</title><rect x="203.1" y="533" width="1.6" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="206.13" y="543.5" ></text>
</g>
<g >
<title>page_mapping (17,784,321 samples, 0.14%)</title><rect x="348.3" y="277" width="1.7" height="15.0" fill="rgb(249,206,49)" rx="2" ry="2" />
<text x="351.32" y="287.5" ></text>
</g>
<g >
<title>heapgetpage (15,660,448 samples, 0.13%)</title><rect x="871.0" y="549" width="1.5" height="15.0" fill="rgb(254,225,53)" rx="2" ry="2" />
<text x="874.02" y="559.5" ></text>
</g>
<g >
<title>__set_cpus_allowed_ptr_locked (21,540,849 samples, 0.17%)</title><rect x="63.2" y="357" width="2.1" height="15.0" fill="rgb(215,49,11)" rx="2" ry="2" />
<text x="66.25" y="367.5" ></text>
</g>
<g >
<title>xas_load (13,936,765 samples, 0.11%)</title><rect x="369.4" y="277" width="1.3" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="372.37" y="287.5" ></text>
</g>
<g >
<title>enqueue_task_fair (6,006,388 samples, 0.05%)</title><rect x="1169.8" y="405" width="0.6" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="1172.84" y="415.5" ></text>
</g>
<g >
<title>common_interrupt (306,544,023 samples, 2.45%)</title><rect x="984.9" y="421" width="28.9" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="987.88" y="431.5" >co..</text>
</g>
<g >
<title>postgres (8,862,821,933 samples, 70.85%)</title><rect x="104.7" y="565" width="836.0" height="15.0" fill="rgb(233,131,31)" rx="2" ry="2" />
<text x="107.66" y="575.5" >postgres</text>
</g>
<g >
<title>need_update (2,323,233 samples, 0.02%)</title><rect x="1168.6" y="421" width="0.2" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="1171.61" y="431.5" ></text>
</g>
<g >
<title>__blk_mq_alloc_request (75,817,615 samples, 0.61%)</title><rect x="22.1" y="277" width="7.1" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="25.07" y="287.5" ></text>
</g>
<g >
<title>do_poll.constprop.0 (74,953,870 samples, 0.60%)</title><rect x="90.9" y="437" width="7.1" height="15.0" fill="rgb(236,145,34)" rx="2" ry="2" />
<text x="93.92" y="447.5" ></text>
</g>
<g >
<title>simple_copy_to_iter (10,858,396 samples, 0.09%)</title><rect x="957.2" y="245" width="1.0" height="15.0" fill="rgb(228,110,26)" rx="2" ry="2" />
<text x="960.19" y="255.5" ></text>
</g>
<g >
<title>submit_bio_noacct (75,817,615 samples, 0.61%)</title><rect x="22.1" y="325" width="7.1" height="15.0" fill="rgb(211,30,7)" rx="2" ry="2" />
<text x="25.07" y="335.5" ></text>
</g>
<g >
<title>unix_poll (1,836,780 samples, 0.01%)</title><rect x="962.8" y="325" width="0.2" height="15.0" fill="rgb(244,179,43)" rx="2" ry="2" />
<text x="965.84" y="335.5" ></text>
</g>
<g >
<title>blk_update_request (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="277" width="7.1" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="1009.64" y="287.5" ></text>
</g>
<g >
<title>dequeue_task (5,781,440 samples, 0.05%)</title><rect x="961.0" y="277" width="0.5" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="963.99" y="287.5" ></text>
</g>
<g >
<title>account_system_time (2,114,728 samples, 0.02%)</title><rect x="963.5" y="165" width="0.2" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="966.46" y="175.5" ></text>
</g>
<g >
<title>__x64_sys_sendto (20,967,575 samples, 0.17%)</title><rect x="387.8" y="501" width="2.0" height="15.0" fill="rgb(213,37,8)" rx="2" ry="2" />
<text x="390.78" y="511.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (14,327,766 samples, 0.11%)</title><rect x="860.9" y="517" width="1.4" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="863.94" y="527.5" ></text>
</g>
<g >
<title>blk_mq_flush_plug_list (68,734,431 samples, 0.55%)</title><rect x="352.9" y="293" width="6.5" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="355.89" y="303.5" ></text>
</g>
<g >
<title>__softirqentry_text_start (75,914,127 samples, 0.61%)</title><rect x="1006.6" y="389" width="7.2" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1009.64" y="399.5" ></text>
</g>
<g >
<title>__wake_up_common (20,967,575 samples, 0.17%)</title><rect x="387.8" y="389" width="2.0" height="15.0" fill="rgb(248,197,47)" rx="2" ry="2" />
<text x="390.78" y="399.5" ></text>
</g>
<g >
<title>blk_mq_get_tag (75,817,615 samples, 0.61%)</title><rect x="22.1" y="261" width="7.1" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="25.07" y="271.5" ></text>
</g>
<g >
<title>xas_load (37,950,264 samples, 0.30%)</title><rect x="370.7" y="325" width="3.6" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="373.68" y="335.5" ></text>
</g>
<g >
<title>mutex_unlock (2,319,653 samples, 0.02%)</title><rect x="52.4" y="373" width="0.2" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="55.41" y="383.5" ></text>
</g>
<g >
<title>mwait_idle_with_hints.constprop.0 (1,319,773,215 samples, 10.55%)</title><rect x="1028.8" y="437" width="124.5" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="1031.77" y="447.5" >mwait_idle_with..</text>
</g>
<g >
<title>sched_setaffinity@@GLIBC_2.3.4 (40,490,313 samples, 0.32%)</title><rect x="61.7" y="453" width="3.8" height="15.0" fill="rgb(206,6,1)" rx="2" ry="2" />
<text x="64.72" y="463.5" ></text>
</g>
<g >
<title>swapper (2,291,322,007 samples, 18.32%)</title><rect x="973.9" y="565" width="216.1" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="976.87" y="575.5" >swapper</text>
</g>
<g >
<title>perf_adjust_freq_unthr_context (2,572,932 samples, 0.02%)</title><rect x="1016.3" y="277" width="0.2" height="15.0" fill="rgb(236,146,35)" rx="2" ry="2" />
<text x="1019.28" y="287.5" ></text>
</g>
<g >
<title>kthread (75,817,615 samples, 0.61%)</title><rect x="22.1" y="533" width="7.1" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="25.07" y="543.5" ></text>
</g>
<g >
<title>sched_ttwu_pending (12,352,841 samples, 0.10%)</title><rect x="1169.6" y="453" width="1.2" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="1172.63" y="463.5" ></text>
</g>
<g >
<title>ret_from_fork (50,929,006 samples, 0.41%)</title><rect x="17.3" y="549" width="4.8" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="20.27" y="559.5" ></text>
</g>
<g >
<title>update_blocked_averages (15,968,797 samples, 0.13%)</title><rect x="932.8" y="453" width="1.5" height="15.0" fill="rgb(240,163,38)" rx="2" ry="2" />
<text x="935.79" y="463.5" ></text>
</g>
<g >
<title>tts_buffer_heap_getsomeattrs (52,241,241 samples, 0.42%)</title><rect x="935.7" y="549" width="5.0" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="938.74" y="559.5" ></text>
</g>
<g >
<title>cmd_record (180,905,682 samples, 1.45%)</title><rect x="73.9" y="549" width="17.0" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="76.86" y="559.5" ></text>
</g>
<g >
<title>_raw_spin_lock (1,778,398 samples, 0.01%)</title><rect x="1171.3" y="453" width="0.1" height="15.0" fill="rgb(239,160,38)" rx="2" ry="2" />
<text x="1174.28" y="463.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (180,905,682 samples, 1.45%)</title><rect x="73.9" y="517" width="17.0" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="76.86" y="527.5" ></text>
</g>
<g >
<title>psi_task_switch (15,695,981 samples, 0.13%)</title><rect x="265.2" y="309" width="1.5" height="15.0" fill="rgb(230,118,28)" rx="2" ry="2" />
<text x="268.18" y="319.5" ></text>
</g>
<g >
<title>asm_exc_page_fault (17,472,117 samples, 0.14%)</title><rect x="951.3" y="405" width="1.7" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="954.33" y="415.5" ></text>
</g>
<g >
<title>enqueue_hrtimer (73,555,326 samples, 0.59%)</title><rect x="1176.1" y="421" width="6.9" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="1179.05" y="431.5" ></text>
</g>
<g >
<title>tick_nohz_idle_stop_tick (19,363,491 samples, 0.15%)</title><rect x="1167.6" y="469" width="1.8" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="1170.59" y="479.5" ></text>
</g>
<g >
<title>skb_free_head (13,737,732 samples, 0.11%)</title><rect x="955.3" y="261" width="1.3" height="15.0" fill="rgb(221,77,18)" rx="2" ry="2" />
<text x="958.28" y="271.5" ></text>
</g>
<g >
<title>__kmalloc_node (1,146,988 samples, 0.01%)</title><rect x="63.1" y="357" width="0.1" height="15.0" fill="rgb(222,80,19)" rx="2" ry="2" />
<text x="66.14" y="367.5" ></text>
</g>
<g >
<title>irq_exit_rcu (126,866,520 samples, 1.01%)</title><rect x="1016.8" y="405" width="12.0" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1019.80" y="415.5" ></text>
</g>
<g >
<title>__mod_lruvec_state (2,128,259 samples, 0.02%)</title><rect x="952.6" y="229" width="0.2" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="955.61" y="239.5" ></text>
</g>
<g >
<title>irq_exit_rcu (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="341" width="6.8" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1186.23" y="351.5" ></text>
</g>
<g >
<title>handle_mm_fault (88,590,590 samples, 0.71%)</title><rect x="964.9" y="485" width="8.4" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="967.94" y="495.5" ></text>
</g>
<g >
<title>scsi_io_completion (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="309" width="7.1" height="15.0" fill="rgb(247,196,46)" rx="2" ry="2" />
<text x="1009.64" y="319.5" ></text>
</g>
<g >
<title>tick_nohz_idle_exit (78,402,317 samples, 0.63%)</title><rect x="1175.6" y="485" width="7.4" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="1178.59" y="495.5" ></text>
</g>
<g >
<title>do_open (80,096,276 samples, 0.64%)</title><rect x="44.5" y="309" width="7.6" height="15.0" fill="rgb(233,129,31)" rx="2" ry="2" />
<text x="47.50" y="319.5" ></text>
</g>
<g >
<title>submit_bio (75,817,615 samples, 0.61%)</title><rect x="22.1" y="341" width="7.1" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="25.07" y="351.5" ></text>
</g>
<g >
<title>tick_irq_enter (2,848,025 samples, 0.02%)</title><rect x="1016.5" y="389" width="0.3" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="1019.54" y="399.5" ></text>
</g>
<g >
<title>acpi_ns_search_one_scope (2,175,907 samples, 0.02%)</title><rect x="17.1" y="293" width="0.2" height="15.0" fill="rgb(212,32,7)" rx="2" ry="2" />
<text x="20.06" y="303.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irqsave (90,071,905 samples, 0.72%)</title><rect x="1020.3" y="309" width="8.5" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="1023.28" y="319.5" ></text>
</g>
<g >
<title>filemap_read (1,608,208,968 samples, 12.86%)</title><rect x="227.3" y="421" width="151.7" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="230.33" y="431.5" >filemap_read</text>
</g>
<g >
<title>__pollwait (2,257,736 samples, 0.02%)</title><rect x="962.6" y="325" width="0.2" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="965.63" y="335.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (14,327,766 samples, 0.11%)</title><rect x="860.9" y="533" width="1.4" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="863.94" y="543.5" ></text>
</g>
<g >
<title>sock_def_readable (20,967,575 samples, 0.17%)</title><rect x="387.8" y="437" width="2.0" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="390.78" y="447.5" ></text>
</g>
<g >
<title>do_syscall_64 (80,096,276 samples, 0.64%)</title><rect x="44.5" y="437" width="7.6" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="47.50" y="447.5" ></text>
</g>
<g >
<title>__mem_cgroup_charge (4,504,858 samples, 0.04%)</title><rect x="971.9" y="421" width="0.5" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="974.93" y="431.5" ></text>
</g>
<g >
<title>kworker/u290:3- (149,182,557 samples, 1.19%)</title><rect x="29.2" y="565" width="14.1" height="15.0" fill="rgb(214,43,10)" rx="2" ry="2" />
<text x="32.22" y="575.5" ></text>
</g>
<g >
<title>pq_endmessage_reuse (15,624,561 samples, 0.12%)</title><rect x="213.5" y="517" width="1.4" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="216.45" y="527.5" ></text>
</g>
<g >
<title>x86_pmu_enable (11,591,191 samples, 0.09%)</title><rect x="59.2" y="229" width="1.1" height="15.0" fill="rgb(244,179,43)" rx="2" ry="2" />
<text x="62.21" y="239.5" ></text>
</g>
<g >
<title>acpi_ns_lookup (2,175,907 samples, 0.02%)</title><rect x="17.1" y="341" width="0.2" height="15.0" fill="rgb(212,34,8)" rx="2" ry="2" />
<text x="20.06" y="351.5" ></text>
</g>
<g >
<title>exc_page_fault (83,576,078 samples, 0.67%)</title><rect x="66.0" y="421" width="7.9" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="68.97" y="431.5" ></text>
</g>
<g >
<title>__wait_for_common (9,672,160 samples, 0.08%)</title><rect x="64.4" y="309" width="0.9" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="67.37" y="319.5" ></text>
</g>
<g >
<title>__sched_setaffinity (21,540,849 samples, 0.17%)</title><rect x="63.2" y="373" width="2.1" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="66.25" y="383.5" ></text>
</g>
<g >
<title>unix_stream_sendmsg (20,967,575 samples, 0.17%)</title><rect x="387.8" y="453" width="2.0" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="390.78" y="463.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (74,888,722 samples, 0.60%)</title><rect x="10.0" y="517" width="7.1" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="13.00" y="527.5" ></text>
</g>
<g >
<title>lru_cache_add_inactive_or_unevictable (2,128,259 samples, 0.02%)</title><rect x="952.6" y="293" width="0.2" height="15.0" fill="rgb(247,196,47)" rx="2" ry="2" />
<text x="955.61" y="303.5" ></text>
</g>
<g >
<title>blk_mq_submit_bio (75,817,615 samples, 0.61%)</title><rect x="22.1" y="293" width="7.1" height="15.0" fill="rgb(224,89,21)" rx="2" ry="2" />
<text x="25.07" y="303.5" ></text>
</g>
<g >
<title>raw_spin_rq_lock_nested (2,262,347 samples, 0.02%)</title><rect x="1169.4" y="453" width="0.2" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1172.42" y="463.5" ></text>
</g>
<g >
<title>__hrtimer_start_range_ns (73,555,326 samples, 0.59%)</title><rect x="1176.1" y="437" width="6.9" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="1179.05" y="447.5" ></text>
</g>
<g >
<title>update_process_times (4,782,572 samples, 0.04%)</title><rect x="1016.1" y="325" width="0.4" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1019.08" y="335.5" ></text>
</g>
<g >
<title>set_next_entity (2,293,567 samples, 0.02%)</title><rect x="1173.6" y="437" width="0.2" height="15.0" fill="rgb(232,125,29)" rx="2" ry="2" />
<text x="1176.58" y="447.5" ></text>
</g>
<g >
<title>tick_nohz_restart (78,402,317 samples, 0.63%)</title><rect x="1175.6" y="469" width="7.4" height="15.0" fill="rgb(246,191,45)" rx="2" ry="2" />
<text x="1178.59" y="479.5" ></text>
</g>
<g >
<title>ReadBuffer_common (16,587,188 samples, 0.13%)</title><rect x="122.0" y="549" width="1.6" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="125.03" y="559.5" ></text>
</g>
<g >
<title>page_cache_async_ra (1,140,737,731 samples, 9.12%)</title><rect x="266.7" y="389" width="107.6" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="269.66" y="399.5" >page_cache_as..</text>
</g>
<g >
<title>__x64_sys_execve (80,096,276 samples, 0.64%)</title><rect x="44.5" y="421" width="7.6" height="15.0" fill="rgb(247,193,46)" rx="2" ry="2" />
<text x="47.50" y="431.5" ></text>
</g>
<g >
<title>do_brk_flags (2,761,236 samples, 0.02%)</title><rect x="940.7" y="453" width="0.2" height="15.0" fill="rgb(220,73,17)" rx="2" ry="2" />
<text x="943.67" y="463.5" ></text>
</g>
<g >
<title>exc_page_fault (90,619,443 samples, 0.72%)</title><rect x="964.9" y="517" width="8.6" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="967.94" y="527.5" ></text>
</g>
<g >
<title>pollwake (20,967,575 samples, 0.17%)</title><rect x="387.8" y="373" width="2.0" height="15.0" fill="rgb(238,154,37)" rx="2" ry="2" />
<text x="390.78" y="383.5" ></text>
</g>
<g >
<title>pick_next_task (20,526,204 samples, 0.16%)</title><rect x="1171.9" y="453" width="1.9" height="15.0" fill="rgb(206,4,1)" rx="2" ry="2" />
<text x="1174.86" y="463.5" ></text>
</g>
<g >
<title>sd_setup_read_write_cmnd (17,372,145 samples, 0.14%)</title><rect x="357.7" y="85" width="1.7" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="360.73" y="95.5" ></text>
</g>
<g >
<title>update_sd_lb_stats.constprop.0 (36,794,615 samples, 0.29%)</title><rect x="1016.8" y="293" width="3.5" height="15.0" fill="rgb(223,86,20)" rx="2" ry="2" />
<text x="1019.80" y="303.5" ></text>
</g>
<g >
<title>kmem_cache_free (4,714,391 samples, 0.04%)</title><rect x="954.4" y="261" width="0.4" height="15.0" fill="rgb(254,225,53)" rx="2" ry="2" />
<text x="957.38" y="271.5" ></text>
</g>
<g >
<title>psi_group_change (2,100,650 samples, 0.02%)</title><rect x="1170.4" y="389" width="0.2" height="15.0" fill="rgb(226,101,24)" rx="2" ry="2" />
<text x="1173.41" y="399.5" ></text>
</g>
<g >
<title>worker_thread (75,817,615 samples, 0.61%)</title><rect x="22.1" y="517" width="7.1" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="25.07" y="527.5" ></text>
</g>
<g >
<title>update_load_avg (4,288,007 samples, 0.03%)</title><rect x="1170.0" y="373" width="0.4" height="15.0" fill="rgb(240,165,39)" rx="2" ry="2" />
<text x="1173.00" y="383.5" ></text>
</g>
<g >
<title>__sysvec_apic_timer_interrupt (14,792,943 samples, 0.12%)</title><rect x="158.1" y="469" width="1.4" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="161.12" y="479.5" ></text>
</g>
<g >
<title>new_sync_read (74,888,722 samples, 0.60%)</title><rect x="10.0" y="437" width="7.1" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="13.00" y="447.5" ></text>
</g>
<g >
<title>native_write_msr (6,691,320 samples, 0.05%)</title><rect x="59.7" y="197" width="0.6" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="62.67" y="207.5" ></text>
</g>
<g >
<title>[libpq.so.5.14] (55,849,916 samples, 0.45%)</title><rect x="947.7" y="421" width="5.3" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="950.71" y="431.5" ></text>
</g>
<g >
<title>bio_endio (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="261" width="7.1" height="15.0" fill="rgb(217,55,13)" rx="2" ry="2" />
<text x="1009.64" y="271.5" ></text>
</g>
<g >
<title>read_pages (188,650,756 samples, 1.51%)</title><rect x="352.9" y="341" width="17.8" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="355.89" y="351.5" ></text>
</g>
<g >
<title>asm_exc_page_fault (83,576,078 samples, 0.67%)</title><rect x="66.0" y="437" width="7.9" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="68.97" y="447.5" ></text>
</g>
<g >
<title>bio_alloc_bioset (16,299,198 samples, 0.13%)</title><rect x="364.9" y="293" width="1.5" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="367.89" y="303.5" ></text>
</g>
<g >
<title>set_next_task_idle (2,662,771 samples, 0.02%)</title><rect x="961.7" y="245" width="0.2" height="15.0" fill="rgb(205,3,0)" rx="2" ry="2" />
<text x="964.68" y="255.5" ></text>
</g>
<g >
<title>ReleaseBuffer (21,771,294 samples, 0.17%)</title><rect x="147.4" y="533" width="2.1" height="15.0" fill="rgb(220,71,17)" rx="2" ry="2" />
<text x="150.40" y="543.5" ></text>
</g>
<g >
<title>schedule (16,366,903 samples, 0.13%)</title><rect x="960.8" y="309" width="1.6" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="963.81" y="319.5" ></text>
</g>
<g >
<title>rcu_idle_enter (1,634,795 samples, 0.01%)</title><rect x="1153.7" y="453" width="0.1" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="1156.66" y="463.5" ></text>
</g>
<g >
<title>_nohz_idle_balance.constprop.0.isra.0 (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="293" width="6.8" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1186.23" y="303.5" ></text>
</g>
<g >
<title>ReadBufferExtended (14,815,640 samples, 0.12%)</title><rect x="120.6" y="549" width="1.4" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="123.63" y="559.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (145,676,058 samples, 1.16%)</title><rect x="90.9" y="501" width="13.8" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="93.92" y="511.5" ></text>
</g>
<g >
<title>do_idle (2,215,295,891 samples, 17.71%)</title><rect x="974.0" y="501" width="209.0" height="15.0" fill="rgb(212,34,8)" rx="2" ry="2" />
<text x="977.03" y="511.5" >do_idle</text>
</g>
<g >
<title>sock_recvmsg (42,409,680 samples, 0.34%)</title><rect x="954.2" y="341" width="4.0" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="957.22" y="351.5" ></text>
</g>
<g >
<title>account_process_tick (2,114,728 samples, 0.02%)</title><rect x="963.5" y="181" width="0.2" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="966.46" y="191.5" ></text>
</g>
<g >
<title>process_one_work (149,182,557 samples, 1.19%)</title><rect x="29.2" y="501" width="14.1" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="32.22" y="511.5" ></text>
</g>
<g >
<title>rmqueue (48,065,899 samples, 0.38%)</title><rect x="281.9" y="277" width="4.5" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="284.87" y="287.5" ></text>
</g>
<g >
<title>do_anonymous_page (88,590,590 samples, 0.71%)</title><rect x="964.9" y="437" width="8.4" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="967.94" y="447.5" ></text>
</g>
<g >
<title>__blk_mq_run_hw_queue (68,734,431 samples, 0.55%)</title><rect x="352.9" y="229" width="6.5" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="355.89" y="239.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (14,792,943 samples, 0.12%)</title><rect x="158.1" y="501" width="1.4" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="161.12" y="511.5" ></text>
</g>
<g >
<title>do_user_addr_fault (88,590,590 samples, 0.71%)</title><rect x="964.9" y="501" width="8.4" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="967.94" y="511.5" ></text>
</g>
<g >
<title>find_busiest_group (36,794,615 samples, 0.29%)</title><rect x="1016.8" y="309" width="3.5" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="1019.80" y="319.5" ></text>
</g>
<g >
<title>__mod_memcg_lruvec_state (2,128,259 samples, 0.02%)</title><rect x="952.6" y="213" width="0.2" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="955.61" y="223.5" ></text>
</g>
<g >
<title>__writeback_inodes_wb (149,182,557 samples, 1.19%)</title><rect x="29.2" y="437" width="14.1" height="15.0" fill="rgb(234,133,32)" rx="2" ry="2" />
<text x="32.22" y="447.5" ></text>
</g>
<g >
<title>enqueue_task (11,868,689 samples, 0.09%)</title><rect x="63.2" y="245" width="1.2" height="15.0" fill="rgb(230,116,27)" rx="2" ry="2" />
<text x="66.25" y="255.5" ></text>
</g>
<g >
<title>process_one_work (75,817,615 samples, 0.61%)</title><rect x="22.1" y="501" width="7.1" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="25.07" y="511.5" ></text>
</g>
<g >
<title>internal_putbytes (34,871,901 samples, 0.28%)</title><rect x="872.5" y="549" width="3.3" height="15.0" fill="rgb(249,206,49)" rx="2" ry="2" />
<text x="875.50" y="559.5" ></text>
</g>
<g >
<title>__update_load_avg_cfs_rq (1,871,840 samples, 0.01%)</title><rect x="1170.2" y="357" width="0.2" height="15.0" fill="rgb(228,107,25)" rx="2" ry="2" />
<text x="1173.23" y="367.5" ></text>
</g>
<g >
<title>igb_watchdog_task (50,929,006 samples, 0.41%)</title><rect x="17.3" y="485" width="4.8" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="20.27" y="495.5" ></text>
</g>
<g >
<title>AllocSetAlloc (46,057,996 samples, 0.37%)</title><rect x="128.8" y="533" width="4.4" height="15.0" fill="rgb(231,124,29)" rx="2" ry="2" />
<text x="131.81" y="543.5" ></text>
</g>
<g >
<title>int8out (21,524,489 samples, 0.17%)</title><rect x="204.7" y="533" width="2.0" height="15.0" fill="rgb(241,167,40)" rx="2" ry="2" />
<text x="207.65" y="543.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (15,968,797 samples, 0.13%)</title><rect x="932.8" y="517" width="1.5" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="935.79" y="527.5" ></text>
</g>
<g >
<title>PQgetResult (247,517,398 samples, 1.98%)</title><rect x="940.9" y="453" width="23.4" height="15.0" fill="rgb(228,107,25)" rx="2" ry="2" />
<text x="943.93" y="463.5" >P..</text>
</g>
<g >
<title>rb_erase (1,946,455 samples, 0.02%)</title><rect x="1175.9" y="405" width="0.2" height="15.0" fill="rgb(219,66,15)" rx="2" ry="2" />
<text x="1178.87" y="415.5" ></text>
</g>
<g >
<title>generic_file_read_iter (1,624,972,568 samples, 12.99%)</title><rect x="225.7" y="437" width="153.3" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="228.75" y="447.5" >generic_file_read_i..</text>
</g>
<g >
<title>__hrtimer_run_queues (29,021,961 samples, 0.23%)</title><rect x="1013.8" y="373" width="2.7" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="1016.80" y="383.5" ></text>
</g>
<g >
<title>worker_thread (149,182,557 samples, 1.19%)</title><rect x="29.2" y="517" width="14.1" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="32.22" y="527.5" ></text>
</g>
<g >
<title>wb_do_writeback (75,817,615 samples, 0.61%)</title><rect x="22.1" y="469" width="7.1" height="15.0" fill="rgb(245,187,44)" rx="2" ry="2" />
<text x="25.07" y="479.5" ></text>
</g>
<g >
<title>put_pid (4,399,328 samples, 0.04%)</title><rect x="956.6" y="261" width="0.4" height="15.0" fill="rgb(223,84,20)" rx="2" ry="2" />
<text x="959.57" y="271.5" ></text>
</g>
<g >
<title>__smp_call_single_queue (20,967,575 samples, 0.17%)</title><rect x="387.8" y="309" width="2.0" height="15.0" fill="rgb(219,66,15)" rx="2" ry="2" />
<text x="390.78" y="319.5" ></text>
</g>
<g >
<title>__update_idle_core (2,662,771 samples, 0.02%)</title><rect x="961.7" y="229" width="0.2" height="15.0" fill="rgb(235,139,33)" rx="2" ry="2" />
<text x="964.68" y="239.5" ></text>
</g>
<g >
<title>__update_load_avg_se (1,718,381 samples, 0.01%)</title><rect x="1169.8" y="373" width="0.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1172.84" y="383.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (14,792,943 samples, 0.12%)</title><rect x="158.1" y="437" width="1.4" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="161.12" y="447.5" ></text>
</g>
<g >
<title>ResourceOwnerEnlarge (15,653,995 samples, 0.13%)</title><rect x="123.6" y="549" width="1.5" height="15.0" fill="rgb(223,86,20)" rx="2" ry="2" />
<text x="126.59" y="559.5" ></text>
</g>
<g >
<title>__sysvec_apic_timer_interrupt (2,114,728 samples, 0.02%)</title><rect x="963.5" y="277" width="0.2" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="966.46" y="287.5" ></text>
</g>
<g >
<title>__update_blocked_fair (15,968,797 samples, 0.13%)</title><rect x="932.8" y="437" width="1.5" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="935.79" y="447.5" ></text>
</g>
<g >
<title>kmem_cache_free (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="165" width="7.1" height="15.0" fill="rgb(254,225,53)" rx="2" ry="2" />
<text x="1009.64" y="175.5" ></text>
</g>
<g >
<title>memcg_slab_free_hook (2,352,817 samples, 0.02%)</title><rect x="954.4" y="245" width="0.2" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="957.38" y="255.5" ></text>
</g>
<g >
<title>exc_page_fault (15,385,851 samples, 0.12%)</title><rect x="951.5" y="389" width="1.5" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="954.53" y="399.5" ></text>
</g>
<g >
<title>[psql] (247,517,398 samples, 1.98%)</title><rect x="940.9" y="517" width="23.4" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="943.93" y="527.5" >[..</text>
</g>
<g >
<title>__pagevec_lru_add (1,941,267 samples, 0.02%)</title><rect x="973.1" y="389" width="0.2" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="976.12" y="399.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irqsave (18,391,962 samples, 0.15%)</title><rect x="1159.2" y="373" width="1.7" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="1162.20" y="383.5" ></text>
</g>
<g >
<title>run_rebalance_domains (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="309" width="6.8" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="1186.23" y="319.5" ></text>
</g>
<g >
<title>run_builtin (80,096,276 samples, 0.64%)</title><rect x="44.5" y="517" width="7.6" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="47.50" y="527.5" ></text>
</g>
<g >
<title>rebalance_domains (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="277" width="6.8" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="1186.23" y="287.5" ></text>
</g>
<g >
<title>update_blocked_averages (1,740,097 samples, 0.01%)</title><rect x="952.8" y="245" width="0.2" height="15.0" fill="rgb(240,163,38)" rx="2" ry="2" />
<text x="955.81" y="255.5" ></text>
</g>
<g >
<title>__memmove_evex_unaligned_erms (95,283,100 samples, 0.76%)</title><rect x="192.6" y="533" width="9.0" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="195.58" y="543.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (158,736,506 samples, 1.27%)</title><rect x="1013.8" y="421" width="15.0" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="1016.80" y="431.5" ></text>
</g>
<g >
<title>__softirqentry_text_start (15,968,797 samples, 0.13%)</title><rect x="932.8" y="485" width="1.5" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="935.79" y="495.5" ></text>
</g>
<g >
<title>[libpq.so.5.14] (190,202,785 samples, 1.52%)</title><rect x="940.9" y="437" width="18.0" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="943.93" y="447.5" ></text>
</g>
<g >
<title>hrtimer_start_range_ns (73,555,326 samples, 0.59%)</title><rect x="1176.1" y="453" width="6.9" height="15.0" fill="rgb(244,179,42)" rx="2" ry="2" />
<text x="1179.05" y="463.5" ></text>
</g>
<g >
<title>__read_end_io (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="229" width="7.1" height="15.0" fill="rgb(242,174,41)" rx="2" ry="2" />
<text x="1009.64" y="239.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="373" width="6.8" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="1186.23" y="383.5" ></text>
</g>
<g >
<title>rmqueue (4,215,817 samples, 0.03%)</title><rect x="972.5" y="373" width="0.4" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="975.53" y="383.5" ></text>
</g>
<g >
<title>__put_user_nocheck_8 (4,396,828 samples, 0.04%)</title><rect x="963.2" y="325" width="0.5" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="966.24" y="335.5" ></text>
</g>
<g >
<title>__pollwait (180,905,682 samples, 1.45%)</title><rect x="73.9" y="421" width="17.0" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="76.86" y="431.5" ></text>
</g>
<g >
<title>__switch_to (1,710,689 samples, 0.01%)</title><rect x="973.9" y="549" width="0.1" height="15.0" fill="rgb(205,2,0)" rx="2" ry="2" />
<text x="976.87" y="559.5" ></text>
</g>
<g >
<title>rcu_read_unlock_strict (15,071,823 samples, 0.12%)</title><rect x="351.5" y="341" width="1.4" height="15.0" fill="rgb(221,76,18)" rx="2" ry="2" />
<text x="354.46" y="351.5" ></text>
</g>
<g >
<title>xa_load (37,950,264 samples, 0.30%)</title><rect x="370.7" y="341" width="3.6" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="373.68" y="351.5" ></text>
</g>
<g >
<title>skb_release_data (18,490,190 samples, 0.15%)</title><rect x="954.8" y="277" width="1.8" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="957.83" y="287.5" ></text>
</g>
<g >
<title>try_to_wake_up (11,868,689 samples, 0.09%)</title><rect x="63.2" y="277" width="1.2" height="15.0" fill="rgb(220,70,16)" rx="2" ry="2" />
<text x="66.25" y="287.5" ></text>
</g>
<g >
<title>wb_do_writeback (149,182,557 samples, 1.19%)</title><rect x="29.2" y="469" width="14.1" height="15.0" fill="rgb(245,187,44)" rx="2" ry="2" />
<text x="32.22" y="479.5" ></text>
</g>
<g >
<title>acpi_ps_get_next_namepath (2,175,907 samples, 0.02%)</title><rect x="17.1" y="357" width="0.2" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="20.06" y="367.5" ></text>
</g>
<g >
<title>do_user_addr_fault (83,576,078 samples, 0.67%)</title><rect x="66.0" y="405" width="7.9" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="68.97" y="415.5" ></text>
</g>
<g >
<title>x86_64_start_kernel (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="533" width="6.8" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="1186.23" y="543.5" ></text>
</g>
<g >
<title>__get_user_8 (2,333,335 samples, 0.02%)</title><rect x="963.9" y="309" width="0.2" height="15.0" fill="rgb(242,171,41)" rx="2" ry="2" />
<text x="966.88" y="319.5" ></text>
</g>
<g >
<title>flush_smp_call_function_from_idle (14,615,188 samples, 0.12%)</title><rect x="1169.4" y="485" width="1.4" height="15.0" fill="rgb(225,92,22)" rx="2" ry="2" />
<text x="1172.42" y="495.5" ></text>
</g>
<g >
<title>put_prev_task_fair (2,295,392 samples, 0.02%)</title><rect x="962.1" y="277" width="0.3" height="15.0" fill="rgb(208,17,4)" rx="2" ry="2" />
<text x="965.14" y="287.5" ></text>
</g>
<g >
<title>clear_page_dirty_for_io (73,212,495 samples, 0.59%)</title><rect x="36.4" y="309" width="6.9" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="39.39" y="319.5" ></text>
</g>
<g >
<title>__do_sys_brk (2,761,236 samples, 0.02%)</title><rect x="940.7" y="469" width="0.2" height="15.0" fill="rgb(217,58,13)" rx="2" ry="2" />
<text x="943.67" y="479.5" ></text>
</g>
<g >
<title>kthread (12,779,039 samples, 0.10%)</title><rect x="43.3" y="533" width="1.2" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="46.30" y="543.5" ></text>
</g>
<g >
<title>record__mmap_read_evlist.constprop.0 (83,576,078 samples, 0.67%)</title><rect x="66.0" y="485" width="7.9" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="68.97" y="495.5" ></text>
</g>
<g >
<title>__pagevec_lru_add_fn (2,128,259 samples, 0.02%)</title><rect x="952.6" y="245" width="0.2" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="955.61" y="255.5" ></text>
</g>
<g >
<title>__intel_pmu_enable_all.constprop.0 (1,401,441 samples, 0.01%)</title><rect x="59.5" y="197" width="0.2" height="15.0" fill="rgb(214,42,10)" rx="2" ry="2" />
<text x="62.54" y="207.5" ></text>
</g>
<g >
<title>scsi_prepare_cmd (17,372,145 samples, 0.14%)</title><rect x="357.7" y="117" width="1.7" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="360.73" y="127.5" ></text>
</g>
<g >
<title>perf_ioctl (90,209,356 samples, 0.72%)</title><rect x="52.4" y="405" width="8.5" height="15.0" fill="rgb(246,191,45)" rx="2" ry="2" />
<text x="55.41" y="415.5" ></text>
</g>
<g >
<title>native_apic_msr_eoi_write (88,098,556 samples, 0.70%)</title><rect x="998.3" y="357" width="8.3" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="1001.33" y="367.5" ></text>
</g>
<g >
<title>scsi_queue_rq (68,734,431 samples, 0.55%)</title><rect x="352.9" y="133" width="6.5" height="15.0" fill="rgb(219,67,16)" rx="2" ry="2" />
<text x="355.89" y="143.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irqsave (31,667,067 samples, 0.25%)</title><rect x="354.7" y="85" width="3.0" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="357.74" y="95.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (80,096,276 samples, 0.64%)</title><rect x="44.5" y="453" width="7.6" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="47.50" y="463.5" ></text>
</g>
<g >
<title>textout (16,816,998 samples, 0.13%)</title><rect x="191.0" y="517" width="1.6" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="193.99" y="527.5" ></text>
</g>
<g >
<title>page_counter_uncharge (1,477,863 samples, 0.01%)</title><rect x="954.8" y="197" width="0.2" height="15.0" fill="rgb(248,199,47)" rx="2" ry="2" />
<text x="957.83" y="207.5" ></text>
</g>
<g >
<title>__x64_sys_recvfrom (46,947,504 samples, 0.38%)</title><rect x="954.2" y="373" width="4.4" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="957.22" y="383.5" ></text>
</g>
<g >
<title>do_poll.constprop.0 (180,905,682 samples, 1.45%)</title><rect x="73.9" y="453" width="17.0" height="15.0" fill="rgb(236,145,34)" rx="2" ry="2" />
<text x="76.86" y="463.5" ></text>
</g>
<g >
<title>update_process_times (14,792,943 samples, 0.12%)</title><rect x="158.1" y="389" width="1.4" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="161.12" y="399.5" ></text>
</g>
<g >
<title>__strlen_evex (21,575,433 samples, 0.17%)</title><rect x="209.9" y="517" width="2.0" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="212.87" y="527.5" ></text>
</g>
<g >
<title>event_function (35,446,208 samples, 0.28%)</title><rect x="57.0" y="277" width="3.3" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="59.96" y="287.5" ></text>
</g>
<g >
<title>sock_poll (7,001,085 samples, 0.06%)</title><rect x="962.4" y="341" width="0.6" height="15.0" fill="rgb(223,84,20)" rx="2" ry="2" />
<text x="965.35" y="351.5" ></text>
</g>
<g >
<title>update_curr (3,636,461 samples, 0.03%)</title><rect x="961.2" y="245" width="0.3" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="964.19" y="255.5" ></text>
</g>
<g >
<title>blk_mq_get_tag (15,297,788 samples, 0.12%)</title><rect x="366.4" y="213" width="1.5" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="369.42" y="223.5" ></text>
</g>
<g >
<title>acpi_ps_execute_method (2,175,907 samples, 0.02%)</title><rect x="17.1" y="437" width="0.2" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="20.06" y="447.5" ></text>
</g>
<g >
<title>__memmove_evex_unaligned_erms (92,800,323 samples, 0.74%)</title><rect x="379.0" y="549" width="8.8" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="382.02" y="559.5" ></text>
</g>
<g >
<title>[unknown] (74,888,722 samples, 0.60%)</title><rect x="10.0" y="533" width="7.1" height="15.0" fill="rgb(253,223,53)" rx="2" ry="2" />
<text x="13.00" y="543.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (20,967,575 samples, 0.17%)</title><rect x="387.8" y="533" width="2.0" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="390.78" y="543.5" ></text>
</g>
<g >
<title>get_mem_cgroup_from_mm (2,368,450 samples, 0.02%)</title><rect x="952.0" y="277" width="0.3" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="955.03" y="287.5" ></text>
</g>
<g >
<title>ext4_writepages (73,212,495 samples, 0.59%)</title><rect x="36.4" y="373" width="6.9" height="15.0" fill="rgb(240,165,39)" rx="2" ry="2" />
<text x="39.39" y="383.5" ></text>
</g>
<g >
<title>__sbitmap_queue_get_shallow (75,817,615 samples, 0.61%)</title><rect x="22.1" y="229" width="7.1" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="25.07" y="239.5" ></text>
</g>
<g >
<title>asm_exc_page_fault (2,424,689 samples, 0.02%)</title><rect x="958.6" y="421" width="0.3" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="961.64" y="431.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (2,114,728 samples, 0.02%)</title><rect x="963.5" y="245" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="966.46" y="255.5" ></text>
</g>
<g >
<title>do_syscall_64 (48,940,885 samples, 0.39%)</title><rect x="959.5" y="405" width="4.6" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="962.48" y="415.5" ></text>
</g>
<g >
<title>tick_nohz_next_event (18,391,962 samples, 0.15%)</title><rect x="1159.2" y="421" width="1.7" height="15.0" fill="rgb(211,31,7)" rx="2" ry="2" />
<text x="1162.20" y="431.5" ></text>
</g>
<g >
<title>menu_select (57,272,365 samples, 0.46%)</title><rect x="1155.5" y="453" width="5.4" height="15.0" fill="rgb(242,172,41)" rx="2" ry="2" />
<text x="1158.53" y="463.5" ></text>
</g>
<g >
<title>__rseq_handle_notify_resume (9,107,583 samples, 0.07%)</title><rect x="963.2" y="341" width="0.9" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="966.24" y="351.5" ></text>
</g>
<g >
<title>__pagevec_lru_add (2,128,259 samples, 0.02%)</title><rect x="952.6" y="261" width="0.2" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="955.61" y="271.5" ></text>
</g>
<g >
<title>mpage_end_io (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="245" width="7.1" height="15.0" fill="rgb(209,21,5)" rx="2" ry="2" />
<text x="1009.64" y="255.5" ></text>
</g>
<g >
<title>do_syscall_64 (48,815,791 samples, 0.39%)</title><rect x="954.0" y="389" width="4.6" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="957.04" y="399.5" ></text>
</g>
<g >
<title>new_sync_read (1,624,972,568 samples, 12.99%)</title><rect x="225.7" y="469" width="153.3" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="228.75" y="479.5" >new_sync_read</text>
</g>
<g >
<title>__blk_mq_sched_bio_merge (15,907,773 samples, 0.13%)</title><rect x="367.9" y="229" width="1.5" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="370.87" y="239.5" ></text>
</g>
<g >
<title>cpuidle_select (58,957,768 samples, 0.47%)</title><rect x="1155.4" y="469" width="5.5" height="15.0" fill="rgb(222,79,19)" rx="2" ry="2" />
<text x="1158.37" y="479.5" ></text>
</g>
<g >
<title>unix_stream_read_generic (40,619,376 samples, 0.32%)</title><rect x="954.4" y="309" width="3.8" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="957.38" y="319.5" ></text>
</g>
<g >
<title>kfree_skbmem (4,714,391 samples, 0.04%)</title><rect x="954.4" y="277" width="0.4" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="957.38" y="287.5" ></text>
</g>
<g >
<title>filemap_get_pages (1,204,152,098 samples, 9.63%)</title><rect x="260.7" y="405" width="113.6" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="263.68" y="415.5" >filemap_get_pa..</text>
</g>
<g >
<title>perf_poll (180,905,682 samples, 1.45%)</title><rect x="73.9" y="437" width="17.0" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="76.86" y="447.5" ></text>
</g>
<g >
<title>intel_pmu_enable_all (9,480,695 samples, 0.08%)</title><rect x="59.4" y="213" width="0.9" height="15.0" fill="rgb(205,4,1)" rx="2" ry="2" />
<text x="62.41" y="223.5" ></text>
</g>
<g >
<title>[unknown] (2,761,236 samples, 0.02%)</title><rect x="940.7" y="549" width="0.2" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="943.67" y="559.5" ></text>
</g>
<g >
<title>obj_cgroup_uncharge_pages (1,696,849 samples, 0.01%)</title><rect x="956.4" y="181" width="0.2" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="959.41" y="191.5" ></text>
</g>
<g >
<title>copy_page_to_iter (14,254,258 samples, 0.11%)</title><rect x="232.8" y="405" width="1.4" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="235.81" y="415.5" ></text>
</g>
<g >
<title>__page_cache_alloc (102,668,270 samples, 0.82%)</title><rect x="276.7" y="341" width="9.7" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="279.72" y="351.5" ></text>
</g>
<g >
<title>asm_exc_page_fault (92,791,174 samples, 0.74%)</title><rect x="964.7" y="533" width="8.8" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="967.74" y="543.5" ></text>
</g>
<g >
<title>writeback_sb_inodes (75,817,615 samples, 0.61%)</title><rect x="22.1" y="421" width="7.1" height="15.0" fill="rgb(237,148,35)" rx="2" ry="2" />
<text x="25.07" y="431.5" ></text>
</g>
<g >
<title>pq_sendcountedtext (33,037,469 samples, 0.26%)</title><rect x="187.9" y="517" width="3.1" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="190.88" y="527.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath.part.0 (16,883,997 samples, 0.13%)</title><rect x="356.1" y="53" width="1.6" height="15.0" fill="rgb(248,198,47)" rx="2" ry="2" />
<text x="359.14" y="63.5" ></text>
</g>
<g >
<title>__hrtimer_start_range_ns (4,900,125 samples, 0.04%)</title><rect x="1167.6" y="421" width="0.5" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="1170.59" y="431.5" ></text>
</g>
<g >
<title>__switch_to_asm (12,699,742 samples, 0.10%)</title><rect x="61.8" y="437" width="1.2" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="64.83" y="447.5" ></text>
</g>
<g >
<title>fpregs_assert_state_consistent (2,028,853 samples, 0.02%)</title><rect x="973.3" y="453" width="0.2" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="976.30" y="463.5" ></text>
</g>
<g >
<title>__update_blocked_fair (1,740,097 samples, 0.01%)</title><rect x="952.8" y="229" width="0.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="955.81" y="239.5" ></text>
</g>
<g >
<title>__x64_sys_poll (180,905,682 samples, 1.45%)</title><rect x="73.9" y="485" width="17.0" height="15.0" fill="rgb(245,186,44)" rx="2" ry="2" />
<text x="76.86" y="495.5" ></text>
</g>
<g >
<title>start_kernel (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="501" width="6.8" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1186.23" y="511.5" ></text>
</g>
<g >
<title>worker_thread (2,175,907 samples, 0.02%)</title><rect x="17.1" y="517" width="0.2" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="20.06" y="527.5" ></text>
</g>
<g >
<title>filemap_update_page (15,695,981 samples, 0.13%)</title><rect x="265.2" y="389" width="1.5" height="15.0" fill="rgb(222,82,19)" rx="2" ry="2" />
<text x="268.18" y="399.5" ></text>
</g>
<g >
<title>clear_page_erms (16,789,335 samples, 0.13%)</title><rect x="280.3" y="277" width="1.6" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="283.29" y="287.5" ></text>
</g>
<g >
<title>__x64_sys_read (74,888,722 samples, 0.60%)</title><rect x="10.0" y="485" width="7.1" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="13.00" y="495.5" ></text>
</g>
<g >
<title>native_sched_clock (2,165,528 samples, 0.02%)</title><rect x="1153.5" y="405" width="0.2" height="15.0" fill="rgb(218,64,15)" rx="2" ry="2" />
<text x="1156.45" y="415.5" ></text>
</g>
<g >
<title>__add_to_page_cache_locked (87,838,819 samples, 0.70%)</title><rect x="268.4" y="341" width="8.3" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="271.43" y="351.5" ></text>
</g>
<g >
<title>__softirqentry_text_start (126,866,520 samples, 1.01%)</title><rect x="1016.8" y="389" width="12.0" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1019.80" y="399.5" ></text>
</g>
<g >
<title>rb_next (18,232,637 samples, 0.15%)</title><rect x="1171.9" y="421" width="1.7" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="1174.86" y="431.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64 (50,681,259 samples, 0.41%)</title><rect x="221.0" y="533" width="4.7" height="15.0" fill="rgb(239,156,37)" rx="2" ry="2" />
<text x="223.96" y="543.5" ></text>
</g>
<g >
<title>do_sys_poll (2,422,312 samples, 0.02%)</title><rect x="963.0" y="389" width="0.2" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="966.01" y="399.5" ></text>
</g>
<g >
<title>acpi_ns_search_and_enter (2,175,907 samples, 0.02%)</title><rect x="17.1" y="325" width="0.2" height="15.0" fill="rgb(233,129,31)" rx="2" ry="2" />
<text x="20.06" y="335.5" ></text>
</g>
<g >
<title>hook_file_open (80,096,276 samples, 0.64%)</title><rect x="44.5" y="261" width="7.6" height="15.0" fill="rgb(248,198,47)" rx="2" ry="2" />
<text x="47.50" y="271.5" ></text>
</g>
<g >
<title>wait_on_page_bit_common (15,695,981 samples, 0.13%)</title><rect x="265.2" y="373" width="1.5" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="268.18" y="383.5" ></text>
</g>
<g >
<title>do_execveat_common.isra.0 (80,096,276 samples, 0.64%)</title><rect x="44.5" y="405" width="7.6" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="47.50" y="415.5" ></text>
</g>
<g >
<title>newidle_balance (1,524,838 samples, 0.01%)</title><rect x="961.5" y="245" width="0.2" height="15.0" fill="rgb(207,10,2)" rx="2" ry="2" />
<text x="964.53" y="255.5" ></text>
</g>
<g >
<title>rcu_eqs_exit.constprop.0 (1,997,357 samples, 0.02%)</title><rect x="1153.3" y="421" width="0.2" height="15.0" fill="rgb(209,20,5)" rx="2" ry="2" />
<text x="1156.26" y="431.5" ></text>
</g>
<g >
<title>bprm_execve (80,096,276 samples, 0.64%)</title><rect x="44.5" y="389" width="7.6" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="47.50" y="399.5" ></text>
</g>
<g >
<title>do_syscall_64 (1,624,972,568 samples, 12.99%)</title><rect x="225.7" y="517" width="153.3" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="228.75" y="527.5" >do_syscall_64</text>
</g>
<g >
<title>pglz_decompress (15,972,732 samples, 0.13%)</title><rect x="186.4" y="517" width="1.5" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="189.37" y="527.5" ></text>
</g>
<g >
<title>smpboot_thread_fn (12,779,039 samples, 0.10%)</title><rect x="43.3" y="517" width="1.2" height="15.0" fill="rgb(246,193,46)" rx="2" ry="2" />
<text x="46.30" y="527.5" ></text>
</g>
<g >
<title>sched_clock_cpu (2,035,496 samples, 0.02%)</title><rect x="1170.6" y="421" width="0.2" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="1173.60" y="431.5" ></text>
</g>
<g >
<title>ttwu_do_activate (11,868,689 samples, 0.09%)</title><rect x="63.2" y="261" width="1.2" height="15.0" fill="rgb(215,48,11)" rx="2" ry="2" />
<text x="66.25" y="271.5" ></text>
</g>
<g >
<title>mempool_alloc_slab (16,299,198 samples, 0.13%)</title><rect x="364.9" y="261" width="1.5" height="15.0" fill="rgb(241,167,40)" rx="2" ry="2" />
<text x="367.89" y="271.5" ></text>
</g>
<g >
<title>schedule_hrtimeout_range_clock (18,179,140 samples, 0.15%)</title><rect x="960.6" y="325" width="1.8" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="963.64" y="335.5" ></text>
</g>
<g >
<title>lru_cache_add (47,974,063 samples, 0.38%)</title><rect x="346.9" y="325" width="4.6" height="15.0" fill="rgb(247,196,47)" rx="2" ry="2" />
<text x="349.94" y="335.5" ></text>
</g>
<g >
<title>rest_init (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="469" width="6.8" height="15.0" fill="rgb(252,217,51)" rx="2" ry="2" />
<text x="1186.23" y="479.5" ></text>
</g>
<g >
<title>add_to_page_cache_lru (689,744,154 samples, 5.51%)</title><rect x="286.4" y="341" width="65.1" height="15.0" fill="rgb(236,142,34)" rx="2" ry="2" />
<text x="289.40" y="351.5" >add_to_..</text>
</g>
<g >
<title>cpuidle_idle_call (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="421" width="6.8" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="1186.23" y="431.5" ></text>
</g>
<g >
<title>palloc (22,268,516 samples, 0.18%)</title><rect x="875.8" y="549" width="2.1" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="878.79" y="559.5" ></text>
</g>
<g >
<title>slab_free_freelist_hook.constprop.0 (2,361,574 samples, 0.02%)</title><rect x="954.6" y="245" width="0.2" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="957.61" y="255.5" ></text>
</g>
<g >
<title>__mem_cgroup_charge (2,368,450 samples, 0.02%)</title><rect x="952.0" y="293" width="0.3" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="955.03" y="303.5" ></text>
</g>
<g >
<title>__poll (180,905,682 samples, 1.45%)</title><rect x="73.9" y="533" width="17.0" height="15.0" fill="rgb(250,208,49)" rx="2" ry="2" />
<text x="76.86" y="543.5" ></text>
</g>
<g >
<title>[psql] (247,517,398 samples, 1.98%)</title><rect x="940.9" y="501" width="23.4" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="943.93" y="511.5" >[..</text>
</g>
<g >
<title>tick_sched_timer (14,327,766 samples, 0.11%)</title><rect x="860.9" y="453" width="1.4" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="863.94" y="463.5" ></text>
</g>
<g >
<title>scsi_complete (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="341" width="7.1" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="1009.64" y="351.5" ></text>
</g>
<g >
<title>vma_interval_tree_iter_first (73,212,495 samples, 0.59%)</title><rect x="36.4" y="245" width="6.9" height="15.0" fill="rgb(229,110,26)" rx="2" ry="2" />
<text x="39.39" y="255.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (48,940,885 samples, 0.39%)</title><rect x="959.5" y="421" width="4.6" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="962.48" y="431.5" ></text>
</g>
<g >
<title>bvec_free (74,906,592 samples, 0.60%)</title><rect x="1006.6" y="181" width="7.1" height="15.0" fill="rgb(212,33,8)" rx="2" ry="2" />
<text x="1009.64" y="191.5" ></text>
</g>
<g >
<title>smp_call_function_single (69,621,029 samples, 0.56%)</title><rect x="53.8" y="325" width="6.6" height="15.0" fill="rgb(223,83,19)" rx="2" ry="2" />
<text x="56.81" y="335.5" ></text>
</g>
<g >
<title>page_counter_try_charge (16,552,524 samples, 0.13%)</title><rect x="342.3" y="261" width="1.5" height="15.0" fill="rgb(233,132,31)" rx="2" ry="2" />
<text x="345.29" y="271.5" ></text>
</g>
<g >
<title>filemap_get_read_batch (47,718,386 samples, 0.38%)</title><rect x="260.7" y="389" width="4.5" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="263.68" y="399.5" ></text>
</g>
<g >
<title>StrategyGetBuffer (15,081,073 samples, 0.12%)</title><rect x="149.5" y="533" width="1.4" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="152.45" y="543.5" ></text>
</g>
<g >
<title>flush_smp_call_function_queue (14,615,188 samples, 0.12%)</title><rect x="1169.4" y="469" width="1.4" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="1172.42" y="479.5" ></text>
</g>
<g >
<title>rseq_ip_fixup (4,710,755 samples, 0.04%)</title><rect x="963.7" y="325" width="0.4" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="966.66" y="335.5" ></text>
</g>
<g >
<title>flush_smp_call_function_from_idle (2,521,794 samples, 0.02%)</title><rect x="1183.0" y="501" width="0.2" height="15.0" fill="rgb(225,92,22)" rx="2" ry="2" />
<text x="1185.99" y="511.5" ></text>
</g>
<g >
<title>__blk_mq_sched_dispatch_requests (68,734,431 samples, 0.55%)</title><rect x="352.9" y="197" width="6.5" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="355.89" y="207.5" ></text>
</g>
<g >
<title>_perf_ioctl (86,760,879 samples, 0.69%)</title><rect x="52.4" y="389" width="8.2" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="55.41" y="399.5" ></text>
</g>
<g >
<title>blk_mq_do_dispatch_sched (68,734,431 samples, 0.55%)</title><rect x="352.9" y="181" width="6.5" height="15.0" fill="rgb(209,18,4)" rx="2" ry="2" />
<text x="355.89" y="191.5" ></text>
</g>
<g >
<title>tick_program_event (2,639,585 samples, 0.02%)</title><rect x="1167.8" y="405" width="0.3" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="1170.80" y="415.5" ></text>
</g>
<g >
<title>find_busiest_group (71,793,633 samples, 0.57%)</title><rect x="1183.2" y="245" width="6.8" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="1186.23" y="255.5" ></text>
</g>
<g >
<title>rcu_idle_exit (1,997,357 samples, 0.02%)</title><rect x="1153.3" y="437" width="0.2" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="1156.26" y="447.5" ></text>
</g>
<g >
<title>__cond_resched (9,672,160 samples, 0.08%)</title><rect x="64.4" y="293" width="0.9" height="15.0" fill="rgb(217,58,14)" rx="2" ry="2" />
<text x="67.37" y="303.5" ></text>
</g>
<g >
<title>cfree@GLIBC_2.2.5 (16,530,979 samples, 0.13%)</title><rect x="201.6" y="533" width="1.5" height="15.0" fill="rgb(233,131,31)" rx="2" ry="2" />
<text x="204.57" y="543.5" ></text>
</g>
<g >
<title>process_one_work (50,929,006 samples, 0.41%)</title><rect x="17.3" y="501" width="4.8" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="20.27" y="511.5" ></text>
</g>
<g >
<title>blk_mq_submit_bio (31,205,561 samples, 0.25%)</title><rect x="366.4" y="245" width="3.0" height="15.0" fill="rgb(224,89,21)" rx="2" ry="2" />
<text x="369.42" y="255.5" ></text>
</g>
<g >
<title>__submit_bio (31,205,561 samples, 0.25%)</title><rect x="366.4" y="261" width="3.0" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="369.42" y="271.5" ></text>
</g>
<g >
<title>acpi_ns_search_parent_tree (2,175,907 samples, 0.02%)</title><rect x="17.1" y="309" width="0.2" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="20.06" y="319.5" ></text>
</g>
<g >
<title>schedule_idle (39,589,747 samples, 0.32%)</title><rect x="1171.9" y="485" width="3.7" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="1174.86" y="495.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (2,114,728 samples, 0.02%)</title><rect x="963.5" y="293" width="0.2" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="966.46" y="303.5" ></text>
</g>
<g >
<title>__irqentry_text_start (100,068,456 samples, 0.80%)</title><rect x="975.4" y="453" width="9.5" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="978.43" y="463.5" ></text>
</g>
<g >
<title>ExecInterpExpr (30,588,100 samples, 0.24%)</title><rect x="152.2" y="517" width="2.9" height="15.0" fill="rgb(225,96,22)" rx="2" ry="2" />
<text x="155.22" y="527.5" ></text>
</g>
<g >
<title>__writeback_inodes_wb (75,817,615 samples, 0.61%)</title><rect x="22.1" y="437" width="7.1" height="15.0" fill="rgb(234,133,32)" rx="2" ry="2" />
<text x="25.07" y="447.5" ></text>
</g>
<g >
<title>vfs_open (80,096,276 samples, 0.64%)</title><rect x="44.5" y="293" width="7.6" height="15.0" fill="rgb(232,125,29)" rx="2" ry="2" />
<text x="47.50" y="303.5" ></text>
</g>
<g >
<title>wb_writeback (75,817,615 samples, 0.61%)</title><rect x="22.1" y="453" width="7.1" height="15.0" fill="rgb(222,80,19)" rx="2" ry="2" />
<text x="25.07" y="463.5" ></text>
</g>
<g >
<title>page_cache_ra_unbounded (1,140,737,731 samples, 9.12%)</title><rect x="266.7" y="357" width="107.6" height="15.0" fill="rgb(234,134,32)" rx="2" ry="2" />
<text x="269.66" y="367.5" >page_cache_ra..</text>
</g>
<g >
<title>native_sched_clock (2,035,496 samples, 0.02%)</title><rect x="1170.6" y="389" width="0.2" height="15.0" fill="rgb(218,64,15)" rx="2" ry="2" />
<text x="1173.60" y="399.5" ></text>
</g>
<g >
<title>do_syscall_64 (145,676,058 samples, 1.16%)</title><rect x="90.9" y="485" width="13.8" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="93.92" y="495.5" ></text>
</g>
<g >
<title>cpuidle_idle_call (2,056,517,784 samples, 16.44%)</title><rect x="975.4" y="485" width="194.0" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="978.43" y="495.5" >cpuidle_idle_call</text>
</g>
<g >
<title>tick_nohz_stop_tick (19,363,491 samples, 0.15%)</title><rect x="1167.6" y="453" width="1.8" height="15.0" fill="rgb(235,139,33)" rx="2" ry="2" />
<text x="1170.59" y="463.5" ></text>
</g>
<g >
<title>do_syscall_64 (74,888,722 samples, 0.60%)</title><rect x="10.0" y="501" width="7.1" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="13.00" y="511.5" ></text>
</g>
<g >
<title>do_syscall_64 (20,967,575 samples, 0.17%)</title><rect x="387.8" y="517" width="2.0" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="390.78" y="527.5" ></text>
</g>
<g >
<title>alloc_pages_vma (8,052,033 samples, 0.06%)</title><rect x="972.4" y="421" width="0.7" height="15.0" fill="rgb(253,224,53)" rx="2" ry="2" />
<text x="975.36" y="431.5" ></text>
</g>
<g >
<title>scheduler_tick (4,782,572 samples, 0.04%)</title><rect x="1016.1" y="309" width="0.4" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="1019.08" y="319.5" ></text>
</g>
<g >
<title>__mod_memcg_lruvec_state (2,729,057 samples, 0.02%)</title><rect x="956.2" y="197" width="0.2" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="959.16" y="207.5" ></text>
</g>
<g >
<title>hrtimer_reprogram (1,110,320 samples, 0.01%)</title><rect x="1168.1" y="421" width="0.1" height="15.0" fill="rgb(222,80,19)" rx="2" ry="2" />
<text x="1171.05" y="431.5" ></text>
</g>
<g >
<title>__cgroup_throttle_swaprate (74,092,432 samples, 0.59%)</title><rect x="964.9" y="421" width="7.0" height="15.0" fill="rgb(221,73,17)" rx="2" ry="2" />
<text x="967.94" y="431.5" ></text>
</g>
</g>
</svg>
Attachment: perf-with-64bit-1.8m.svg (application/octet-stream)
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" width="1200" height="630" onload="init(evt)" viewBox="0 0 1200 630" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<!-- Flame graph stack visualization. See https://github.com/brendangregg/FlameGraph for latest version, and http://www.brendangregg.com/flamegraphs.html for examples. -->
<!-- NOTES: -->
<defs>
<linearGradient id="background" y1="0" y2="1" x1="0" x2="0" >
<stop stop-color="#eeeeee" offset="5%" />
<stop stop-color="#eeeeb0" offset="95%" />
</linearGradient>
</defs>
<style type="text/css">
text { font-family:Verdana; font-size:12px; fill:rgb(0,0,0); }
#search, #ignorecase { opacity:0.1; cursor:pointer; }
#search:hover, #search.show, #ignorecase:hover, #ignorecase.show { opacity:1; }
#subtitle { text-anchor:middle; font-color:rgb(160,160,160); }
#title { text-anchor:middle; font-size:17px}
#unzoom { cursor:pointer; }
#frames > *:hover { stroke:black; stroke-width:0.5; cursor:pointer; }
.hide { display:none; }
.parent { opacity:0.5; }
</style>
<script type="text/ecmascript">
<![CDATA[
"use strict";
var details, searchbtn, unzoombtn, matchedtxt, svg, searching, currentSearchTerm, ignorecase, ignorecaseBtn;
function init(evt) {
details = document.getElementById("details").firstChild;
searchbtn = document.getElementById("search");
ignorecaseBtn = document.getElementById("ignorecase");
unzoombtn = document.getElementById("unzoom");
matchedtxt = document.getElementById("matched");
svg = document.getElementsByTagName("svg")[0];
searching = 0;
currentSearchTerm = null;
// use GET parameters to restore a flamegraphs state.
var params = get_params();
if (params.x && params.y)
zoom(find_group(document.querySelector('[x="' + params.x + '"][y="' + params.y + '"]')));
if (params.s) search(params.s);
}
// event listeners
window.addEventListener("click", function(e) {
var target = find_group(e.target);
if (target) {
if (target.nodeName == "a") {
if (e.ctrlKey === false) return;
e.preventDefault();
}
if (target.classList.contains("parent")) unzoom(true);
zoom(target);
if (!document.querySelector('.parent')) {
// we have basically done a clearzoom so clear the url
var params = get_params();
if (params.x) delete params.x;
if (params.y) delete params.y;
history.replaceState(null, null, parse_params(params));
unzoombtn.classList.add("hide");
return;
}
// set parameters for zoom state
var el = target.querySelector("rect");
if (el && el.attributes && el.attributes.y && el.attributes._orig_x) {
var params = get_params()
params.x = el.attributes._orig_x.value;
params.y = el.attributes.y.value;
history.replaceState(null, null, parse_params(params));
}
}
else if (e.target.id == "unzoom") clearzoom();
else if (e.target.id == "search") search_prompt();
else if (e.target.id == "ignorecase") toggle_ignorecase();
}, false)
// mouse-over for info
// show
window.addEventListener("mouseover", function(e) {
var target = find_group(e.target);
if (target) details.nodeValue = "Function: " + g_to_text(target);
}, false)
// clear
window.addEventListener("mouseout", function(e) {
var target = find_group(e.target);
if (target) details.nodeValue = ' ';
}, false)
// ctrl-F for search
// ctrl-I to toggle case-sensitive search
window.addEventListener("keydown",function (e) {
if (e.keyCode === 114 || (e.ctrlKey && e.keyCode === 70)) {
e.preventDefault();
search_prompt();
}
else if (e.ctrlKey && e.keyCode === 73) {
e.preventDefault();
toggle_ignorecase();
}
}, false)
// functions
function get_params() {
var params = {};
var paramsarr = window.location.search.substr(1).split('&');
for (var i = 0; i < paramsarr.length; ++i) {
var tmp = paramsarr[i].split("=");
if (!tmp[0] || !tmp[1]) continue;
params[tmp[0]] = decodeURIComponent(tmp[1]);
}
return params;
}
function parse_params(params) {
var uri = "?";
for (var key in params) {
uri += key + '=' + encodeURIComponent(params[key]) + '&';
}
if (uri.slice(-1) == "&")
uri = uri.substring(0, uri.length - 1);
if (uri == '?')
uri = window.location.href.split('?')[0];
return uri;
}
function find_child(node, selector) {
var children = node.querySelectorAll(selector);
if (children.length) return children[0];
}
function find_group(node) {
var parent = node.parentElement;
if (!parent) return;
if (parent.id == "frames") return node;
return find_group(parent);
}
function orig_save(e, attr, val) {
if (e.attributes["_orig_" + attr] != undefined) return;
if (e.attributes[attr] == undefined) return;
if (val == undefined) val = e.attributes[attr].value;
e.setAttribute("_orig_" + attr, val);
}
function orig_load(e, attr) {
if (e.attributes["_orig_"+attr] == undefined) return;
e.attributes[attr].value = e.attributes["_orig_" + attr].value;
e.removeAttribute("_orig_"+attr);
}
function g_to_text(e) {
var text = find_child(e, "title").firstChild.nodeValue;
return text;
}
function g_to_func(e) {
var func = g_to_text(e);
// if there's any manipulation we want to do to the function
// name before it's searched, do it here before returning.
return (func);
}
function update_text(e) {
var r = find_child(e, "rect");
var t = find_child(e, "text");
var w = parseFloat(r.attributes.width.value) -3;
var txt = find_child(e, "title").textContent.replace(/\([^(]*\)$/,"");
t.attributes.x.value = parseFloat(r.attributes.x.value) + 3;
// Smaller than this size won't fit anything
if (w < 2 * 12 * 0.59) {
t.textContent = "";
return;
}
t.textContent = txt;
var sl = t.getSubStringLength(0, txt.length);
// check if only whitespace or if we can fit the entire string into width w
if (/^ *$/.test(txt) || sl < w)
return;
// this isn't perfect, but gives a good starting point
// and avoids calling getSubStringLength too often
var start = Math.floor((w/sl) * txt.length);
for (var x = start; x > 0; x = x-2) {
if (t.getSubStringLength(0, x + 2) <= w) {
t.textContent = txt.substring(0, x) + "..";
return;
}
}
t.textContent = "";
}
// zoom
function zoom_reset(e) {
if (e.attributes != undefined) {
orig_load(e, "x");
orig_load(e, "width");
}
if (e.childNodes == undefined) return;
for (var i = 0, c = e.childNodes; i < c.length; i++) {
zoom_reset(c[i]);
}
}
function zoom_child(e, x, ratio) {
if (e.attributes != undefined) {
if (e.attributes.x != undefined) {
orig_save(e, "x");
e.attributes.x.value = (parseFloat(e.attributes.x.value) - x - 10) * ratio + 10;
if (e.tagName == "text")
e.attributes.x.value = parseFloat(find_child(e.parentNode, "rect[x]").attributes.x.value) + 3;
}
if (e.attributes.width != undefined) {
orig_save(e, "width");
e.attributes.width.value = parseFloat(e.attributes.width.value) * ratio;
}
}
if (e.childNodes == undefined) return;
for (var i = 0, c = e.childNodes; i < c.length; i++) {
zoom_child(c[i], x - 10, ratio);
}
}
function zoom_parent(e) {
if (e.attributes) {
if (e.attributes.x != undefined) {
orig_save(e, "x");
e.attributes.x.value = 10;
}
if (e.attributes.width != undefined) {
orig_save(e, "width");
e.attributes.width.value = parseInt(svg.width.baseVal.value) - (10 * 2);
}
}
if (e.childNodes == undefined) return;
for (var i = 0, c = e.childNodes; i < c.length; i++) {
zoom_parent(c[i]);
}
}
function zoom(node) {
var attr = find_child(node, "rect").attributes;
var width = parseFloat(attr.width.value);
var xmin = parseFloat(attr.x.value);
var xmax = xmin + width;
var ymin = parseFloat(attr.y.value);
var ratio = (svg.width.baseVal.value - 2 * 10) / width;
// XXX: Workaround for JavaScript float issues (fix me)
var fudge = 0.0001;
unzoombtn.classList.remove("hide");
var el = document.getElementById("frames").children;
for (var i = 0; i < el.length; i++) {
var e = el[i];
var a = find_child(e, "rect").attributes;
var ex = parseFloat(a.x.value);
var ew = parseFloat(a.width.value);
var upstack;
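// Is it an ancestor? The constant test below looks like a value filled in
// by the generator; the else branch would cover an inverted layout.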
if (0 == 0) {
upstack = parseFloat(a.y.value) > ymin;
} else {
upstack = parseFloat(a.y.value) < ymin;
}
if (upstack) {
// Direct ancestor
if (ex <= xmin && (ex+ew+fudge) >= xmax) {
e.classList.add("parent");
zoom_parent(e);
update_text(e);
}
// not in current path
else
e.classList.add("hide");
}
// Children maybe
else {
// no common path
if (ex < xmin || ex + fudge >= xmax) {
e.classList.add("hide");
}
else {
zoom_child(e, xmin, ratio);
update_text(e);
}
}
}
search();
}
function unzoom(dont_update_text) {
unzoombtn.classList.add("hide");
var el = document.getElementById("frames").children;
for(var i = 0; i < el.length; i++) {
el[i].classList.remove("parent");
el[i].classList.remove("hide");
zoom_reset(el[i]);
if(!dont_update_text) update_text(el[i]);
}
search();
}
function clearzoom() {
unzoom();
// remove zoom state
var params = get_params();
if (params.x) delete params.x;
if (params.y) delete params.y;
history.replaceState(null, null, parse_params(params));
}
// search
function toggle_ignorecase() {
ignorecase = !ignorecase;
if (ignorecase) {
ignorecaseBtn.classList.add("show");
} else {
ignorecaseBtn.classList.remove("show");
}
reset_search();
search();
}
function reset_search() {
var el = document.querySelectorAll("#frames rect");
for (var i = 0; i < el.length; i++) {
orig_load(el[i], "fill");
}
var params = get_params();
delete params.s;
history.replaceState(null, null, parse_params(params));
}
function search_prompt() {
if (!searching) {
var term = prompt("Enter a search term (regexp " +
"allowed, e.g. ^ext4_)"
+ (ignorecase ? ", ignoring case" : "")
+ "\nPress Ctrl-i to toggle case sensitivity", "");
if (term != null) search(term);
} else {
reset_search();
searching = 0;
currentSearchTerm = null;
searchbtn.classList.remove("show");
searchbtn.firstChild.nodeValue = "Search";
matchedtxt.classList.add("hide");
matchedtxt.firstChild.nodeValue = "";
}
}
function search(term) {
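if (term) currentSearchTerm = term;
// no active search (e.g. called from zoom() or unzoom()): nothing to highlight
if (!currentSearchTerm) return;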
var re = new RegExp(currentSearchTerm, ignorecase ? 'i' : '');
var el = document.getElementById("frames").children;
var matches = new Object();
var maxwidth = 0;
for (var i = 0; i < el.length; i++) {
var e = el[i];
var func = g_to_func(e);
var rect = find_child(e, "rect");
if (func == null || rect == null)
continue;
// Save max width. Only works as we have a root frame
var w = parseFloat(rect.attributes.width.value);
if (w > maxwidth)
maxwidth = w;
if (func.match(re)) {
// highlight
var x = parseFloat(rect.attributes.x.value);
orig_save(rect, "fill");
rect.attributes.fill.value = "rgb(230,0,230)";
// remember matches
if (matches[x] == undefined) {
matches[x] = w;
} else {
if (w > matches[x]) {
// overwrite with parent
matches[x] = w;
}
}
searching = 1;
}
}
if (!searching)
return;
var params = get_params();
params.s = currentSearchTerm;
history.replaceState(null, null, parse_params(params));
searchbtn.classList.add("show");
searchbtn.firstChild.nodeValue = "Reset Search";
// calculate percent matched, excluding vertical overlap
var count = 0;
var lastx = -1;
var lastw = 0;
var keys = Array();
for (var k in matches) {
if (matches.hasOwnProperty(k))
keys.push(k);
}
// sort the matched frames by their x location, ascending
// (matches already keeps only the widest frame for each x)
keys.sort(function(a, b){
return a - b;
});
// Step through frames saving only the biggest bottom-up frames
// thanks to the sort order. This relies on the tree property
// where children are always smaller than their parents.
var fudge = 0.0001; // JavaScript floating point
for (var k in keys) {
var x = parseFloat(keys[k]);
var w = matches[keys[k]];
if (x >= lastx + lastw - fudge) {
count += w;
lastx = x;
lastw = w;
}
}
// display matched percent
matchedtxt.classList.remove("hide");
var pct = 100 * count / maxwidth;
if (pct != 100) pct = pct.toFixed(1);
matchedtxt.firstChild.nodeValue = "Matched: " + pct + "%";
}
]]>
</script>
<rect x="0.0" y="0" width="1200.0" height="630.0" fill="url(#background)" />
<text id="title" x="600.00" y="24" >Flame Graph</text>
<text id="details" x="10.00" y="613" > </text>
<text id="unzoom" x="10.00" y="24" class="hide">Reset Zoom</text>
<text id="search" x="1090.00" y="24" >Search</text>
<text id="ignorecase" x="1174.00" y="24" >ic</text>
<text id="matched" x="1090.00" y="613" > </text>
<g id="frames">
<g >
<title>submit_bio_noacct (22,302,984 samples, 0.18%)</title><rect x="589.6" y="277" width="2.2" height="15.0" fill="rgb(211,30,7)" rx="2" ry="2" />
<text x="592.62" y="287.5" ></text>
</g>
<g >
<title>ixgbe_update_stats (74,392,349 samples, 0.61%)</title><rect x="22.7" y="469" width="7.2" height="15.0" fill="rgb(217,55,13)" rx="2" ry="2" />
<text x="25.70" y="479.5" ></text>
</g>
<g >
<title>scheduler_tick (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="293" width="0.2" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="1123.49" y="303.5" ></text>
</g>
<g >
<title>syscall_exit_to_user_mode (4,656,696 samples, 0.04%)</title><rect x="1123.2" y="389" width="0.5" height="15.0" fill="rgb(251,211,50)" rx="2" ry="2" />
<text x="1126.24" y="399.5" ></text>
</g>
<g >
<title>mwait_idle_with_hints.constprop.0 (1,309,459 samples, 0.01%)</title><rect x="1189.6" y="373" width="0.2" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="1192.62" y="383.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irq (1,166,681 samples, 0.01%)</title><rect x="59.6" y="341" width="0.1" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="62.62" y="351.5" ></text>
</g>
<g >
<title>__softirqentry_text_start (19,744,926 samples, 0.16%)</title><rect x="403.3" y="453" width="1.9" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="406.30" y="463.5" ></text>
</g>
<g >
<title>exit_to_user_mode_loop (2,430,757 samples, 0.02%)</title><rect x="1123.2" y="357" width="0.3" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="1126.24" y="367.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irqsave (67,377,381 samples, 0.55%)</title><rect x="104.7" y="405" width="6.5" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="107.71" y="415.5" ></text>
</g>
<g >
<title>search_binary_handler (73,446,926 samples, 0.60%)</title><rect x="45.5" y="421" width="7.1" height="15.0" fill="rgb(208,14,3)" rx="2" ry="2" />
<text x="48.55" y="431.5" ></text>
</g>
<g >
<title>perf_cpu_map__idx (3,562,672 samples, 0.03%)</title><rect x="68.9" y="453" width="0.3" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="71.90" y="463.5" ></text>
</g>
<g >
<title>get_page_from_freelist (1,788,381 samples, 0.01%)</title><rect x="1126.5" y="389" width="0.1" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="1129.47" y="399.5" ></text>
</g>
<g >
<title>do_anonymous_page (1,860,090 samples, 0.02%)</title><rect x="1127.4" y="437" width="0.2" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="1130.42" y="447.5" ></text>
</g>
<g >
<title>ExecStoreBufferHeapTuple (21,596,624 samples, 0.18%)</title><rect x="121.9" y="549" width="2.1" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="124.94" y="559.5" ></text>
</g>
<g >
<title>clockevents_program_event (1,844,927 samples, 0.02%)</title><rect x="1182.1" y="389" width="0.2" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="1185.08" y="399.5" ></text>
</g>
<g >
<title>cpuidle_enter (499,047,025 samples, 4.09%)</title><rect x="1130.3" y="469" width="48.2" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="1133.28" y="479.5" >cpui..</text>
</g>
<g >
<title>clear_page_erms (20,943,317 samples, 0.17%)</title><rect x="607.3" y="357" width="2.1" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="610.33" y="367.5" ></text>
</g>
<g >
<title>perf_poll (82,051,342 samples, 0.67%)</title><rect x="81.3" y="437" width="7.9" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="84.28" y="447.5" ></text>
</g>
<g >
<title>__mod_timer (64,385,252 samples, 0.53%)</title><rect x="16.5" y="277" width="6.2" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="19.48" y="287.5" ></text>
</g>
<g >
<title>get_mem_cgroup_from_mm (2,244,113 samples, 0.02%)</title><rect x="1126.1" y="405" width="0.2" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="1129.10" y="415.5" ></text>
</g>
<g >
<title>perf_mmap__push (75,847,424 samples, 0.62%)</title><rect x="89.2" y="517" width="7.3" height="15.0" fill="rgb(248,201,48)" rx="2" ry="2" />
<text x="92.20" y="527.5" ></text>
</g>
<g >
<title>exc_page_fault (26,647,528 samples, 0.22%)</title><rect x="1124.7" y="517" width="2.5" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="1127.67" y="527.5" ></text>
</g>
<g >
<title>call_timer_fn (84,901,418 samples, 0.70%)</title><rect x="1139.6" y="341" width="8.2" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1142.60" y="351.5" ></text>
</g>
<g >
<title>update_rq_clock (1,994,880 samples, 0.02%)</title><rect x="1181.2" y="453" width="0.2" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="1184.23" y="463.5" ></text>
</g>
<g >
<title>dequeue_task (1,794,773 samples, 0.01%)</title><rect x="1122.6" y="277" width="0.2" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="1125.60" y="287.5" ></text>
</g>
<g >
<title>start_secondary (541,804,420 samples, 4.44%)</title><rect x="1130.1" y="533" width="52.4" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="1133.13" y="543.5" >start..</text>
</g>
<g >
<title>_Fork (63,281,290 samples, 0.52%)</title><rect x="429.4" y="549" width="6.1" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="432.37" y="559.5" ></text>
</g>
<g >
<title>[psql] (266,677,017 samples, 2.18%)</title><rect x="1097.9" y="501" width="25.8" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="1100.93" y="511.5" >[..</text>
</g>
<g >
<title>tick_nohz_stop_tick (9,667,506 samples, 0.08%)</title><rect x="1179.1" y="453" width="0.9" height="15.0" fill="rgb(235,139,33)" rx="2" ry="2" />
<text x="1182.09" y="463.5" ></text>
</g>
<g >
<title>asm_common_interrupt (22,984,339 samples, 0.19%)</title><rect x="401.1" y="501" width="2.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="404.08" y="511.5" ></text>
</g>
<g >
<title>blk_mq_dispatch_rq_list (41,329,388 samples, 0.34%)</title><rect x="581.3" y="149" width="4.0" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="584.34" y="159.5" ></text>
</g>
<g >
<title>handle_pte_fault (3,419,088 samples, 0.03%)</title><rect x="1116.5" y="325" width="0.3" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="1119.47" y="335.5" ></text>
</g>
<g >
<title>__x64_sys_sendto (39,667,816 samples, 0.32%)</title><rect x="605.5" y="501" width="3.9" height="15.0" fill="rgb(213,37,8)" rx="2" ry="2" />
<text x="608.53" y="511.5" ></text>
</g>
<g >
<title>cpu__max_cpu (1,181,317 samples, 0.01%)</title><rect x="68.8" y="437" width="0.1" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="71.78" y="447.5" ></text>
</g>
<g >
<title>asm_exc_page_fault (10,154,115 samples, 0.08%)</title><rect x="1115.8" y="405" width="1.0" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="1118.82" y="415.5" ></text>
</g>
<g >
<title>scsi_prepare_cmd (20,347,661 samples, 0.17%)</title><rect x="583.4" y="117" width="1.9" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="586.37" y="127.5" ></text>
</g>
<g >
<title>filemap_read (1,537,003,177 samples, 12.58%)</title><rect x="443.3" y="421" width="148.5" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="446.30" y="431.5" >filemap_read</text>
</g>
<g >
<title>rebalance_domains (19,744,926 samples, 0.16%)</title><rect x="403.3" y="421" width="1.9" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="406.30" y="431.5" ></text>
</g>
<g >
<title>rmqueue (1,788,381 samples, 0.01%)</title><rect x="1126.5" y="373" width="0.1" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="1129.47" y="383.5" ></text>
</g>
<g >
<title>add_timer (64,385,252 samples, 0.53%)</title><rect x="16.5" y="293" width="6.2" height="15.0" fill="rgb(210,25,6)" rx="2" ry="2" />
<text x="19.48" y="303.5" ></text>
</g>
<g >
<title>do_syscall_64 (63,281,290 samples, 0.52%)</title><rect x="429.4" y="517" width="6.1" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="432.37" y="527.5" ></text>
</g>
<g >
<title>do_syscall_64 (151,999,622 samples, 1.24%)</title><rect x="96.5" y="485" width="14.7" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="99.53" y="495.5" ></text>
</g>
<g >
<title>blk_mq_submit_bio (22,302,984 samples, 0.18%)</title><rect x="589.6" y="245" width="2.2" height="15.0" fill="rgb(224,89,21)" rx="2" ry="2" />
<text x="592.62" y="255.5" ></text>
</g>
<g >
<title>perf (672,029,760 samples, 5.50%)</title><rect x="52.6" y="565" width="65.0" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="55.64" y="575.5" >perf</text>
</g>
<g >
<title>__schedule (4,707,790 samples, 0.04%)</title><rect x="1122.6" y="293" width="0.5" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="1125.60" y="303.5" ></text>
</g>
<g >
<title>blk_mq_do_dispatch_sched (81,184,559 samples, 0.66%)</title><rect x="577.5" y="181" width="7.8" height="15.0" fill="rgb(209,18,4)" rx="2" ry="2" />
<text x="580.49" y="191.5" ></text>
</g>
<g >
<title>cpuidle_idle_call (76,827,900 samples, 0.63%)</title><rect x="1182.5" y="421" width="7.4" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="1185.46" y="431.5" ></text>
</g>
<g >
<title>__alloc_pages (3,419,088 samples, 0.03%)</title><rect x="1116.5" y="277" width="0.3" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="1119.47" y="287.5" ></text>
</g>
<g >
<title>lapic_next_deadline (1,844,927 samples, 0.02%)</title><rect x="1182.1" y="373" width="0.2" height="15.0" fill="rgb(222,82,19)" rx="2" ry="2" />
<text x="1185.08" y="383.5" ></text>
</g>
<g >
<title>__skb_datagram_iter (7,875,520 samples, 0.06%)</title><rect x="1119.6" y="261" width="0.7" height="15.0" fill="rgb(244,179,42)" rx="2" ry="2" />
<text x="1122.56" y="271.5" ></text>
</g>
<g >
<title>do_syscall_64 (73,446,926 samples, 0.60%)</title><rect x="45.5" y="517" width="7.1" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="48.55" y="527.5" ></text>
</g>
<g >
<title>__alloc_pages (39,667,816 samples, 0.32%)</title><rect x="605.5" y="389" width="3.9" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="608.53" y="399.5" ></text>
</g>
<g >
<title>perf_mmap__read_init (65,655,582 samples, 0.54%)</title><rect x="111.2" y="517" width="6.4" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="114.21" y="527.5" ></text>
</g>
<g >
<title>__common_interrupt (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="405" width="1.9" height="15.0" fill="rgb(237,150,36)" rx="2" ry="2" />
<text x="1133.35" y="415.5" ></text>
</g>
<g >
<title>__x64_sys_sched_setaffinity (46,657,971 samples, 0.38%)</title><rect x="69.4" y="405" width="4.5" height="15.0" fill="rgb(227,104,25)" rx="2" ry="2" />
<text x="72.36" y="415.5" ></text>
</g>
<g >
<title>tick_irq_enter (1,563,961 samples, 0.01%)</title><rect x="1132.4" y="389" width="0.1" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="1135.37" y="399.5" ></text>
</g>
<g >
<title>do_syscall_64 (34,851,992 samples, 0.29%)</title><rect x="1117.0" y="389" width="3.3" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="1119.96" y="399.5" ></text>
</g>
<g >
<title>perf-exec (73,446,926 samples, 0.60%)</title><rect x="45.5" y="565" width="7.1" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="48.55" y="575.5" ></text>
</g>
<g >
<title>irq_exit_rcu (22,984,339 samples, 0.19%)</title><rect x="401.1" y="485" width="2.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="404.08" y="495.5" ></text>
</g>
<g >
<title>__GI___libc_write (74,457,327 samples, 0.61%)</title><rect x="74.1" y="437" width="7.2" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="77.09" y="447.5" ></text>
</g>
<g >
<title>fput (1,173,831 samples, 0.01%)</title><rect x="68.2" y="421" width="0.1" height="15.0" fill="rgb(225,96,23)" rx="2" ry="2" />
<text x="71.15" y="431.5" ></text>
</g>
<g >
<title>exit_mmap (73,446,926 samples, 0.60%)</title><rect x="45.5" y="341" width="7.1" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="48.55" y="351.5" ></text>
</g>
<g >
<title>affinity__set (1,181,317 samples, 0.01%)</title><rect x="68.8" y="453" width="0.1" height="15.0" fill="rgb(248,201,48)" rx="2" ry="2" />
<text x="71.78" y="463.5" ></text>
</g>
<g >
<title>unix_destruct_scm (1,936,961 samples, 0.02%)</title><rect x="1119.4" y="261" width="0.2" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="1122.38" y="271.5" ></text>
</g>
<g >
<title>__strlen_evex (46,167,764 samples, 0.38%)</title><rect x="419.2" y="517" width="4.4" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="422.16" y="527.5" ></text>
</g>
<g >
<title>hrtimer_try_to_cancel.part.0 (1,844,927 samples, 0.02%)</title><rect x="1182.1" y="437" width="0.2" height="15.0" fill="rgb(253,221,52)" rx="2" ry="2" />
<text x="1185.08" y="447.5" ></text>
</g>
<g >
<title>__sys_recvfrom (1,573,230 samples, 0.01%)</title><rect x="1117.0" y="373" width="0.1" height="15.0" fill="rgb(247,197,47)" rx="2" ry="2" />
<text x="1119.96" y="383.5" ></text>
</g>
<g >
<title>tick_sched_handle (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="277" width="7.1" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="1185.46" y="287.5" ></text>
</g>
<g >
<title>socket_putmessage (23,457,052 samples, 0.19%)</title><rect x="1091.9" y="549" width="2.2" height="15.0" fill="rgb(241,169,40)" rx="2" ry="2" />
<text x="1094.86" y="559.5" ></text>
</g>
<g >
<title>rcu_idle_exit (1,372,280 samples, 0.01%)</title><rect x="1178.2" y="437" width="0.2" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="1181.22" y="447.5" ></text>
</g>
<g >
<title>ata_scsi_qc_complete (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="245" width="1.9" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="1133.35" y="255.5" ></text>
</g>
<g >
<title>rw_verify_area (40,015,512 samples, 0.33%)</title><rect x="593.7" y="469" width="3.9" height="15.0" fill="rgb(218,64,15)" rx="2" ry="2" />
<text x="596.73" y="479.5" ></text>
</g>
<g >
<title>blk_mq_sched_dispatch_requests (64,385,252 samples, 0.53%)</title><rect x="16.5" y="405" width="6.2" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="19.48" y="415.5" ></text>
</g>
<g >
<title>__intel_pmu_enable_all.constprop.0 (1,494,337 samples, 0.01%)</title><rect x="66.6" y="197" width="0.1" height="15.0" fill="rgb(214,42,10)" rx="2" ry="2" />
<text x="69.56" y="207.5" ></text>
</g>
<g >
<title>new_sync_read (1,578,471,971 samples, 12.92%)</title><rect x="441.3" y="469" width="152.4" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="444.26" y="479.5" >new_sync_read</text>
</g>
<g >
<title>__libc_pread (1,700,456,947 samples, 13.92%)</title><rect x="435.5" y="549" width="164.2" height="15.0" fill="rgb(238,156,37)" rx="2" ry="2" />
<text x="438.49" y="559.5" >__libc_pread</text>
</g>
<g >
<title>ext4_file_read_iter (1,578,471,971 samples, 12.92%)</title><rect x="441.3" y="453" width="152.4" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="444.26" y="463.5" >ext4_file_read_iter</text>
</g>
<g >
<title>update_sd_lb_stats.constprop.0 (19,744,926 samples, 0.16%)</title><rect x="403.3" y="373" width="1.9" height="15.0" fill="rgb(223,86,20)" rx="2" ry="2" />
<text x="406.30" y="383.5" ></text>
</g>
<g >
<title>blk_mq_dispatch_rq_list (64,385,252 samples, 0.53%)</title><rect x="16.5" y="373" width="6.2" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="19.48" y="383.5" ></text>
</g>
<g >
<title>__blk_mq_run_hw_queue (81,184,559 samples, 0.66%)</title><rect x="577.5" y="229" width="7.8" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="580.49" y="239.5" ></text>
</g>
<g >
<title>do_idle (78,039,624 samples, 0.64%)</title><rect x="1182.5" y="437" width="7.5" height="15.0" fill="rgb(212,34,8)" rx="2" ry="2" />
<text x="1185.46" y="447.5" ></text>
</g>
<g >
<title>unmap_single_vma (73,446,926 samples, 0.60%)</title><rect x="45.5" y="309" width="7.1" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="48.55" y="319.5" ></text>
</g>
<g >
<title>submit_bio_noacct (74,661,841 samples, 0.61%)</title><rect x="38.3" y="277" width="7.2" height="15.0" fill="rgb(211,30,7)" rx="2" ry="2" />
<text x="41.33" y="287.5" ></text>
</g>
<g >
<title>skb_release_data (19,558,781 samples, 0.16%)</title><rect x="1117.5" y="277" width="1.9" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="1120.49" y="287.5" ></text>
</g>
<g >
<title>rcu_eqs_exit.constprop.0 (1,372,280 samples, 0.01%)</title><rect x="1178.2" y="421" width="0.2" height="15.0" fill="rgb(209,20,5)" rx="2" ry="2" />
<text x="1181.22" y="431.5" ></text>
</g>
<g >
<title>__blk_mq_delay_run_hw_queue (81,184,559 samples, 0.66%)</title><rect x="577.5" y="245" width="7.8" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="580.49" y="255.5" ></text>
</g>
<g >
<title>flush_smp_call_function_from_idle (9,953,266 samples, 0.08%)</title><rect x="1180.0" y="485" width="1.0" height="15.0" fill="rgb(225,92,22)" rx="2" ry="2" />
<text x="1183.03" y="495.5" ></text>
</g>
<g >
<title>__queue_delayed_work (64,385,252 samples, 0.53%)</title><rect x="16.5" y="309" width="6.2" height="15.0" fill="rgb(246,189,45)" rx="2" ry="2" />
<text x="19.48" y="319.5" ></text>
</g>
<g >
<title>xas_store (19,987,878 samples, 0.16%)</title><rect x="569.5" y="309" width="1.9" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="572.48" y="319.5" ></text>
</g>
<g >
<title>__sys_recvfrom (33,278,762 samples, 0.27%)</title><rect x="1117.1" y="357" width="3.2" height="15.0" fill="rgb(247,197,47)" rx="2" ry="2" />
<text x="1120.11" y="367.5" ></text>
</g>
<g >
<title>page_counter_cancel (9,283,557 samples, 0.08%)</title><rect x="1117.9" y="181" width="0.9" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="1120.85" y="191.5" ></text>
</g>
<g >
<title>hrtimer_start_range_ns (1,211,724 samples, 0.01%)</title><rect x="1189.9" y="389" width="0.1" height="15.0" fill="rgb(244,179,42)" rx="2" ry="2" />
<text x="1192.88" y="399.5" ></text>
</g>
<g >
<title>exec_binprm (73,446,926 samples, 0.60%)</title><rect x="45.5" y="437" width="7.1" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="48.55" y="447.5" ></text>
</g>
<g >
<title>bio_alloc_bioset (21,197,359 samples, 0.17%)</title><rect x="587.6" y="293" width="2.0" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="590.57" y="303.5" ></text>
</g>
<g >
<title>__GI___execve (73,446,926 samples, 0.60%)</title><rect x="45.5" y="549" width="7.1" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="48.55" y="559.5" ></text>
</g>
<g >
<title>charge_memcg (22,409,228 samples, 0.18%)</title><rect x="559.5" y="293" width="2.2" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="562.54" y="303.5" ></text>
</g>
<g >
<title>psi_group_change (12,197,173 samples, 0.10%)</title><rect x="72.7" y="245" width="1.2" height="15.0" fill="rgb(226,101,24)" rx="2" ry="2" />
<text x="75.68" y="255.5" ></text>
</g>
<g >
<title>int8out (18,088,099 samples, 0.15%)</title><rect x="417.4" y="533" width="1.8" height="15.0" fill="rgb(241,167,40)" rx="2" ry="2" />
<text x="420.41" y="543.5" ></text>
</g>
<g >
<title>perf_event_task_tick (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="277" width="0.2" height="15.0" fill="rgb(205,3,0)" rx="2" ry="2" />
<text x="1123.49" y="287.5" ></text>
</g>
<g >
<title>do_poll.constprop.0 (82,051,342 samples, 0.67%)</title><rect x="81.3" y="453" width="7.9" height="15.0" fill="rgb(236,145,34)" rx="2" ry="2" />
<text x="84.28" y="463.5" ></text>
</g>
<g >
<title>iterate_groups (1,321,308 samples, 0.01%)</title><rect x="1122.8" y="277" width="0.1" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="1125.77" y="287.5" ></text>
</g>
<g >
<title>tick_program_event (1,211,724 samples, 0.01%)</title><rect x="1189.9" y="357" width="0.1" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="1192.88" y="367.5" ></text>
</g>
<g >
<title>[psql] (266,677,017 samples, 2.18%)</title><rect x="1097.9" y="517" width="25.8" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="1100.93" y="527.5" >[..</text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (19,744,926 samples, 0.16%)</title><rect x="403.3" y="501" width="1.9" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="406.30" y="511.5" ></text>
</g>
<g >
<title>wake_q_add (8,265,745 samples, 0.07%)</title><rect x="69.5" y="309" width="0.8" height="15.0" fill="rgb(248,200,47)" rx="2" ry="2" />
<text x="72.47" y="319.5" ></text>
</g>
<g >
<title>perf_poll (84,409,880 samples, 0.69%)</title><rect x="96.5" y="421" width="8.2" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="99.53" y="431.5" ></text>
</g>
<g >
<title>__remove_hrtimer (1,844,927 samples, 0.02%)</title><rect x="1182.1" y="421" width="0.2" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="1185.08" y="431.5" ></text>
</g>
<g >
<title>blk_mq_requeue_work (64,385,252 samples, 0.53%)</title><rect x="16.5" y="485" width="6.2" height="15.0" fill="rgb(217,58,13)" rx="2" ry="2" />
<text x="19.48" y="495.5" ></text>
</g>
<g >
<title>pg_server_to_client (24,016,050 samples, 0.20%)</title><rect x="405.2" y="517" width="2.3" height="15.0" fill="rgb(206,6,1)" rx="2" ry="2" />
<text x="408.21" y="527.5" ></text>
</g>
<g >
<title>filemap_get_pages (1,254,951,126 samples, 10.27%)</title><rect x="470.5" y="405" width="121.3" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="473.55" y="415.5" >filemap_get_pages</text>
</g>
<g >
<title>start_thread (217,655,204 samples, 1.78%)</title><rect x="96.5" y="549" width="21.1" height="15.0" fill="rgb(212,34,8)" rx="2" ry="2" />
<text x="99.53" y="559.5" ></text>
</g>
<g >
<title>__blk_mq_sched_dispatch_requests (81,184,559 samples, 0.66%)</title><rect x="577.5" y="197" width="7.8" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="580.49" y="207.5" ></text>
</g>
<g >
<title>__x64_sys_read (67,058,175 samples, 0.55%)</title><rect x="10.0" y="485" width="6.5" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="13.00" y="495.5" ></text>
</g>
<g >
<title>__remove_hrtimer (1,768,336 samples, 0.01%)</title><rect x="1132.2" y="357" width="0.2" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="1135.20" y="367.5" ></text>
</g>
<g >
<title>get_mem_cgroup_from_mm (22,711,892 samples, 0.19%)</title><rect x="561.7" y="293" width="2.2" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="564.70" y="303.5" ></text>
</g>
<g >
<title>asm_exc_page_fault (1,860,090 samples, 0.02%)</title><rect x="1127.4" y="533" width="0.2" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="1130.42" y="543.5" ></text>
</g>
<g >
<title>mod_delayed_work_on (64,385,252 samples, 0.53%)</title><rect x="16.5" y="325" width="6.2" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="19.48" y="335.5" ></text>
</g>
<g >
<title>menu_reflect (2,012,759 samples, 0.02%)</title><rect x="1178.5" y="453" width="0.2" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="1181.49" y="463.5" ></text>
</g>
<g >
<title>xas_load (22,664,423 samples, 0.19%)</title><rect x="474.5" y="373" width="2.2" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="477.46" y="383.5" ></text>
</g>
<g >
<title>PQgetResult (266,677,017 samples, 2.18%)</title><rect x="1097.9" y="453" width="25.8" height="15.0" fill="rgb(228,107,25)" rx="2" ry="2" />
<text x="1100.93" y="463.5" >P..</text>
</g>
<g >
<title>tick_nohz_idle_exit (5,561,674 samples, 0.05%)</title><rect x="1181.9" y="485" width="0.6" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="1184.92" y="495.5" ></text>
</g>
<g >
<title>rmqueue (139,090,117 samples, 1.14%)</title><rect x="493.1" y="277" width="13.4" height="15.0" fill="rgb(226,98,23)" rx="2" ry="2" />
<text x="496.08" y="287.5" ></text>
</g>
<g >
<title>cpuidle_enter (76,827,900 samples, 0.63%)</title><rect x="1182.5" y="405" width="7.4" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="1185.46" y="415.5" ></text>
</g>
<g >
<title>__handle_mm_fault (20,072,999 samples, 0.16%)</title><rect x="1125.0" y="469" width="1.9" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="1127.97" y="479.5" ></text>
</g>
<g >
<title>arch_call_rest_init (78,039,624 samples, 0.64%)</title><rect x="1182.5" y="485" width="7.5" height="15.0" fill="rgb(238,156,37)" rx="2" ry="2" />
<text x="1185.46" y="495.5" ></text>
</g>
<g >
<title>__sysvec_apic_timer_interrupt (1,119,473 samples, 0.01%)</title><rect x="62.3" y="261" width="0.1" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="65.25" y="271.5" ></text>
</g>
<g >
<title>__add_to_page_cache_locked (591,809,269 samples, 4.84%)</title><rect x="514.2" y="325" width="57.2" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="517.24" y="335.5" >__add_..</text>
</g>
<g >
<title>free_pcppages_bulk (4,157,215 samples, 0.03%)</title><rect x="1118.8" y="197" width="0.4" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="1121.75" y="207.5" ></text>
</g>
<g >
<title>_int_malloc (38,606,290 samples, 0.32%)</title><rect x="1123.7" y="549" width="3.7" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="1126.69" y="559.5" ></text>
</g>
<g >
<title>sock_sendmsg (39,667,816 samples, 0.32%)</title><rect x="605.5" y="469" width="3.9" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="608.53" y="479.5" ></text>
</g>
<g >
<title>tick_nohz_next_event (2,867,390 samples, 0.02%)</title><rect x="1178.8" y="421" width="0.3" height="15.0" fill="rgb(211,31,7)" rx="2" ry="2" />
<text x="1181.81" y="431.5" ></text>
</g>
<g >
<title>tick_sched_timer (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="341" width="0.2" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1123.49" y="351.5" ></text>
</g>
<g >
<title>blk_mq_run_hw_queues (64,385,252 samples, 0.53%)</title><rect x="16.5" y="469" width="6.2" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="19.48" y="479.5" ></text>
</g>
<g >
<title>rw_verify_area (20,877,329 samples, 0.17%)</title><rect x="439.2" y="485" width="2.1" height="15.0" fill="rgb(218,64,15)" rx="2" ry="2" />
<text x="442.24" y="495.5" ></text>
</g>
<g >
<title>ExecInterpExpr (110,754,183 samples, 0.91%)</title><rect x="152.0" y="517" width="10.7" height="15.0" fill="rgb(225,96,22)" rx="2" ry="2" />
<text x="154.98" y="527.5" ></text>
</g>
<g >
<title>ResourceOwnerForget (18,910,863 samples, 0.15%)</title><rect x="148.4" y="533" width="1.8" height="15.0" fill="rgb(235,142,33)" rx="2" ry="2" />
<text x="151.37" y="543.5" ></text>
</g>
<g >
<title>printtup_startup (46,167,764 samples, 0.38%)</title><rect x="419.2" y="533" width="4.4" height="15.0" fill="rgb(239,156,37)" rx="2" ry="2" />
<text x="422.16" y="543.5" ></text>
</g>
<g >
<title>cmd_record (296,475,790 samples, 2.43%)</title><rect x="52.6" y="501" width="28.7" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="55.64" y="511.5" >cm..</text>
</g>
<g >
<title>internal_putbytes (49,658,337 samples, 0.41%)</title><rect x="1015.5" y="549" width="4.8" height="15.0" fill="rgb(249,206,49)" rx="2" ry="2" />
<text x="1018.49" y="559.5" ></text>
</g>
<g >
<title>copy_mm (63,281,290 samples, 0.52%)</title><rect x="429.4" y="437" width="6.1" height="15.0" fill="rgb(229,114,27)" rx="2" ry="2" />
<text x="432.37" y="447.5" ></text>
</g>
<g >
<title>check_cpu_stall (1,119,473 samples, 0.01%)</title><rect x="62.3" y="149" width="0.1" height="15.0" fill="rgb(208,16,3)" rx="2" ry="2" />
<text x="65.25" y="159.5" ></text>
</g>
<g >
<title>sched_clock_cpu (1,411,083 samples, 0.01%)</title><rect x="1181.8" y="453" width="0.1" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="1184.79" y="463.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="357" width="7.1" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="1185.46" y="367.5" ></text>
</g>
<g >
<title>perf_event_ctx_lock_nested.constprop.0 (5,565,918 samples, 0.05%)</title><rect x="67.6" y="389" width="0.6" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="70.62" y="399.5" ></text>
</g>
<g >
<title>__calc_delta (1,794,773 samples, 0.01%)</title><rect x="1122.6" y="229" width="0.2" height="15.0" fill="rgb(247,193,46)" rx="2" ry="2" />
<text x="1125.60" y="239.5" ></text>
</g>
<g >
<title>__update_load_avg_cfs_rq (1,544,821 samples, 0.01%)</title><rect x="1180.6" y="357" width="0.1" height="15.0" fill="rgb(228,107,25)" rx="2" ry="2" />
<text x="1183.56" y="367.5" ></text>
</g>
<g >
<title>load_balance (19,744,926 samples, 0.16%)</title><rect x="403.3" y="405" width="1.9" height="15.0" fill="rgb(226,96,23)" rx="2" ry="2" />
<text x="406.30" y="415.5" ></text>
</g>
<g >
<title>blk_mq_get_tag (22,302,984 samples, 0.18%)</title><rect x="589.6" y="213" width="2.2" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="592.62" y="223.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (74,457,327 samples, 0.61%)</title><rect x="74.1" y="421" width="7.2" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="77.09" y="431.5" ></text>
</g>
<g >
<title>dequeue_entity (1,794,773 samples, 0.01%)</title><rect x="1122.6" y="245" width="0.2" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="1125.60" y="255.5" ></text>
</g>
<g >
<title>__handle_irq_event_percpu (22,118,264 samples, 0.18%)</title><rect x="1018.2" y="453" width="2.1" height="15.0" fill="rgb(236,147,35)" rx="2" ry="2" />
<text x="1021.15" y="463.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="421" width="0.2" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="1123.49" y="431.5" ></text>
</g>
<g >
<title>sched_clock (1,393,531 samples, 0.01%)</title><rect x="1178.4" y="421" width="0.1" height="15.0" fill="rgb(212,33,7)" rx="2" ry="2" />
<text x="1181.35" y="431.5" ></text>
</g>
<g >
<title>do_poll.constprop.0 (1,403,893 samples, 0.01%)</title><rect x="1121.6" y="373" width="0.1" height="15.0" fill="rgb(236,145,34)" rx="2" ry="2" />
<text x="1124.56" y="383.5" ></text>
</g>
<g >
<title>read_tsc (1,211,724 samples, 0.01%)</title><rect x="1189.9" y="325" width="0.1" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="1192.88" y="335.5" ></text>
</g>
<g >
<title>__evlist__disable.constprop.0 (64,324,564 samples, 0.53%)</title><rect x="52.6" y="485" width="6.3" height="15.0" fill="rgb(234,136,32)" rx="2" ry="2" />
<text x="55.64" y="495.5" ></text>
</g>
<g >
<title>ret_from_fork (26,186,933 samples, 0.21%)</title><rect x="1127.6" y="549" width="2.5" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="1130.60" y="559.5" ></text>
</g>
<g >
<title>sock_poll (1,870,602 samples, 0.02%)</title><rect x="1123.1" y="341" width="0.1" height="15.0" fill="rgb(223,84,20)" rx="2" ry="2" />
<text x="1126.06" y="351.5" ></text>
</g>
<g >
<title>event_function_call (81,632,253 samples, 0.67%)</title><rect x="59.7" y="341" width="7.9" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="62.73" y="351.5" ></text>
</g>
<g >
<title>get_page_from_freelist (1,500,491 samples, 0.01%)</title><rect x="1116.7" y="261" width="0.1" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="1119.66" y="271.5" ></text>
</g>
<g >
<title>balance_fair (12,456,426 samples, 0.10%)</title><rect x="70.3" y="245" width="1.2" height="15.0" fill="rgb(213,37,8)" rx="2" ry="2" />
<text x="73.27" y="255.5" ></text>
</g>
<g >
<title>free_compound_page (17,226,814 samples, 0.14%)</title><rect x="1117.5" y="245" width="1.7" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1120.49" y="255.5" ></text>
</g>
<g >
<title>asm_exc_page_fault (1,716,767 samples, 0.01%)</title><rect x="1120.3" y="421" width="0.2" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="1123.33" y="431.5" ></text>
</g>
<g >
<title>native_sched_clock (1,393,531 samples, 0.01%)</title><rect x="1178.4" y="405" width="0.1" height="15.0" fill="rgb(218,64,15)" rx="2" ry="2" />
<text x="1181.35" y="415.5" ></text>
</g>
<g >
<title>tick_program_event (3,175,974 samples, 0.03%)</title><rect x="1179.3" y="405" width="0.3" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="1182.32" y="415.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (161,526,203 samples, 1.32%)</title><rect x="1132.2" y="437" width="15.6" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="1135.20" y="447.5" ></text>
</g>
<g >
<title>ata_scsi_translate (20,981,727 samples, 0.17%)</title><rect x="581.3" y="69" width="2.1" height="15.0" fill="rgb(238,152,36)" rx="2" ry="2" />
<text x="584.34" y="79.5" ></text>
</g>
<g >
<title>menu_select (4,243,293 samples, 0.03%)</title><rect x="1178.7" y="453" width="0.4" height="15.0" fill="rgb(242,172,41)" rx="2" ry="2" />
<text x="1181.68" y="463.5" ></text>
</g>
<g >
<title>clockevents_program_event (3,175,974 samples, 0.03%)</title><rect x="1179.3" y="389" width="0.3" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="1182.32" y="399.5" ></text>
</g>
<g >
<title>sync_regs (1,728,805 samples, 0.01%)</title><rect x="1127.2" y="533" width="0.2" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="1130.25" y="543.5" ></text>
</g>
<g >
<title>exit_to_user_mode_prepare (1,973,246 samples, 0.02%)</title><rect x="68.3" y="405" width="0.2" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="71.27" y="415.5" ></text>
</g>
<g >
<title>add_to_page_cache_lru (734,763,487 samples, 6.01%)</title><rect x="506.5" y="341" width="71.0" height="15.0" fill="rgb(236,142,34)" rx="2" ry="2" />
<text x="509.52" y="351.5" >add_to_p..</text>
</g>
<g >
<title>page_counter_uncharge (9,283,557 samples, 0.08%)</title><rect x="1117.9" y="197" width="0.9" height="15.0" fill="rgb(248,199,47)" rx="2" ry="2" />
<text x="1120.85" y="207.5" ></text>
</g>
<g >
<title>scsi_queue_rq (41,329,388 samples, 0.34%)</title><rect x="581.3" y="133" width="4.0" height="15.0" fill="rgb(219,67,16)" rx="2" ry="2" />
<text x="584.34" y="143.5" ></text>
</g>
<g >
<title>run_timer_softirq (84,901,418 samples, 0.70%)</title><rect x="1139.6" y="373" width="8.2" height="15.0" fill="rgb(249,206,49)" rx="2" ry="2" />
<text x="1142.60" y="383.5" ></text>
</g>
<g >
<title>affinity__set (64,324,564 samples, 0.53%)</title><rect x="52.6" y="453" width="6.3" height="15.0" fill="rgb(248,201,48)" rx="2" ry="2" />
<text x="55.64" y="463.5" ></text>
</g>
<g >
<title>_nohz_idle_balance.constprop.0.isra.0 (71,808,717 samples, 0.59%)</title><rect x="1132.6" y="357" width="7.0" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1135.63" y="367.5" ></text>
</g>
<g >
<title>tick_nohz_idle_enter (1,411,083 samples, 0.01%)</title><rect x="1181.8" y="485" width="0.1" height="15.0" fill="rgb(250,211,50)" rx="2" ry="2" />
<text x="1184.79" y="495.5" ></text>
</g>
<g >
<title>perf_adjust_freq_unthr_context (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="261" width="0.2" height="15.0" fill="rgb(236,146,35)" rx="2" ry="2" />
<text x="1123.49" y="271.5" ></text>
</g>
<g >
<title>do_poll.constprop.0 (12,199,312 samples, 0.10%)</title><rect x="1122.1" y="357" width="1.1" height="15.0" fill="rgb(236,145,34)" rx="2" ry="2" />
<text x="1125.06" y="367.5" ></text>
</g>
<g >
<title>__hrtimer_start_range_ns (5,588,592 samples, 0.05%)</title><rect x="1179.1" y="421" width="0.5" height="15.0" fill="rgb(216,53,12)" rx="2" ry="2" />
<text x="1182.09" y="431.5" ></text>
</g>
<g >
<title>__alloc_pages (1,788,381 samples, 0.01%)</title><rect x="1126.5" y="405" width="0.1" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="1129.47" y="415.5" ></text>
</g>
<g >
<title>__pagevec_lru_add (1,369,283 samples, 0.01%)</title><rect x="1126.6" y="389" width="0.2" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="1129.64" y="399.5" ></text>
</g>
<g >
<title>run_builtin (296,475,790 samples, 2.43%)</title><rect x="52.6" y="517" width="28.7" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="55.64" y="527.5" >ru..</text>
</g>
<g >
<title>ahci_handle_port_interrupt (22,118,264 samples, 0.18%)</title><rect x="1018.2" y="405" width="2.1" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="1021.15" y="415.5" ></text>
</g>
<g >
<title>hrtimer_start_range_ns (5,588,592 samples, 0.05%)</title><rect x="1179.1" y="437" width="0.5" height="15.0" fill="rgb(244,179,42)" rx="2" ry="2" />
<text x="1182.09" y="447.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (63,281,290 samples, 0.52%)</title><rect x="429.4" y="533" width="6.1" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="432.37" y="543.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (1,768,336 samples, 0.01%)</title><rect x="1132.2" y="373" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="1135.20" y="383.5" ></text>
</g>
<g >
<title>kworker/u290:2 (74,661,841 samples, 0.61%)</title><rect x="38.3" y="565" width="7.2" height="15.0" fill="rgb(211,30,7)" rx="2" ry="2" />
<text x="41.33" y="575.5" ></text>
</g>
<g >
<title>page_counter_uncharge (2,132,227 samples, 0.02%)</title><rect x="1117.6" y="213" width="0.3" height="15.0" fill="rgb(248,199,47)" rx="2" ry="2" />
<text x="1120.65" y="223.5" ></text>
</g>
<g >
<title>AllocSetReset (83,312,419 samples, 0.68%)</title><rect x="136.6" y="533" width="8.1" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="139.65" y="543.5" ></text>
</g>
<g >
<title>process_one_work (74,392,349 samples, 0.61%)</title><rect x="22.7" y="501" width="7.2" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="25.70" y="511.5" ></text>
</g>
<g >
<title>sched_setaffinity (45,451,795 samples, 0.37%)</title><rect x="69.5" y="389" width="4.4" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="72.47" y="399.5" ></text>
</g>
<g >
<title>remote_function (53,650,833 samples, 0.44%)</title><rect x="62.4" y="293" width="5.1" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="65.36" y="303.5" ></text>
</g>
<g >
<title>kernel_clone (63,281,290 samples, 0.52%)</title><rect x="429.4" y="469" width="6.1" height="15.0" fill="rgb(241,169,40)" rx="2" ry="2" />
<text x="432.37" y="479.5" ></text>
</g>
<g >
<title>enqueue_hrtimer (2,229,102 samples, 0.02%)</title><rect x="1179.1" y="405" width="0.2" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="1182.09" y="415.5" ></text>
</g>
<g >
<title>tick_nohz_idle_exit (1,211,724 samples, 0.01%)</title><rect x="1189.9" y="421" width="0.1" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="1192.88" y="431.5" ></text>
</g>
<g >
<title>pg_ulltoa_n.part.0 (78,388,976 samples, 0.64%)</title><rect x="1022.0" y="549" width="7.5" height="15.0" fill="rgb(252,217,52)" rx="2" ry="2" />
<text x="1024.95" y="559.5" ></text>
</g>
<g >
<title>unix_stream_sendmsg (39,667,816 samples, 0.32%)</title><rect x="605.5" y="453" width="3.9" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="608.53" y="463.5" ></text>
</g>
<g >
<title>place_entity (1,345,202 samples, 0.01%)</title><rect x="1180.7" y="389" width="0.1" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="1183.71" y="399.5" ></text>
</g>
<g >
<title>rcu_sched_clock_irq (1,119,473 samples, 0.01%)</title><rect x="62.3" y="165" width="0.1" height="15.0" fill="rgb(208,15,3)" rx="2" ry="2" />
<text x="65.25" y="175.5" ></text>
</g>
<g >
<title>__free_one_page (4,157,215 samples, 0.03%)</title><rect x="1118.8" y="181" width="0.4" height="15.0" fill="rgb(210,25,6)" rx="2" ry="2" />
<text x="1121.75" y="191.5" ></text>
</g>
<g >
<title>AllocSetAlloc (66,407,459 samples, 0.54%)</title><rect x="130.2" y="533" width="6.4" height="15.0" fill="rgb(231,124,29)" rx="2" ry="2" />
<text x="133.23" y="543.5" ></text>
</g>
<g >
<title>mwait_idle_with_hints.constprop.0 (311,153,852 samples, 2.55%)</title><rect x="1148.2" y="437" width="30.0" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="1151.16" y="447.5" >mw..</text>
</g>
<g >
<title>delayed_work_timer_fn (42,858,843 samples, 0.35%)</title><rect x="1143.7" y="325" width="4.1" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="1146.66" y="335.5" ></text>
</g>
<g >
<title>ret_from_fork (64,385,252 samples, 0.53%)</title><rect x="16.5" y="549" width="6.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="19.48" y="559.5" ></text>
</g>
<g >
<title>wb_workfn (74,661,841 samples, 0.61%)</title><rect x="38.3" y="485" width="7.2" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="41.33" y="495.5" ></text>
</g>
<g >
<title>sock_recvmsg (30,907,983 samples, 0.25%)</title><rect x="1117.3" y="341" width="3.0" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1120.34" y="351.5" ></text>
</g>
<g >
<title>__pollwait (82,051,342 samples, 0.67%)</title><rect x="81.3" y="421" width="7.9" height="15.0" fill="rgb(227,102,24)" rx="2" ry="2" />
<text x="84.28" y="431.5" ></text>
</g>
<g >
<title>_perf_ioctl (85,917,224 samples, 0.70%)</title><rect x="59.3" y="389" width="8.3" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="62.32" y="399.5" ></text>
</g>
<g >
<title>process_one_work (87,484,032 samples, 0.72%)</title><rect x="29.9" y="501" width="8.4" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="32.88" y="511.5" ></text>
</g>
<g >
<title>enqueue_entity (4,931,790 samples, 0.04%)</title><rect x="1180.2" y="389" width="0.5" height="15.0" fill="rgb(218,62,15)" rx="2" ry="2" />
<text x="1183.23" y="399.5" ></text>
</g>
<g >
<title>rest_init (78,039,624 samples, 0.64%)</title><rect x="1182.5" y="469" width="7.5" height="15.0" fill="rgb(252,217,51)" rx="2" ry="2" />
<text x="1185.46" y="479.5" ></text>
</g>
<g >
<title>__slab_free (2,331,967 samples, 0.02%)</title><rect x="1119.2" y="245" width="0.2" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="1122.15" y="255.5" ></text>
</g>
<g >
<title>try_charge_memcg (3,758,047 samples, 0.03%)</title><rect x="1125.7" y="389" width="0.4" height="15.0" fill="rgb(210,27,6)" rx="2" ry="2" />
<text x="1128.74" y="399.5" ></text>
</g>
<g >
<title>lru_cache_add_inactive_or_unevictable (1,369,283 samples, 0.01%)</title><rect x="1126.6" y="421" width="0.2" height="15.0" fill="rgb(247,196,47)" rx="2" ry="2" />
<text x="1129.64" y="431.5" ></text>
</g>
<g >
<title>copy_user_enhanced_fast_string (5,734,510 samples, 0.05%)</title><rect x="1119.8" y="229" width="0.5" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="1122.77" y="239.5" ></text>
</g>
<g >
<title>psi_task_change (1,567,925 samples, 0.01%)</title><rect x="1180.8" y="405" width="0.2" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="1183.84" y="415.5" ></text>
</g>
<g >
<title>psi_group_change (25,681,033 samples, 0.21%)</title><rect x="1127.6" y="453" width="2.5" height="15.0" fill="rgb(226,101,24)" rx="2" ry="2" />
<text x="1130.64" y="463.5" ></text>
</g>
<g >
<title>mutex_unlock (2,269,841 samples, 0.02%)</title><rect x="59.0" y="405" width="0.2" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="61.99" y="415.5" ></text>
</g>
<g >
<title>vmacache_find (2,552,013 samples, 0.02%)</title><rect x="1116.0" y="341" width="0.3" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="1119.05" y="351.5" ></text>
</g>
<g >
<title>cpuidle_idle_call (516,584,157 samples, 4.23%)</title><rect x="1130.1" y="485" width="49.9" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="1133.13" y="495.5" >cpuid..</text>
</g>
<g >
<title>leave_mm (3,723,054 samples, 0.03%)</title><rect x="1147.8" y="437" width="0.4" height="15.0" fill="rgb(250,207,49)" rx="2" ry="2" />
<text x="1150.80" y="447.5" ></text>
</g>
<g >
<title>postgres (10,147,595,545 samples, 83.07%)</title><rect x="117.6" y="565" width="980.2" height="15.0" fill="rgb(233,131,31)" rx="2" ry="2" />
<text x="120.56" y="575.5" >postgres</text>
</g>
<g >
<title>x86_pmu_enable (14,465,449 samples, 0.12%)</title><rect x="66.1" y="229" width="1.4" height="15.0" fill="rgb(244,179,43)" rx="2" ry="2" />
<text x="69.15" y="239.5" ></text>
</g>
<g >
<title>perf_evlist__poll_thread (217,655,204 samples, 1.78%)</title><rect x="96.5" y="533" width="21.1" height="15.0" fill="rgb(236,142,34)" rx="2" ry="2" />
<text x="99.53" y="543.5" ></text>
</g>
<g >
<title>wb_writeback (74,661,841 samples, 0.61%)</title><rect x="38.3" y="453" width="7.2" height="15.0" fill="rgb(222,80,19)" rx="2" ry="2" />
<text x="41.33" y="463.5" ></text>
</g>
<g >
<title>zap_pte_range (73,446,926 samples, 0.60%)</title><rect x="45.5" y="261" width="7.1" height="15.0" fill="rgb(231,120,28)" rx="2" ry="2" />
<text x="48.55" y="271.5" ></text>
</g>
<g >
<title>blk_mq_sched_dispatch_requests (81,184,559 samples, 0.66%)</title><rect x="577.5" y="213" width="7.8" height="15.0" fill="rgb(209,22,5)" rx="2" ry="2" />
<text x="580.49" y="223.5" ></text>
</g>
<g >
<title>unix_stream_recvmsg (30,907,983 samples, 0.25%)</title><rect x="1117.3" y="325" width="3.0" height="15.0" fill="rgb(251,213,51)" rx="2" ry="2" />
<text x="1120.34" y="335.5" ></text>
</g>
<g >
<title>mpage_prepare_extent_to_map (74,661,841 samples, 0.61%)</title><rect x="38.3" y="357" width="7.2" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="41.33" y="367.5" ></text>
</g>
<g >
<title>native_write_msr (8,698,892 samples, 0.07%)</title><rect x="66.7" y="197" width="0.8" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="69.71" y="207.5" ></text>
</g>
<g >
<title>find_busiest_group (19,744,926 samples, 0.16%)</title><rect x="403.3" y="389" width="1.9" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="406.30" y="399.5" ></text>
</g>
<g >
<title>do_sys_poll (15,971,698 samples, 0.13%)</title><rect x="1121.7" y="373" width="1.5" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="1124.69" y="383.5" ></text>
</g>
<g >
<title>psi_task_switch (1,591,709 samples, 0.01%)</title><rect x="1122.9" y="277" width="0.2" height="15.0" fill="rgb(230,118,28)" rx="2" ry="2" />
<text x="1125.90" y="287.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (82,051,342 samples, 0.67%)</title><rect x="81.3" y="517" width="7.9" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="84.28" y="527.5" ></text>
</g>
<g >
<title>exc_page_fault (1,860,090 samples, 0.02%)</title><rect x="1127.4" y="517" width="0.2" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="1130.42" y="527.5" ></text>
</g>
<g >
<title>mem_cgroup_charge_statistics.constprop.0 (22,409,228 samples, 0.18%)</title><rect x="559.5" y="277" width="2.2" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="562.54" y="287.5" ></text>
</g>
<g >
<title>do_execveat_common.isra.0 (73,446,926 samples, 0.60%)</title><rect x="45.5" y="485" width="7.1" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="48.55" y="495.5" ></text>
</g>
<g >
<title>bprm_execve.part.0 (73,446,926 samples, 0.60%)</title><rect x="45.5" y="453" width="7.1" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="48.55" y="463.5" ></text>
</g>
<g >
<title>submit_bio (74,661,841 samples, 0.61%)</title><rect x="38.3" y="293" width="7.2" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="41.33" y="303.5" ></text>
</g>
<g >
<title>simple_copy_to_iter (5,734,510 samples, 0.05%)</title><rect x="1119.8" y="245" width="0.5" height="15.0" fill="rgb(228,110,26)" rx="2" ry="2" />
<text x="1122.77" y="255.5" ></text>
</g>
<g >
<title>__x64_sys_poll (17,375,591 samples, 0.14%)</title><rect x="1121.6" y="389" width="1.6" height="15.0" fill="rgb(245,186,44)" rx="2" ry="2" />
<text x="1124.56" y="399.5" ></text>
</g>
<g >
<title>ksys_read (67,058,175 samples, 0.55%)</title><rect x="10.0" y="469" width="6.5" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="13.00" y="479.5" ></text>
</g>
<g >
<title>__schedule (37,186,050 samples, 0.30%)</title><rect x="70.3" y="277" width="3.6" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="73.27" y="287.5" ></text>
</g>
<g >
<title>ext4_readahead (66,651,781 samples, 0.55%)</title><rect x="585.3" y="325" width="6.5" height="15.0" fill="rgb(252,216,51)" rx="2" ry="2" />
<text x="588.33" y="335.5" ></text>
</g>
<g >
<title>handle_pte_fault (17,088,464 samples, 0.14%)</title><rect x="1125.3" y="453" width="1.6" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="1128.26" y="463.5" ></text>
</g>
<g >
<title>do_sys_poll (82,051,342 samples, 0.67%)</title><rect x="81.3" y="469" width="7.9" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="84.28" y="479.5" ></text>
</g>
<g >
<title>ext4_writepages (74,661,841 samples, 0.61%)</title><rect x="38.3" y="373" width="7.2" height="15.0" fill="rgb(240,165,39)" rx="2" ry="2" />
<text x="41.33" y="383.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (73,446,926 samples, 0.60%)</title><rect x="45.5" y="533" width="7.1" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="48.55" y="543.5" ></text>
</g>
<g >
<title>anon_vma_interval_tree_insert (63,281,290 samples, 0.52%)</title><rect x="429.4" y="373" width="6.1" height="15.0" fill="rgb(226,99,23)" rx="2" ry="2" />
<text x="432.37" y="383.5" ></text>
</g>
<g >
<title>schedule_hrtimeout_range_clock (4,707,790 samples, 0.04%)</title><rect x="1122.6" y="325" width="0.5" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="1125.60" y="335.5" ></text>
</g>
<g >
<title>palloc (17,212,353 samples, 0.14%)</title><rect x="1020.3" y="549" width="1.7" height="15.0" fill="rgb(211,29,7)" rx="2" ry="2" />
<text x="1023.29" y="559.5" ></text>
</g>
<g >
<title>__x64_sys_execve (73,446,926 samples, 0.60%)</title><rect x="45.5" y="501" width="7.1" height="15.0" fill="rgb(247,193,46)" rx="2" ry="2" />
<text x="48.55" y="511.5" ></text>
</g>
<g >
<title>LockBufHdr (19,917,325 samples, 0.16%)</title><rect x="126.3" y="549" width="2.0" height="15.0" fill="rgb(236,143,34)" rx="2" ry="2" />
<text x="129.34" y="559.5" ></text>
</g>
<g >
<title>lru_cache_add (1,369,283 samples, 0.01%)</title><rect x="1126.6" y="405" width="0.2" height="15.0" fill="rgb(247,196,47)" rx="2" ry="2" />
<text x="1129.64" y="415.5" ></text>
</g>
<g >
<title>hrtimer_start_range_ns (2,126,169 samples, 0.02%)</title><rect x="1182.3" y="453" width="0.2" height="15.0" fill="rgb(244,179,42)" rx="2" ry="2" />
<text x="1185.26" y="463.5" ></text>
</g>
<g >
<title>ata_qc_complete_multiple (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="293" width="1.9" height="15.0" fill="rgb(205,1,0)" rx="2" ry="2" />
<text x="1133.35" y="303.5" ></text>
</g>
<g >
<title>__libc_recv (1,547,083 samples, 0.01%)</title><rect x="1097.8" y="533" width="0.1" height="15.0" fill="rgb(233,129,31)" rx="2" ry="2" />
<text x="1100.78" y="543.5" ></text>
</g>
<g >
<title>__softirqentry_text_start (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="485" width="2.1" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1016.44" y="495.5" ></text>
</g>
<g >
<title>__mod_lruvec_page_state (1,409,510 samples, 0.01%)</title><rect x="1126.8" y="405" width="0.1" height="15.0" fill="rgb(245,185,44)" rx="2" ry="2" />
<text x="1129.77" y="415.5" ></text>
</g>
<g >
<title>__pagevec_lru_add_fn (1,369,283 samples, 0.01%)</title><rect x="1126.6" y="373" width="0.2" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="1129.64" y="383.5" ></text>
</g>
<g >
<title>free_unref_page (4,157,215 samples, 0.03%)</title><rect x="1118.8" y="229" width="0.4" height="15.0" fill="rgb(253,224,53)" rx="2" ry="2" />
<text x="1121.75" y="239.5" ></text>
</g>
<g >
<title>__ata_qc_complete (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="261" width="1.9" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1133.35" y="271.5" ></text>
</g>
<g >
<title>scsi_init_command (20,347,661 samples, 0.17%)</title><rect x="583.4" y="101" width="1.9" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="586.37" y="111.5" ></text>
</g>
<g >
<title>irq_exit_rcu (158,193,906 samples, 1.29%)</title><rect x="1132.5" y="405" width="15.3" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1135.52" y="415.5" ></text>
</g>
<g >
<title>do_syscall_64 (98,687,853 samples, 0.81%)</title><rect x="58.9" y="437" width="9.6" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="61.93" y="447.5" ></text>
</g>
<g >
<title>ext4_bio_write_page (74,661,841 samples, 0.61%)</title><rect x="38.3" y="309" width="7.2" height="15.0" fill="rgb(228,105,25)" rx="2" ry="2" />
<text x="41.33" y="319.5" ></text>
</g>
<g >
<title>stop_one_cpu_nowait (8,265,745 samples, 0.07%)</title><rect x="69.5" y="325" width="0.8" height="15.0" fill="rgb(240,161,38)" rx="2" ry="2" />
<text x="72.47" y="335.5" ></text>
</g>
<g >
<title>security_file_permission (40,015,512 samples, 0.33%)</title><rect x="593.7" y="453" width="3.9" height="15.0" fill="rgb(225,96,23)" rx="2" ry="2" />
<text x="596.73" y="463.5" ></text>
</g>
<g >
<title>native_write_msr (3,175,974 samples, 0.03%)</title><rect x="1179.3" y="373" width="0.3" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="1182.32" y="383.5" ></text>
</g>
<g >
<title>__x64_sys_recvfrom (33,278,762 samples, 0.27%)</title><rect x="1117.1" y="373" width="3.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1120.11" y="383.5" ></text>
</g>
<g >
<title>record__mmap_read_evlist.constprop.0 (75,847,424 samples, 0.62%)</title><rect x="89.2" y="533" width="7.3" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="92.20" y="543.5" ></text>
</g>
<g >
<title>poll_freewait (67,377,381 samples, 0.55%)</title><rect x="104.7" y="437" width="6.5" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="107.71" y="447.5" ></text>
</g>
<g >
<title>rcu_idle_exit (1,375,369 samples, 0.01%)</title><rect x="1189.8" y="389" width="0.1" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="1192.75" y="399.5" ></text>
</g>
<g >
<title>scsi_complete (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="437" width="2.1" height="15.0" fill="rgb(210,23,5)" rx="2" ry="2" />
<text x="1016.44" y="447.5" ></text>
</g>
<g >
<title>irq_enter_rcu (1,563,961 samples, 0.01%)</title><rect x="1132.4" y="405" width="0.1" height="15.0" fill="rgb(208,15,3)" rx="2" ry="2" />
<text x="1135.37" y="415.5" ></text>
</g>
<g >
<title>ctx_resched (25,908,748 samples, 0.21%)</title><rect x="65.0" y="245" width="2.5" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="68.04" y="255.5" ></text>
</g>
<g >
<title>tick_sched_timer (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="293" width="7.1" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1185.46" y="303.5" ></text>
</g>
<g >
<title>seq_read_iter (67,058,175 samples, 0.55%)</title><rect x="10.0" y="405" width="6.5" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="13.00" y="415.5" ></text>
</g>
<g >
<title>ahci_single_level_irq_intr (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="341" width="1.9" height="15.0" fill="rgb(213,38,9)" rx="2" ry="2" />
<text x="1133.35" y="351.5" ></text>
</g>
<g >
<title>bprm_execve (73,446,926 samples, 0.60%)</title><rect x="45.5" y="469" width="7.1" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="48.55" y="479.5" ></text>
</g>
<g >
<title>asm_exc_page_fault (28,280,564 samples, 0.23%)</title><rect x="1124.5" y="533" width="2.7" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="1127.52" y="543.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (1,119,473 samples, 0.01%)</title><rect x="62.3" y="245" width="0.1" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="65.25" y="255.5" ></text>
</g>
<g >
<title>cmd_record (157,898,766 samples, 1.29%)</title><rect x="81.3" y="549" width="15.2" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="84.28" y="559.5" ></text>
</g>
<g >
<title>__add_to_page_cache_locked (85,577,876 samples, 0.70%)</title><rect x="476.7" y="341" width="8.2" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="479.65" y="351.5" ></text>
</g>
<g >
<title>remove_wait_queue (67,377,381 samples, 0.55%)</title><rect x="104.7" y="421" width="6.5" height="15.0" fill="rgb(241,169,40)" rx="2" ry="2" />
<text x="107.71" y="431.5" ></text>
</g>
<g >
<title>__poll (82,051,342 samples, 0.67%)</title><rect x="81.3" y="533" width="7.9" height="15.0" fill="rgb(250,208,49)" rx="2" ry="2" />
<text x="84.28" y="543.5" ></text>
</g>
<g >
<title>smp_call_function_single (75,599,353 samples, 0.62%)</title><rect x="60.3" y="325" width="7.3" height="15.0" fill="rgb(223,83,19)" rx="2" ry="2" />
<text x="63.31" y="335.5" ></text>
</g>
<g >
<title>worker_thread (87,484,032 samples, 0.72%)</title><rect x="29.9" y="517" width="8.4" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="32.88" y="527.5" ></text>
</g>
<g >
<title>alloc_skb_with_frags (39,667,816 samples, 0.32%)</title><rect x="605.5" y="421" width="3.9" height="15.0" fill="rgb(228,107,25)" rx="2" ry="2" />
<text x="608.53" y="431.5" ></text>
</g>
<g >
<title>record__pushfn (74,457,327 samples, 0.61%)</title><rect x="74.1" y="453" width="7.2" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="77.09" y="463.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="405" width="0.2" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="1123.49" y="415.5" ></text>
</g>
<g >
<title>kthread (74,392,349 samples, 0.61%)</title><rect x="22.7" y="533" width="7.2" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="25.70" y="543.5" ></text>
</g>
<g >
<title>do_idle (541,804,420 samples, 4.44%)</title><rect x="1130.1" y="501" width="52.4" height="15.0" fill="rgb(212,34,8)" rx="2" ry="2" />
<text x="1133.13" y="511.5" >do_idle</text>
</g>
<g >
<title>proc_reg_read_iter (67,058,175 samples, 0.55%)</title><rect x="10.0" y="421" width="6.5" height="15.0" fill="rgb(223,86,20)" rx="2" ry="2" />
<text x="13.00" y="431.5" ></text>
</g>
<g >
<title>__blk_mq_end_request (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="373" width="2.1" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="1016.44" y="383.5" ></text>
</g>
<g >
<title>update_process_times (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="261" width="7.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1185.46" y="271.5" ></text>
</g>
<g >
<title>alloc_pages (39,667,816 samples, 0.32%)</title><rect x="605.5" y="405" width="3.9" height="15.0" fill="rgb(210,25,6)" rx="2" ry="2" />
<text x="608.53" y="415.5" ></text>
</g>
<g >
<title>clockevents_program_event (1,211,724 samples, 0.01%)</title><rect x="1189.9" y="341" width="0.1" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="1192.88" y="351.5" ></text>
</g>
<g >
<title>cpuidle_reflect (2,012,759 samples, 0.02%)</title><rect x="1178.5" y="469" width="0.2" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="1181.49" y="479.5" ></text>
</g>
<g >
<title>cpuidle_enter_state (75,452,531 samples, 0.62%)</title><rect x="1182.5" y="389" width="7.3" height="15.0" fill="rgb(221,73,17)" rx="2" ry="2" />
<text x="1185.46" y="399.5" ></text>
</g>
<g >
<title>__switch_to (2,113,978 samples, 0.02%)</title><rect x="1121.4" y="421" width="0.2" height="15.0" fill="rgb(205,2,0)" rx="2" ry="2" />
<text x="1124.35" y="431.5" ></text>
</g>
<g >
<title>asm_common_interrupt (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="533" width="2.1" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="1016.44" y="543.5" ></text>
</g>
<g >
<title>tick_nohz_get_sleep_length (2,867,390 samples, 0.02%)</title><rect x="1178.8" y="437" width="0.3" height="15.0" fill="rgb(229,112,26)" rx="2" ry="2" />
<text x="1181.81" y="447.5" ></text>
</g>
<g >
<title>blk_mq_complete_request (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="229" width="1.9" height="15.0" fill="rgb(210,27,6)" rx="2" ry="2" />
<text x="1133.35" y="239.5" ></text>
</g>
<g >
<title>asm_common_interrupt (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="437" width="1.9" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="1133.35" y="447.5" ></text>
</g>
<g >
<title>x86_64_start_reservations (78,039,624 samples, 0.64%)</title><rect x="1182.5" y="517" width="7.5" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="1185.46" y="527.5" ></text>
</g>
<g >
<title>quiet_vmstat (4,078,914 samples, 0.03%)</title><rect x="1179.6" y="437" width="0.4" height="15.0" fill="rgb(249,204,48)" rx="2" ry="2" />
<text x="1182.63" y="447.5" ></text>
</g>
<g >
<title>AllocSetAlloc (19,054,494 samples, 0.16%)</title><rect x="117.6" y="549" width="1.8" height="15.0" fill="rgb(231,124,29)" rx="2" ry="2" />
<text x="120.56" y="559.5" ></text>
</g>
<g >
<title>mmput (73,446,926 samples, 0.60%)</title><rect x="45.5" y="357" width="7.1" height="15.0" fill="rgb(226,99,23)" rx="2" ry="2" />
<text x="48.55" y="367.5" ></text>
</g>
<g >
<title>add_wait_queue (82,051,342 samples, 0.67%)</title><rect x="81.3" y="405" width="7.9" height="15.0" fill="rgb(248,198,47)" rx="2" ry="2" />
<text x="84.28" y="415.5" ></text>
</g>
<g >
<title>heapgetpage (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="549" width="2.1" height="15.0" fill="rgb(254,225,53)" rx="2" ry="2" />
<text x="1016.44" y="559.5" ></text>
</g>
<g >
<title>ata_scsi_queuecmd (20,981,727 samples, 0.17%)</title><rect x="581.3" y="101" width="2.1" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="584.34" y="111.5" ></text>
</g>
<g >
<title>zap_pmd_range.isra.0 (73,446,926 samples, 0.60%)</title><rect x="45.5" y="277" width="7.1" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="48.55" y="287.5" ></text>
</g>
<g >
<title>standard_ExecutorRun (40,428,180 samples, 0.33%)</title><rect x="423.6" y="533" width="3.9" height="15.0" fill="rgb(247,196,47)" rx="2" ry="2" />
<text x="426.62" y="543.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (1,119,473 samples, 0.01%)</title><rect x="62.3" y="229" width="0.1" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="65.25" y="239.5" ></text>
</g>
<g >
<title>__memmove_evex_unaligned_erms (59,843,630 samples, 0.49%)</title><rect x="599.7" y="549" width="5.8" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="602.74" y="559.5" ></text>
</g>
<g >
<title>worker_thread (74,392,349 samples, 0.61%)</title><rect x="22.7" y="517" width="7.2" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="25.70" y="527.5" ></text>
</g>
<g >
<title>__writeback_single_inode (74,661,841 samples, 0.61%)</title><rect x="38.3" y="405" width="7.2" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="41.33" y="415.5" ></text>
</g>
<g >
<title>schedule (4,707,790 samples, 0.04%)</title><rect x="1122.6" y="309" width="0.5" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="1125.60" y="319.5" ></text>
</g>
<g >
<title>__pagevec_lru_add_fn (19,097,493 samples, 0.16%)</title><rect x="575.6" y="293" width="1.9" height="15.0" fill="rgb(244,183,43)" rx="2" ry="2" />
<text x="578.65" y="303.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irqsave (82,051,342 samples, 0.67%)</title><rect x="81.3" y="389" width="7.9" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="84.28" y="399.5" ></text>
</g>
<g >
<title>__handle_mm_fault (5,289,518 samples, 0.04%)</title><rect x="1116.3" y="341" width="0.5" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="1119.29" y="351.5" ></text>
</g>
<g >
<title>skb_copy_datagram_iter (7,875,520 samples, 0.06%)</title><rect x="1119.6" y="277" width="0.7" height="15.0" fill="rgb(245,188,45)" rx="2" ry="2" />
<text x="1122.56" y="287.5" ></text>
</g>
<g >
<title>filemap_get_read_batch (40,693,412 samples, 0.33%)</title><rect x="472.7" y="389" width="4.0" height="15.0" fill="rgb(231,123,29)" rx="2" ry="2" />
<text x="475.72" y="399.5" ></text>
</g>
<g >
<title>__sched_setaffinity (45,451,795 samples, 0.37%)</title><rect x="69.5" y="373" width="4.4" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="72.47" y="383.5" ></text>
</g>
<g >
<title>__fget_light (20,149,182 samples, 0.16%)</title><rect x="437.3" y="485" width="1.9" height="15.0" fill="rgb(233,132,31)" rx="2" ry="2" />
<text x="440.29" y="495.5" ></text>
</g>
<g >
<title>load_elf_binary (73,446,926 samples, 0.60%)</title><rect x="45.5" y="405" width="7.1" height="15.0" fill="rgb(223,83,20)" rx="2" ry="2" />
<text x="48.55" y="415.5" ></text>
</g>
<g >
<title>__writeback_inodes_wb (74,661,841 samples, 0.61%)</title><rect x="38.3" y="437" width="7.2" height="15.0" fill="rgb(234,133,32)" rx="2" ry="2" />
<text x="41.33" y="447.5" ></text>
</g>
<g >
<title>cpuidle_enter_state (498,788,144 samples, 4.08%)</title><rect x="1130.3" y="453" width="48.2" height="15.0" fill="rgb(221,73,17)" rx="2" ry="2" />
<text x="1133.31" y="463.5" >cpui..</text>
</g>
<g >
<title>update_process_times (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="309" width="0.2" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="1123.49" y="319.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (1,119,473 samples, 0.01%)</title><rect x="62.3" y="277" width="0.1" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="65.25" y="287.5" ></text>
</g>
<g >
<title>handle_edge_irq (22,118,264 samples, 0.18%)</title><rect x="1018.2" y="485" width="2.1" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="1021.15" y="495.5" ></text>
</g>
<g >
<title>main (296,475,790 samples, 2.43%)</title><rect x="52.6" y="533" width="28.7" height="15.0" fill="rgb(243,179,42)" rx="2" ry="2" />
<text x="55.64" y="543.5" >main</text>
</g>
<g >
<title>__sysvec_apic_timer_interrupt (1,768,336 samples, 0.01%)</title><rect x="1132.2" y="405" width="0.2" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="1135.20" y="415.5" ></text>
</g>
<g >
<title>asm_common_interrupt (22,118,264 samples, 0.18%)</title><rect x="1018.2" y="533" width="2.1" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="1021.15" y="543.5" ></text>
</g>
<g >
<title>__rq_qos_throttle (74,661,841 samples, 0.61%)</title><rect x="38.3" y="229" width="7.2" height="15.0" fill="rgb(229,110,26)" rx="2" ry="2" />
<text x="41.33" y="239.5" ></text>
</g>
<g >
<title>prepare_task_switch (1,851,827 samples, 0.02%)</title><rect x="1181.6" y="453" width="0.2" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="1184.61" y="463.5" ></text>
</g>
<g >
<title>do_syscall_64 (67,058,175 samples, 0.55%)</title><rect x="10.0" y="501" width="6.5" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="13.00" y="511.5" ></text>
</g>
<g >
<title>update_load_avg (3,429,478 samples, 0.03%)</title><rect x="1180.4" y="373" width="0.3" height="15.0" fill="rgb(240,165,39)" rx="2" ry="2" />
<text x="1183.37" y="383.5" ></text>
</g>
<g >
<title>flush_smp_call_function_queue (9,953,266 samples, 0.08%)</title><rect x="1180.0" y="469" width="1.0" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="1183.03" y="479.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="373" width="7.1" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="1185.46" y="383.5" ></text>
</g>
<g >
<title>load_balance (71,808,717 samples, 0.59%)</title><rect x="1132.6" y="325" width="7.0" height="15.0" fill="rgb(226,96,23)" rx="2" ry="2" />
<text x="1135.63" y="335.5" ></text>
</g>
<g >
<title>do_user_addr_fault (1,860,090 samples, 0.02%)</title><rect x="1127.4" y="501" width="0.2" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="1130.42" y="511.5" ></text>
</g>
<g >
<title>free_unref_page_commit.constprop.0 (4,157,215 samples, 0.03%)</title><rect x="1118.8" y="213" width="0.4" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="1121.75" y="223.5" ></text>
</g>
<g >
<title>page_add_new_anon_rmap (1,409,510 samples, 0.01%)</title><rect x="1126.8" y="421" width="0.1" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="1129.77" y="431.5" ></text>
</g>
<g >
<title>__common_interrupt (22,118,264 samples, 0.18%)</title><rect x="1018.2" y="501" width="2.1" height="15.0" fill="rgb(237,150,36)" rx="2" ry="2" />
<text x="1021.15" y="511.5" ></text>
</g>
<g >
<title>uncharge_batch (9,283,557 samples, 0.08%)</title><rect x="1117.9" y="213" width="0.9" height="15.0" fill="rgb(220,72,17)" rx="2" ry="2" />
<text x="1120.85" y="223.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (99,426,929 samples, 0.81%)</title><rect x="58.9" y="453" width="9.6" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="61.85" y="463.5" ></text>
</g>
<g >
<title>kworker/u288:2- (74,392,349 samples, 0.61%)</title><rect x="22.7" y="565" width="7.2" height="15.0" fill="rgb(225,95,22)" rx="2" ry="2" />
<text x="25.70" y="575.5" ></text>
</g>
<g >
<title>do_user_addr_fault (7,841,531 samples, 0.06%)</title><rect x="1116.0" y="373" width="0.8" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="1119.05" y="383.5" ></text>
</g>
<g >
<title>run_rebalance_domains (72,206,882 samples, 0.59%)</title><rect x="1132.6" y="373" width="7.0" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="1135.63" y="383.5" ></text>
</g>
<g >
<title>__cond_resched (37,186,050 samples, 0.30%)</title><rect x="70.3" y="293" width="3.6" height="15.0" fill="rgb(217,58,14)" rx="2" ry="2" />
<text x="73.27" y="303.5" ></text>
</g>
<g >
<title>worker_thread (74,661,841 samples, 0.61%)</title><rect x="38.3" y="517" width="7.2" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="41.33" y="527.5" ></text>
</g>
<g >
<title>tick_nohz_idle_got_tick (2,012,759 samples, 0.02%)</title><rect x="1178.5" y="437" width="0.2" height="15.0" fill="rgb(248,199,47)" rx="2" ry="2" />
<text x="1181.49" y="447.5" ></text>
</g>
<g >
<title>_nohz_idle_balance.constprop.0.isra.0 (4,493,289 samples, 0.04%)</title><rect x="1181.0" y="469" width="0.4" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1183.99" y="479.5" ></text>
</g>
<g >
<title>common_interrupt (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="517" width="2.1" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="1016.44" y="527.5" ></text>
</g>
<g >
<title>__mem_cgroup_charge (45,121,120 samples, 0.37%)</title><rect x="559.5" y="309" width="4.4" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="562.54" y="319.5" ></text>
</g>
<g >
<title>_perf_event_enable (82,798,934 samples, 0.68%)</title><rect x="59.6" y="357" width="8.0" height="15.0" fill="rgb(251,212,50)" rx="2" ry="2" />
<text x="62.62" y="367.5" ></text>
</g>
<g >
<title>schedule_idle (3,800,951 samples, 0.03%)</title><rect x="1181.4" y="485" width="0.4" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="1184.42" y="495.5" ></text>
</g>
<g >
<title>irq_exit_rcu (19,744,926 samples, 0.16%)</title><rect x="403.3" y="469" width="1.9" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="406.30" y="479.5" ></text>
</g>
<g >
<title>get_page_from_freelist (223,601,796 samples, 1.83%)</title><rect x="484.9" y="293" width="21.6" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="487.92" y="303.5" >g..</text>
</g>
<g >
<title>[unknown] (67,058,175 samples, 0.55%)</title><rect x="10.0" y="549" width="6.5" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="13.00" y="559.5" ></text>
</g>
<g >
<title>lru_cache_add (62,987,950 samples, 0.52%)</title><rect x="571.4" y="325" width="6.1" height="15.0" fill="rgb(247,196,47)" rx="2" ry="2" />
<text x="574.41" y="335.5" ></text>
</g>
<g >
<title>blk_stat_timer_fn (42,042,575 samples, 0.34%)</title><rect x="1139.6" y="325" width="4.1" height="15.0" fill="rgb(242,171,40)" rx="2" ry="2" />
<text x="1142.60" y="335.5" ></text>
</g>
<g >
<title>__handle_mm_fault (1,860,090 samples, 0.02%)</title><rect x="1127.4" y="469" width="0.2" height="15.0" fill="rgb(207,9,2)" rx="2" ry="2" />
<text x="1130.42" y="479.5" ></text>
</g>
<g >
<title>kthread (74,661,841 samples, 0.61%)</title><rect x="38.3" y="533" width="7.2" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="41.33" y="543.5" ></text>
</g>
<g >
<title>[libpq.so.5.14] (110,085,985 samples, 0.90%)</title><rect x="1106.2" y="421" width="10.6" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="1109.17" y="431.5" ></text>
</g>
<g >
<title>fpregs_assert_state_consistent (2,225,939 samples, 0.02%)</title><rect x="1123.5" y="373" width="0.2" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="1126.47" y="383.5" ></text>
</g>
<g >
<title>swapper (619,844,044 samples, 5.07%)</title><rect x="1130.1" y="565" width="59.9" height="15.0" fill="rgb(239,158,37)" rx="2" ry="2" />
<text x="1133.13" y="575.5" >swapper</text>
</g>
<g >
<title>ret_from_fork (74,661,841 samples, 0.61%)</title><rect x="38.3" y="549" width="7.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="41.33" y="559.5" ></text>
</g>
<g >
<title>cgroup_rstat_flush_irqsafe (87,484,032 samples, 0.72%)</title><rect x="29.9" y="453" width="8.4" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="32.88" y="463.5" ></text>
</g>
<g >
<title>anon_vma_fork (63,281,290 samples, 0.52%)</title><rect x="429.4" y="405" width="6.1" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="432.37" y="415.5" ></text>
</g>
<g >
<title>tick_nohz_idle_stop_tick (9,667,506 samples, 0.08%)</title><rect x="1179.1" y="469" width="0.9" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="1182.09" y="479.5" ></text>
</g>
<g >
<title>blk_flush_plug_list (81,184,559 samples, 0.66%)</title><rect x="577.5" y="309" width="7.8" height="15.0" fill="rgb(212,36,8)" rx="2" ry="2" />
<text x="580.49" y="319.5" ></text>
</g>
<g >
<title>irq_exit_rcu (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="501" width="2.1" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1016.44" y="511.5" ></text>
</g>
<g >
<title>ExecScan (38,067,093 samples, 0.31%)</title><rect x="144.7" y="533" width="3.7" height="15.0" fill="rgb(237,150,36)" rx="2" ry="2" />
<text x="147.69" y="543.5" ></text>
</g>
<g >
<title>record__mmap_read_evlist.constprop.0 (74,457,327 samples, 0.61%)</title><rect x="74.1" y="485" width="7.2" height="15.0" fill="rgb(213,40,9)" rx="2" ry="2" />
<text x="77.09" y="495.5" ></text>
</g>
<g >
<title>do_poll.constprop.0 (84,622,241 samples, 0.69%)</title><rect x="96.5" y="437" width="8.2" height="15.0" fill="rgb(236,145,34)" rx="2" ry="2" />
<text x="99.53" y="447.5" ></text>
</g>
<g >
<title>tick_nohz_restart (1,211,724 samples, 0.01%)</title><rect x="1189.9" y="405" width="0.1" height="15.0" fill="rgb(246,191,45)" rx="2" ry="2" />
<text x="1192.88" y="415.5" ></text>
</g>
<g >
<title>charge_memcg (7,189,802 samples, 0.06%)</title><rect x="1125.4" y="405" width="0.7" height="15.0" fill="rgb(242,170,40)" rx="2" ry="2" />
<text x="1128.41" y="415.5" ></text>
</g>
<g >
<title>blk_mq_submit_bio (74,661,841 samples, 0.61%)</title><rect x="38.3" y="245" width="7.2" height="15.0" fill="rgb(224,89,21)" rx="2" ry="2" />
<text x="41.33" y="255.5" ></text>
</g>
<g >
<title>__put_page (17,226,814 samples, 0.14%)</title><rect x="1117.5" y="261" width="1.7" height="15.0" fill="rgb(225,96,22)" rx="2" ry="2" />
<text x="1120.49" y="271.5" ></text>
</g>
<g >
<title>intel_pmu_enable_all (11,595,991 samples, 0.09%)</title><rect x="66.4" y="213" width="1.1" height="15.0" fill="rgb(205,4,1)" rx="2" ry="2" />
<text x="69.43" y="223.5" ></text>
</g>
<g >
<title>[libpq.so.5.14] (266,677,017 samples, 2.18%)</title><rect x="1097.9" y="469" width="25.8" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="1100.93" y="479.5" >[..</text>
</g>
<g >
<title>__x64_sys_sched_setaffinity (1,172,608 samples, 0.01%)</title><rect x="69.2" y="421" width="0.2" height="15.0" fill="rgb(227,104,25)" rx="2" ry="2" />
<text x="72.24" y="431.5" ></text>
</g>
<g >
<title>mem_cgroup_charge_statistics.constprop.0 (1,281,512 samples, 0.01%)</title><rect x="1125.6" y="389" width="0.1" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="1128.62" y="399.5" ></text>
</g>
<g >
<title>unix_stream_read_generic (30,907,983 samples, 0.25%)</title><rect x="1117.3" y="309" width="3.0" height="15.0" fill="rgb(244,180,43)" rx="2" ry="2" />
<text x="1120.34" y="319.5" ></text>
</g>
<g >
<title>do_syscall_64 (82,051,342 samples, 0.67%)</title><rect x="81.3" y="501" width="7.9" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="84.28" y="511.5" ></text>
</g>
<g >
<title>sched_ttwu_pending (9,953,266 samples, 0.08%)</title><rect x="1180.0" y="453" width="1.0" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="1183.03" y="463.5" ></text>
</g>
<g >
<title>__mem_cgroup_flush_stats (87,484,032 samples, 0.72%)</title><rect x="29.9" y="469" width="8.4" height="15.0" fill="rgb(208,16,3)" rx="2" ry="2" />
<text x="32.88" y="479.5" ></text>
</g>
<g >
<title>vfs_read (22,222,357 samples, 0.18%)</title><rect x="597.6" y="501" width="2.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="600.60" y="511.5" ></text>
</g>
<g >
<title>vfs_read (67,058,175 samples, 0.55%)</title><rect x="10.0" y="453" width="6.5" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="13.00" y="463.5" ></text>
</g>
<g >
<title>read_pages (147,836,340 samples, 1.21%)</title><rect x="577.5" y="341" width="14.3" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="580.49" y="351.5" ></text>
</g>
<g >
<title>byteain (4,144,652,616 samples, 33.93%)</title><rect x="609.4" y="549" width="400.3" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="612.36" y="559.5" >byteain</text>
</g>
<g >
<title>blk_mq_flush_plug_list (81,184,559 samples, 0.66%)</title><rect x="577.5" y="293" width="7.8" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="580.49" y="303.5" ></text>
</g>
<g >
<title>exc_page_fault (7,841,531 samples, 0.06%)</title><rect x="1116.0" y="389" width="0.8" height="15.0" fill="rgb(212,35,8)" rx="2" ry="2" />
<text x="1119.05" y="399.5" ></text>
</g>
<g >
<title>xas_find_conflict (40,940,264 samples, 0.34%)</title><rect x="563.9" y="309" width="4.0" height="15.0" fill="rgb(206,8,1)" rx="2" ry="2" />
<text x="566.90" y="319.5" ></text>
</g>
<g >
<title>pq_sendcountedtext (19,799,567 samples, 0.16%)</title><rect x="407.5" y="517" width="1.9" height="15.0" fill="rgb(247,194,46)" rx="2" ry="2" />
<text x="410.53" y="527.5" ></text>
</g>
<g >
<title>__libc_start_call_main (266,677,017 samples, 2.18%)</title><rect x="1097.9" y="549" width="25.8" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="1100.93" y="559.5" >_..</text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (161,526,203 samples, 1.32%)</title><rect x="1132.2" y="421" width="15.6" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="1135.20" y="431.5" ></text>
</g>
<g >
<title>__softirqentry_text_start (158,193,906 samples, 1.29%)</title><rect x="1132.5" y="389" width="15.3" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1135.52" y="399.5" ></text>
</g>
<g >
<title>perf_event_for_each_child (84,328,208 samples, 0.69%)</title><rect x="59.5" y="373" width="8.1" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="62.47" y="383.5" ></text>
</g>
<g >
<title>scsi_finish_command (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="421" width="2.1" height="15.0" fill="rgb(243,178,42)" rx="2" ry="2" />
<text x="1016.44" y="431.5" ></text>
</g>
<g >
<title>run_rebalance_domains (19,744,926 samples, 0.16%)</title><rect x="403.3" y="437" width="1.9" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="406.30" y="447.5" ></text>
</g>
<g >
<title>update_process_times (1,119,473 samples, 0.01%)</title><rect x="62.3" y="181" width="0.1" height="15.0" fill="rgb(250,209,50)" rx="2" ry="2" />
<text x="65.25" y="191.5" ></text>
</g>
<g >
<title>__mem_cgroup_uncharge (13,069,599 samples, 0.11%)</title><rect x="1117.5" y="229" width="1.3" height="15.0" fill="rgb(251,215,51)" rx="2" ry="2" />
<text x="1120.49" y="239.5" ></text>
</g>
<g >
<title>ttwu_do_activate (9,953,266 samples, 0.08%)</title><rect x="1180.0" y="437" width="1.0" height="15.0" fill="rgb(215,48,11)" rx="2" ry="2" />
<text x="1183.03" y="447.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath (64,385,252 samples, 0.53%)</title><rect x="16.5" y="229" width="6.2" height="15.0" fill="rgb(238,153,36)" rx="2" ry="2" />
<text x="19.48" y="239.5" ></text>
</g>
<g >
<title>touch_atime (20,295,239 samples, 0.17%)</title><rect x="591.8" y="421" width="1.9" height="15.0" fill="rgb(205,2,0)" rx="2" ry="2" />
<text x="594.77" y="431.5" ></text>
</g>
<g >
<title>__memmove_evex_unaligned_erms (82,517,600 samples, 0.68%)</title><rect x="409.4" y="533" width="8.0" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="412.44" y="543.5" ></text>
</g>
<g >
<title>__x64_sys_poll (82,051,342 samples, 0.67%)</title><rect x="81.3" y="485" width="7.9" height="15.0" fill="rgb(245,186,44)" rx="2" ry="2" />
<text x="84.28" y="495.5" ></text>
</g>
<g >
<title>_raw_spin_lock (2,670,030 samples, 0.02%)</title><rect x="1178.8" y="405" width="0.3" height="15.0" fill="rgb(239,160,38)" rx="2" ry="2" />
<text x="1181.81" y="415.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (39,667,816 samples, 0.32%)</title><rect x="605.5" y="533" width="3.9" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="608.53" y="543.5" ></text>
</g>
<g >
<title>__errno_location (1,608,855 samples, 0.01%)</title><rect x="1116.8" y="421" width="0.2" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="1119.80" y="431.5" ></text>
</g>
<g >
<title>page_cache_async_ra (1,191,779,499 samples, 9.76%)</title><rect x="476.7" y="389" width="115.1" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="479.65" y="399.5" >page_cache_asy..</text>
</g>
<g >
<title>pud_val (3,512,982 samples, 0.03%)</title><rect x="1126.9" y="469" width="0.3" height="15.0" fill="rgb(238,151,36)" rx="2" ry="2" />
<text x="1129.91" y="479.5" ></text>
</g>
<g >
<title>handle_irq_event (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="373" width="1.9" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="1133.35" y="383.5" ></text>
</g>
<g >
<title>__hrtimer_run_queues (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="357" width="0.2" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="1123.49" y="367.5" ></text>
</g>
<g >
<title>sched_clock_idle_sleep_event (1,411,083 samples, 0.01%)</title><rect x="1181.8" y="469" width="0.1" height="15.0" fill="rgb(250,210,50)" rx="2" ry="2" />
<text x="1184.79" y="479.5" ></text>
</g>
<g >
<title>__page_cache_alloc (223,601,796 samples, 1.83%)</title><rect x="484.9" y="341" width="21.6" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="487.92" y="351.5" >_..</text>
</g>
<g >
<title>process_one_work (74,661,841 samples, 0.61%)</title><rect x="38.3" y="501" width="7.2" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="41.33" y="511.5" ></text>
</g>
<g >
<title>__sys_sendto (39,667,816 samples, 0.32%)</title><rect x="605.5" y="485" width="3.9" height="15.0" fill="rgb(236,146,34)" rx="2" ry="2" />
<text x="608.53" y="495.5" ></text>
</g>
<g >
<title>ata_qc_complete (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="277" width="1.9" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="1133.35" y="287.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (151,999,622 samples, 1.24%)</title><rect x="96.5" y="501" width="14.7" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="99.53" y="511.5" ></text>
</g>
<g >
<title>rb_insert_color (2,229,102 samples, 0.02%)</title><rect x="1179.1" y="389" width="0.2" height="15.0" fill="rgb(238,156,37)" rx="2" ry="2" />
<text x="1182.09" y="399.5" ></text>
</g>
<g >
<title>alloc_pages_vma (3,419,088 samples, 0.03%)</title><rect x="1116.5" y="293" width="0.3" height="15.0" fill="rgb(253,224,53)" rx="2" ry="2" />
<text x="1119.47" y="303.5" ></text>
</g>
<g >
<title>alloc_cpumask_var (1,206,176 samples, 0.01%)</title><rect x="69.4" y="389" width="0.1" height="15.0" fill="rgb(248,200,48)" rx="2" ry="2" />
<text x="72.36" y="399.5" ></text>
</g>
<g >
<title>__mem_cgroup_charge (10,973,383 samples, 0.09%)</title><rect x="1125.3" y="421" width="1.0" height="15.0" fill="rgb(218,60,14)" rx="2" ry="2" />
<text x="1128.26" y="431.5" ></text>
</g>
<g >
<title>start_kernel (78,039,624 samples, 0.64%)</title><rect x="1182.5" y="501" width="7.5" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="1185.46" y="511.5" ></text>
</g>
<g >
<title>cgroup_rstat_flush_locked (87,484,032 samples, 0.72%)</title><rect x="29.9" y="437" width="8.4" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="32.88" y="447.5" ></text>
</g>
<g >
<title>switch_mm_irqs_off (3,723,054 samples, 0.03%)</title><rect x="1147.8" y="421" width="0.4" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="1150.80" y="431.5" ></text>
</g>
<g >
<title>hash_search_with_hash_value (17,364,509 samples, 0.14%)</title><rect x="1011.8" y="549" width="1.6" height="15.0" fill="rgb(249,205,49)" rx="2" ry="2" />
<text x="1014.77" y="559.5" ></text>
</g>
<g >
<title>__mod_lruvec_state (21,080,461 samples, 0.17%)</title><rect x="573.6" y="293" width="2.0" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="576.61" y="303.5" ></text>
</g>
<g >
<title>__evlist__enable.constprop.0 (155,371,669 samples, 1.27%)</title><rect x="58.9" y="485" width="15.0" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="61.85" y="495.5" ></text>
</g>
<g >
<title>[unknown] (3,096,818,329 samples, 25.35%)</title><rect x="130.2" y="549" width="299.2" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="133.23" y="559.5" >[unknown]</text>
</g>
<g >
<title>do_user_addr_fault (26,647,528 samples, 0.22%)</title><rect x="1124.7" y="501" width="2.5" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="1127.67" y="511.5" ></text>
</g>
<g >
<title>skb_release_head_state (1,936,961 samples, 0.02%)</title><rect x="1119.4" y="277" width="0.2" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="1122.38" y="287.5" ></text>
</g>
<g >
<title>enqueue_task (9,953,266 samples, 0.08%)</title><rect x="1180.0" y="421" width="1.0" height="15.0" fill="rgb(230,116,27)" rx="2" ry="2" />
<text x="1183.03" y="431.5" ></text>
</g>
<g >
<title>do_syscall_64 (22,032,287 samples, 0.18%)</title><rect x="1121.6" y="405" width="2.1" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="1124.56" y="415.5" ></text>
</g>
<g >
<title>[psql] (266,677,017 samples, 2.18%)</title><rect x="1097.9" y="533" width="25.8" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="1100.93" y="543.5" >[..</text>
</g>
<g >
<title>alloc_pages (223,601,796 samples, 1.83%)</title><rect x="484.9" y="325" width="21.6" height="15.0" fill="rgb(210,25,6)" rx="2" ry="2" />
<text x="487.92" y="335.5" >a..</text>
</g>
<g >
<title>__sysvec_apic_timer_interrupt (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="341" width="7.1" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="1185.46" y="351.5" ></text>
</g>
<g >
<title>clear_page_erms (1,500,491 samples, 0.01%)</title><rect x="1116.7" y="245" width="0.1" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="1119.66" y="255.5" ></text>
</g>
<g >
<title>asm_exc_page_fault (1,547,083 samples, 0.01%)</title><rect x="1097.8" y="517" width="0.1" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="1100.78" y="527.5" ></text>
</g>
<g >
<title>ondemand_readahead (1,191,779,499 samples, 9.76%)</title><rect x="476.7" y="373" width="115.1" height="15.0" fill="rgb(251,213,51)" rx="2" ry="2" />
<text x="479.65" y="383.5" >ondemand_reada..</text>
</g>
<g >
<title>evlist_cpu_iterator__next (64,324,564 samples, 0.53%)</title><rect x="52.6" y="469" width="6.3" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="55.64" y="479.5" ></text>
</g>
<g >
<title>ahci_handle_port_intr (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="325" width="1.9" height="15.0" fill="rgb(250,207,49)" rx="2" ry="2" />
<text x="1133.35" y="335.5" ></text>
</g>
<g >
<title>tick_sched_timer (1,119,473 samples, 0.01%)</title><rect x="62.3" y="213" width="0.1" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="65.25" y="223.5" ></text>
</g>
<g >
<title>psi_task_switch (24,729,624 samples, 0.20%)</title><rect x="71.5" y="261" width="2.4" height="15.0" fill="rgb(230,118,28)" rx="2" ry="2" />
<text x="74.47" y="271.5" ></text>
</g>
<g >
<title>clear_page_erms (84,511,679 samples, 0.69%)</title><rect x="484.9" y="277" width="8.2" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="487.92" y="287.5" ></text>
</g>
<g >
<title>mpage_process_page_bufs (74,661,841 samples, 0.61%)</title><rect x="38.3" y="341" width="7.2" height="15.0" fill="rgb(237,147,35)" rx="2" ry="2" />
<text x="41.33" y="351.5" ></text>
</g>
<g >
<title>perf_ioctl (92,563,920 samples, 0.76%)</title><rect x="59.2" y="405" width="9.0" height="15.0" fill="rgb(246,191,45)" rx="2" ry="2" />
<text x="62.21" y="415.5" ></text>
</g>
<g >
<title>psi_task_switch (25,681,033 samples, 0.21%)</title><rect x="1127.6" y="469" width="2.5" height="15.0" fill="rgb(230,118,28)" rx="2" ry="2" />
<text x="1130.64" y="479.5" ></text>
</g>
<g >
<title>[unknown] (67,058,175 samples, 0.55%)</title><rect x="10.0" y="533" width="6.5" height="15.0" fill="rgb(253,223,53)" rx="2" ry="2" />
<text x="13.00" y="543.5" ></text>
</g>
<g >
<title>kworker/u289:3- (87,484,032 samples, 0.72%)</title><rect x="29.9" y="565" width="8.4" height="15.0" fill="rgb(212,36,8)" rx="2" ry="2" />
<text x="32.88" y="575.5" ></text>
</g>
<g >
<title>do_sys_poll (151,999,622 samples, 1.24%)</title><rect x="96.5" y="453" width="14.7" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="99.53" y="463.5" ></text>
</g>
<g >
<title>alloc_pages_vma (3,336,288 samples, 0.03%)</title><rect x="1126.3" y="421" width="0.3" height="15.0" fill="rgb(253,224,53)" rx="2" ry="2" />
<text x="1129.32" y="431.5" ></text>
</g>
<g >
<title>security_socket_recvmsg (2,370,779 samples, 0.02%)</title><rect x="1117.1" y="341" width="0.2" height="15.0" fill="rgb(254,226,54)" rx="2" ry="2" />
<text x="1120.11" y="351.5" ></text>
</g>
<g >
<title>blk_mq_run_hw_queue (64,385,252 samples, 0.53%)</title><rect x="16.5" y="453" width="6.2" height="15.0" fill="rgb(212,36,8)" rx="2" ry="2" />
<text x="19.48" y="463.5" ></text>
</g>
<g >
<title>unmap_page_range (73,446,926 samples, 0.60%)</title><rect x="45.5" y="293" width="7.1" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="48.55" y="303.5" ></text>
</g>
<g >
<title>UnpinBufferNoOwner (18,425,308 samples, 0.15%)</title><rect x="150.2" y="533" width="1.8" height="15.0" fill="rgb(253,221,53)" rx="2" ry="2" />
<text x="153.20" y="543.5" ></text>
</g>
<g >
<title>__schedule (3,800,951 samples, 0.03%)</title><rect x="1181.4" y="469" width="0.4" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="1184.42" y="479.5" ></text>
</g>
<g >
<title>pglz_decompress (645,305,044 samples, 5.28%)</title><rect x="1029.5" y="549" width="62.4" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="1032.52" y="559.5" >pglz_d..</text>
</g>
<g >
<title>do_anonymous_page (3,419,088 samples, 0.03%)</title><rect x="1116.5" y="309" width="0.3" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="1119.47" y="319.5" ></text>
</g>
<g >
<title>hrtimer_cancel (1,844,927 samples, 0.02%)</title><rect x="1182.1" y="453" width="0.2" height="15.0" fill="rgb(254,228,54)" rx="2" ry="2" />
<text x="1185.08" y="463.5" ></text>
</g>
<g >
<title>FileReadV (24,009,024 samples, 0.20%)</title><rect x="124.0" y="549" width="2.3" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="127.03" y="559.5" ></text>
</g>
<g >
<title>enlargeStringInfo (24,258,203 samples, 0.20%)</title><rect x="162.7" y="517" width="2.3" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="165.68" y="527.5" ></text>
</g>
<g >
<title>__ata_scsi_queuecmd (20,981,727 samples, 0.17%)</title><rect x="581.3" y="85" width="2.1" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="584.34" y="95.5" ></text>
</g>
<g >
<title>generic_file_read_iter (1,578,471,971 samples, 12.92%)</title><rect x="441.3" y="437" width="152.4" height="15.0" fill="rgb(222,78,18)" rx="2" ry="2" />
<text x="444.26" y="447.5" >generic_file_read_i..</text>
</g>
<g >
<title>kthread (87,484,032 samples, 0.72%)</title><rect x="29.9" y="533" width="8.4" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="32.88" y="543.5" ></text>
</g>
<g >
<title>common_interrupt (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="421" width="1.9" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="1133.35" y="431.5" ></text>
</g>
<g >
<title>ext4_mpage_readpages (66,651,781 samples, 0.55%)</title><rect x="585.3" y="309" width="6.5" height="15.0" fill="rgb(228,110,26)" rx="2" ry="2" />
<text x="588.33" y="319.5" ></text>
</g>
<g >
<title>ktime_get (1,563,961 samples, 0.01%)</title><rect x="1132.4" y="373" width="0.1" height="15.0" fill="rgb(207,10,2)" rx="2" ry="2" />
<text x="1135.37" y="383.5" ></text>
</g>
<g >
<title>scsi_end_request (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="389" width="2.1" height="15.0" fill="rgb(254,226,54)" rx="2" ry="2" />
<text x="1016.44" y="399.5" ></text>
</g>
<g >
<title>ixgbe_read_reg (74,392,349 samples, 0.61%)</title><rect x="22.7" y="453" width="7.2" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="25.70" y="463.5" ></text>
</g>
<g >
<title>unix_stream_read_actor (7,875,520 samples, 0.06%)</title><rect x="1119.6" y="293" width="0.7" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="1122.56" y="303.5" ></text>
</g>
<g >
<title>ahci_single_level_irq_intr (22,118,264 samples, 0.18%)</title><rect x="1018.2" y="437" width="2.1" height="15.0" fill="rgb(213,38,9)" rx="2" ry="2" />
<text x="1021.15" y="447.5" ></text>
</g>
<g >
<title>__blk_mq_delay_run_hw_queue (64,385,252 samples, 0.53%)</title><rect x="16.5" y="437" width="6.2" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="19.48" y="447.5" ></text>
</g>
<g >
<title>consume_skb (21,495,742 samples, 0.18%)</title><rect x="1117.5" y="293" width="2.1" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="1120.49" y="303.5" ></text>
</g>
<g >
<title>handle_pte_fault (1,860,090 samples, 0.02%)</title><rect x="1127.4" y="453" width="0.2" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="1130.42" y="463.5" ></text>
</g>
<g >
<title>cpu_startup_entry (541,804,420 samples, 4.44%)</title><rect x="1130.1" y="517" width="52.4" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="1133.13" y="527.5" >cpu_s..</text>
</g>
<g >
<title>nohz_run_idle_balance (4,493,289 samples, 0.04%)</title><rect x="1181.0" y="485" width="0.4" height="15.0" fill="rgb(221,74,17)" rx="2" ry="2" />
<text x="1183.99" y="495.5" ></text>
</g>
<g >
<title>__run_timers.part.0 (84,901,418 samples, 0.70%)</title><rect x="1139.6" y="357" width="8.2" height="15.0" fill="rgb(215,49,11)" rx="2" ry="2" />
<text x="1142.60" y="367.5" ></text>
</g>
<g >
<title>syscall_return_via_sysret (1,168,523 samples, 0.01%)</title><rect x="68.5" y="453" width="0.1" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="71.46" y="463.5" ></text>
</g>
<g >
<title>__sysvec_apic_timer_interrupt (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="389" width="0.2" height="15.0" fill="rgb(242,173,41)" rx="2" ry="2" />
<text x="1123.49" y="399.5" ></text>
</g>
<g >
<title>pick_next_task (12,456,426 samples, 0.10%)</title><rect x="70.3" y="261" width="1.2" height="15.0" fill="rgb(206,4,1)" rx="2" ry="2" />
<text x="73.27" y="271.5" ></text>
</g>
<g >
<title>lock_timer_base (64,385,252 samples, 0.53%)</title><rect x="16.5" y="261" width="6.2" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="19.48" y="271.5" ></text>
</g>
<g >
<title>perf_mmap__push (74,457,327 samples, 0.61%)</title><rect x="74.1" y="469" width="7.2" height="15.0" fill="rgb(248,201,48)" rx="2" ry="2" />
<text x="77.09" y="479.5" ></text>
</g>
<g >
<title>dup_mmap (63,281,290 samples, 0.52%)</title><rect x="429.4" y="421" width="6.1" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="432.37" y="431.5" ></text>
</g>
<g >
<title>seq_printf (67,058,175 samples, 0.55%)</title><rect x="10.0" y="373" width="6.5" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="13.00" y="383.5" ></text>
</g>
<g >
<title>kworker/117:1H- (64,385,252 samples, 0.53%)</title><rect x="16.5" y="565" width="6.2" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="19.48" y="575.5" ></text>
</g>
<g >
<title>__blk_mq_run_hw_queue (64,385,252 samples, 0.53%)</title><rect x="16.5" y="421" width="6.2" height="15.0" fill="rgb(235,140,33)" rx="2" ry="2" />
<text x="19.48" y="431.5" ></text>
</g>
<g >
<title>__blk_mq_do_dispatch_sched (60,273,008 samples, 0.49%)</title><rect x="579.5" y="165" width="5.8" height="15.0" fill="rgb(231,122,29)" rx="2" ry="2" />
<text x="582.51" y="175.5" ></text>
</g>
<g >
<title>sched_clock_cpu (1,393,531 samples, 0.01%)</title><rect x="1178.4" y="437" width="0.1" height="15.0" fill="rgb(229,113,27)" rx="2" ry="2" />
<text x="1181.35" y="447.5" ></text>
</g>
<g >
<title>__do_sys_clone (63,281,290 samples, 0.52%)</title><rect x="429.4" y="485" width="6.1" height="15.0" fill="rgb(211,27,6)" rx="2" ry="2" />
<text x="432.37" y="495.5" ></text>
</g>
<g >
<title>rcu_gp_kthread (26,186,933 samples, 0.21%)</title><rect x="1127.6" y="517" width="2.5" height="15.0" fill="rgb(241,168,40)" rx="2" ry="2" />
<text x="1130.60" y="527.5" ></text>
</g>
<g >
<title>ata_tf_to_fis (20,981,727 samples, 0.17%)</title><rect x="581.3" y="37" width="2.1" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="584.34" y="47.5" ></text>
</g>
<g >
<title>kthread (64,385,252 samples, 0.53%)</title><rect x="16.5" y="533" width="6.2" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="19.48" y="543.5" ></text>
</g>
<g >
<title>all (12,215,775,337 samples, 100%)</title><rect x="10.0" y="581" width="1180.0" height="15.0" fill="rgb(213,39,9)" rx="2" ry="2" />
<text x="13.00" y="591.5" ></text>
</g>
<g >
<title>native_sched_clock (1,411,083 samples, 0.01%)</title><rect x="1181.8" y="421" width="0.1" height="15.0" fill="rgb(218,64,15)" rx="2" ry="2" />
<text x="1184.79" y="431.5" ></text>
</g>
<g >
<title>propagate_protected_usage (2,265,626 samples, 0.02%)</title><rect x="1118.5" y="165" width="0.3" height="15.0" fill="rgb(206,5,1)" rx="2" ry="2" />
<text x="1121.53" y="175.5" ></text>
</g>
<g >
<title>do_syscall_64 (39,667,816 samples, 0.32%)</title><rect x="605.5" y="517" width="3.9" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="608.53" y="527.5" ></text>
</g>
<g >
<title>__pagevec_lru_add (40,177,954 samples, 0.33%)</title><rect x="573.6" y="309" width="3.9" height="15.0" fill="rgb(228,106,25)" rx="2" ry="2" />
<text x="576.61" y="319.5" ></text>
</g>
<g >
<title>enqueue_task_fair (8,385,341 samples, 0.07%)</title><rect x="1180.0" y="405" width="0.8" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="1183.03" y="415.5" ></text>
</g>
<g >
<title>do_syscall_64 (46,657,971 samples, 0.38%)</title><rect x="69.4" y="421" width="4.5" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="72.36" y="431.5" ></text>
</g>
<g >
<title>__blk_mq_alloc_request (22,302,984 samples, 0.18%)</title><rect x="589.6" y="229" width="2.2" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="592.62" y="239.5" ></text>
</g>
<g >
<title>blk_mq_run_hw_queue (81,184,559 samples, 0.66%)</title><rect x="577.5" y="261" width="7.8" height="15.0" fill="rgb(212,36,8)" rx="2" ry="2" />
<text x="580.49" y="271.5" ></text>
</g>
<g >
<title>memchr_inv (2,293,990 samples, 0.02%)</title><rect x="1179.6" y="421" width="0.3" height="15.0" fill="rgb(240,162,38)" rx="2" ry="2" />
<text x="1182.63" y="431.5" ></text>
</g>
<g >
<title>writeback_sb_inodes (74,661,841 samples, 0.61%)</title><rect x="38.3" y="421" width="7.2" height="15.0" fill="rgb(237,148,35)" rx="2" ry="2" />
<text x="41.33" y="431.5" ></text>
</g>
<g >
<title>__alloc_pages (223,601,796 samples, 1.83%)</title><rect x="484.9" y="309" width="21.6" height="15.0" fill="rgb(233,129,30)" rx="2" ry="2" />
<text x="487.92" y="319.5" >_..</text>
</g>
<g >
<title>psi_group_change (1,591,709 samples, 0.01%)</title><rect x="1122.9" y="261" width="0.2" height="15.0" fill="rgb(226,101,24)" rx="2" ry="2" />
<text x="1125.90" y="271.5" ></text>
</g>
<g >
<title>__libc_start_call_main (296,475,790 samples, 2.43%)</title><rect x="52.6" y="549" width="28.7" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="55.64" y="559.5" >__..</text>
</g>
<g >
<title>intel_pmu_enable_all (2,812,603 samples, 0.02%)</title><rect x="65.9" y="229" width="0.2" height="15.0" fill="rgb(205,4,1)" rx="2" ry="2" />
<text x="68.88" y="239.5" ></text>
</g>
<g >
<title>sock_alloc_send_pskb (39,667,816 samples, 0.32%)</title><rect x="605.5" y="437" width="3.9" height="15.0" fill="rgb(210,26,6)" rx="2" ry="2" />
<text x="608.53" y="447.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (47,830,579 samples, 0.39%)</title><rect x="69.2" y="437" width="4.7" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="72.24" y="447.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="325" width="7.1" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="1185.46" y="335.5" ></text>
</g>
<g >
<title>load_new_mm_cr3 (1,769,721 samples, 0.01%)</title><rect x="1148.0" y="405" width="0.2" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="1150.99" y="415.5" ></text>
</g>
<g >
<title>rcu_sched (26,186,933 samples, 0.21%)</title><rect x="1127.6" y="565" width="2.5" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="1130.60" y="575.5" ></text>
</g>
<g >
<title>__x64_sys_pread64 (1,659,513,994 samples, 13.59%)</title><rect x="437.3" y="501" width="160.3" height="15.0" fill="rgb(211,30,7)" rx="2" ry="2" />
<text x="440.29" y="511.5" >__x64_sys_pread64</text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (22,032,287 samples, 0.18%)</title><rect x="1121.6" y="421" width="2.1" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="1124.56" y="431.5" ></text>
</g>
<g >
<title>[libpq.so.5.14] (235,797,432 samples, 1.93%)</title><rect x="1097.9" y="437" width="22.8" height="15.0" fill="rgb(216,51,12)" rx="2" ry="2" />
<text x="1100.93" y="447.5" >[..</text>
</g>
<g >
<title>secondary_startup_64_no_verify (619,844,044 samples, 5.07%)</title><rect x="1130.1" y="549" width="59.9" height="15.0" fill="rgb(239,157,37)" rx="2" ry="2" />
<text x="1133.13" y="559.5" >second..</text>
</g>
<g >
<title>syscall_exit_to_user_mode (1,973,246 samples, 0.02%)</title><rect x="68.3" y="421" width="0.2" height="15.0" fill="rgb(251,211,50)" rx="2" ry="2" />
<text x="71.27" y="431.5" ></text>
</g>
<g >
<title>blk_mq_sched_insert_requests (81,184,559 samples, 0.66%)</title><rect x="577.5" y="277" width="7.8" height="15.0" fill="rgb(217,59,14)" rx="2" ry="2" />
<text x="580.49" y="287.5" ></text>
</g>
<g >
<title>process_one_work (64,385,252 samples, 0.53%)</title><rect x="16.5" y="501" width="6.2" height="15.0" fill="rgb(237,151,36)" rx="2" ry="2" />
<text x="19.48" y="511.5" ></text>
</g>
<g >
<title>ixgbe_service_task (74,392,349 samples, 0.61%)</title><rect x="22.7" y="485" width="7.2" height="15.0" fill="rgb(207,11,2)" rx="2" ry="2" />
<text x="25.70" y="495.5" ></text>
</g>
<g >
<title>schedule (25,681,033 samples, 0.21%)</title><rect x="1127.6" y="501" width="2.5" height="15.0" fill="rgb(254,229,54)" rx="2" ry="2" />
<text x="1130.64" y="511.5" ></text>
</g>
<g >
<title>kblockd_mod_delayed_work_on (64,385,252 samples, 0.53%)</title><rect x="16.5" y="341" width="6.2" height="15.0" fill="rgb(232,124,29)" rx="2" ry="2" />
<text x="19.48" y="351.5" ></text>
</g>
<g >
<title>kmalloc_slab (1,206,176 samples, 0.01%)</title><rect x="69.4" y="357" width="0.1" height="15.0" fill="rgb(225,95,22)" rx="2" ry="2" />
<text x="72.36" y="367.5" ></text>
</g>
<g >
<title>__x64_sys_ioctl (95,540,776 samples, 0.78%)</title><rect x="58.9" y="421" width="9.3" height="15.0" fill="rgb(249,204,49)" rx="2" ry="2" />
<text x="61.93" y="431.5" ></text>
</g>
<g >
<title>arch_local_irq_enable (1,565,635 samples, 0.01%)</title><rect x="1126.0" y="373" width="0.1" height="15.0" fill="rgb(252,216,51)" rx="2" ry="2" />
<text x="1128.95" y="383.5" ></text>
</g>
<g >
<title>affine_move_task (45,451,795 samples, 0.37%)</title><rect x="69.5" y="341" width="4.4" height="15.0" fill="rgb(254,227,54)" rx="2" ry="2" />
<text x="72.47" y="351.5" ></text>
</g>
<g >
<title>__x64_sys_clone (63,281,290 samples, 0.52%)</title><rect x="429.4" y="501" width="6.1" height="15.0" fill="rgb(214,42,10)" rx="2" ry="2" />
<text x="432.37" y="511.5" ></text>
</g>
<g >
<title>rq_qos_wait (74,661,841 samples, 0.61%)</title><rect x="38.3" y="213" width="7.2" height="15.0" fill="rgb(206,8,2)" rx="2" ry="2" />
<text x="41.33" y="223.5" ></text>
</g>
<g >
<title>psi_group_change (1,567,925 samples, 0.01%)</title><rect x="1180.8" y="389" width="0.2" height="15.0" fill="rgb(226,101,24)" rx="2" ry="2" />
<text x="1183.84" y="399.5" ></text>
</g>
<g >
<title>__x64_sys_poll (151,999,622 samples, 1.24%)</title><rect x="96.5" y="469" width="14.7" height="15.0" fill="rgb(245,186,44)" rx="2" ry="2" />
<text x="99.53" y="479.5" ></text>
</g>
<g >
<title>evlist_cpu_iterator__next (54,776,217 samples, 0.45%)</title><rect x="68.6" y="469" width="5.3" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="71.57" y="479.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="373" width="0.2" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="1123.49" y="383.5" ></text>
</g>
<g >
<title>__poll (151,999,622 samples, 1.24%)</title><rect x="96.5" y="517" width="14.7" height="15.0" fill="rgb(250,208,49)" rx="2" ry="2" />
<text x="99.53" y="527.5" ></text>
</g>
<g >
<title>scsi_dispatch_cmd (20,981,727 samples, 0.17%)</title><rect x="581.3" y="117" width="2.1" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="584.34" y="127.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (34,851,992 samples, 0.29%)</title><rect x="1117.0" y="405" width="3.3" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="1119.96" y="415.5" ></text>
</g>
<g >
<title>page_cache_ra_unbounded (1,191,779,499 samples, 9.76%)</title><rect x="476.7" y="357" width="115.1" height="15.0" fill="rgb(234,134,32)" rx="2" ry="2" />
<text x="479.65" y="367.5" >page_cache_ra_..</text>
</g>
<g >
<title>anon_vma_clone (63,281,290 samples, 0.52%)</title><rect x="429.4" y="389" width="6.1" height="15.0" fill="rgb(211,27,6)" rx="2" ry="2" />
<text x="432.37" y="399.5" ></text>
</g>
<g >
<title>__fget_light (3,452,519 samples, 0.03%)</title><rect x="1122.3" y="341" width="0.3" height="15.0" fill="rgb(233,132,31)" rx="2" ry="2" />
<text x="1125.27" y="351.5" ></text>
</g>
<g >
<title>__perf_event_enable (33,912,525 samples, 0.28%)</title><rect x="64.3" y="261" width="3.2" height="15.0" fill="rgb(237,149,35)" rx="2" ry="2" />
<text x="67.27" y="271.5" ></text>
</g>
<g >
<title>irqbalance (67,058,175 samples, 0.55%)</title><rect x="10.0" y="565" width="6.5" height="15.0" fill="rgb(232,125,30)" rx="2" ry="2" />
<text x="13.00" y="575.5" ></text>
</g>
<g >
<title>rb_erase (1,768,336 samples, 0.01%)</title><rect x="1132.2" y="341" width="0.2" height="15.0" fill="rgb(219,66,15)" rx="2" ry="2" />
<text x="1135.20" y="351.5" ></text>
</g>
<g >
<title>blk_done_softirq (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="469" width="2.1" height="15.0" fill="rgb(225,93,22)" rx="2" ry="2" />
<text x="1016.44" y="479.5" ></text>
</g>
<g >
<title>__queue_work (42,858,843 samples, 0.35%)</title><rect x="1143.7" y="309" width="4.1" height="15.0" fill="rgb(212,34,8)" rx="2" ry="2" />
<text x="1146.66" y="319.5" ></text>
</g>
<g >
<title>submit_bio (22,302,984 samples, 0.18%)</title><rect x="589.6" y="293" width="2.2" height="15.0" fill="rgb(207,13,3)" rx="2" ry="2" />
<text x="592.62" y="303.5" ></text>
</g>
<g >
<title>xas_descend (40,940,264 samples, 0.34%)</title><rect x="563.9" y="293" width="4.0" height="15.0" fill="rgb(215,47,11)" rx="2" ry="2" />
<text x="566.90" y="303.5" ></text>
</g>
<g >
<title>handle_mm_fault (26,647,528 samples, 0.22%)</title><rect x="1124.7" y="485" width="2.5" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1127.67" y="495.5" ></text>
</g>
<g >
<title>psql (308,690,480 samples, 2.53%)</title><rect x="1097.8" y="565" width="29.8" height="15.0" fill="rgb(219,66,15)" rx="2" ry="2" />
<text x="1100.78" y="575.5" >psql</text>
</g>
<g >
<title>cpuidle_select (4,243,293 samples, 0.03%)</title><rect x="1178.7" y="469" width="0.4" height="15.0" fill="rgb(222,79,19)" rx="2" ry="2" />
<text x="1181.68" y="479.5" ></text>
</g>
<g >
<title>hrtimer_reprogram (1,211,724 samples, 0.01%)</title><rect x="1189.9" y="373" width="0.1" height="15.0" fill="rgb(222,80,19)" rx="2" ry="2" />
<text x="1192.88" y="383.5" ></text>
</g>
<g >
<title>sysmalloc (1,860,090 samples, 0.02%)</title><rect x="1127.4" y="549" width="0.2" height="15.0" fill="rgb(213,38,9)" rx="2" ry="2" />
<text x="1130.42" y="559.5" ></text>
</g>
<g >
<title>__poll (24,146,265 samples, 0.20%)</title><rect x="1121.4" y="437" width="2.3" height="15.0" fill="rgb(250,208,49)" rx="2" ry="2" />
<text x="1124.35" y="447.5" ></text>
</g>
<g >
<title>begin_new_exec (73,446,926 samples, 0.60%)</title><rect x="45.5" y="389" width="7.1" height="15.0" fill="rgb(245,184,44)" rx="2" ry="2" />
<text x="48.55" y="399.5" ></text>
</g>
<g >
<title>do_writepages (74,661,841 samples, 0.61%)</title><rect x="38.3" y="389" width="7.2" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="41.33" y="399.5" ></text>
</g>
<g >
<title>wb_do_writeback (74,661,841 samples, 0.61%)</title><rect x="38.3" y="469" width="7.2" height="15.0" fill="rgb(245,187,44)" rx="2" ry="2" />
<text x="41.33" y="479.5" ></text>
</g>
<g >
<title>sched_setaffinity@@GLIBC_2.3.4 (47,830,579 samples, 0.39%)</title><rect x="69.2" y="453" width="4.7" height="15.0" fill="rgb(206,6,1)" rx="2" ry="2" />
<text x="72.24" y="463.5" ></text>
</g>
<g >
<title>schedule_hrtimeout_range (4,707,790 samples, 0.04%)</title><rect x="1122.6" y="341" width="0.5" height="15.0" fill="rgb(221,75,18)" rx="2" ry="2" />
<text x="1125.60" y="351.5" ></text>
</g>
<g >
<title>exit_to_user_mode_prepare (2,430,757 samples, 0.02%)</title><rect x="1123.2" y="373" width="0.3" height="15.0" fill="rgb(228,108,25)" rx="2" ry="2" />
<text x="1126.24" y="383.5" ></text>
</g>
<g >
<title>__submit_bio (74,661,841 samples, 0.61%)</title><rect x="38.3" y="261" width="7.2" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="41.33" y="271.5" ></text>
</g>
<g >
<title>copy_user_enhanced_fast_string (255,747,209 samples, 2.09%)</title><rect x="445.8" y="405" width="24.7" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="448.84" y="415.5" >c..</text>
</g>
<g >
<title>UnpinBufferNoOwner (20,332,682 samples, 0.17%)</title><rect x="128.3" y="549" width="1.9" height="15.0" fill="rgb(253,221,53)" rx="2" ry="2" />
<text x="131.27" y="559.5" ></text>
</g>
<g >
<title>[psql] (266,677,017 samples, 2.18%)</title><rect x="1097.9" y="485" width="25.8" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="1100.93" y="495.5" >[..</text>
</g>
<g >
<title>__blk_mq_delay_run_hw_queue (64,385,252 samples, 0.53%)</title><rect x="16.5" y="357" width="6.2" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="19.48" y="367.5" ></text>
</g>
<g >
<title>record__pushfn (75,847,424 samples, 0.62%)</title><rect x="89.2" y="501" width="7.3" height="15.0" fill="rgb(244,182,43)" rx="2" ry="2" />
<text x="92.20" y="511.5" ></text>
</g>
<g >
<title>handle_irq_event (22,118,264 samples, 0.18%)</title><rect x="1018.2" y="469" width="2.1" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="1021.15" y="479.5" ></text>
</g>
<g >
<title>__GI___ioctl (100,595,452 samples, 0.82%)</title><rect x="58.9" y="469" width="9.7" height="15.0" fill="rgb(232,128,30)" rx="2" ry="2" />
<text x="61.85" y="479.5" ></text>
</g>
<g >
<title>ret_from_fork (74,392,349 samples, 0.61%)</title><rect x="22.7" y="549" width="7.2" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="25.70" y="559.5" ></text>
</g>
<g >
<title>vfs_read (1,618,487,483 samples, 13.25%)</title><rect x="441.3" y="485" width="156.3" height="15.0" fill="rgb(224,88,21)" rx="2" ry="2" />
<text x="444.26" y="495.5" >vfs_read</text>
</g>
<g >
<title>__libc_recv (34,851,992 samples, 0.29%)</title><rect x="1117.0" y="421" width="3.3" height="15.0" fill="rgb(233,129,31)" rx="2" ry="2" />
<text x="1119.96" y="431.5" ></text>
</g>
<g >
<title>ahci_handle_port_interrupt (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="309" width="1.9" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="1133.35" y="319.5" ></text>
</g>
<g >
<title>xas_nomem (16,812,206 samples, 0.14%)</title><rect x="567.9" y="309" width="1.6" height="15.0" fill="rgb(216,54,13)" rx="2" ry="2" />
<text x="570.85" y="319.5" ></text>
</g>
<g >
<title>evlist_cpu_iterator__next (2,322,230 samples, 0.02%)</title><rect x="73.9" y="485" width="0.2" height="15.0" fill="rgb(223,85,20)" rx="2" ry="2" />
<text x="76.86" y="495.5" ></text>
</g>
<g >
<title>_find_next_bit (71,808,717 samples, 0.59%)</title><rect x="1132.6" y="309" width="7.0" height="15.0" fill="rgb(230,116,27)" rx="2" ry="2" />
<text x="1135.63" y="319.5" ></text>
</g>
<g >
<title>mem_cgroup_css_rstat_flush (87,484,032 samples, 0.72%)</title><rect x="29.9" y="421" width="8.4" height="15.0" fill="rgb(253,222,53)" rx="2" ry="2" />
<text x="32.88" y="431.5" ></text>
</g>
<g >
<title>find_vma (2,552,013 samples, 0.02%)</title><rect x="1116.0" y="357" width="0.3" height="15.0" fill="rgb(213,37,8)" rx="2" ry="2" />
<text x="1119.05" y="367.5" ></text>
</g>
<g >
<title>cpu_startup_entry (78,039,624 samples, 0.64%)</title><rect x="1182.5" y="453" width="7.5" height="15.0" fill="rgb(252,220,52)" rx="2" ry="2" />
<text x="1185.46" y="463.5" ></text>
</g>
<g >
<title>scheduler_tick (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="245" width="7.1" height="15.0" fill="rgb(246,190,45)" rx="2" ry="2" />
<text x="1185.46" y="255.5" ></text>
</g>
<g >
<title>tick_nohz_restart (3,971,096 samples, 0.03%)</title><rect x="1182.1" y="469" width="0.4" height="15.0" fill="rgb(246,191,45)" rx="2" ry="2" />
<text x="1185.08" y="479.5" ></text>
</g>
<g >
<title>pg_popcount (2,486,527,712 samples, 20.36%)</title><rect x="165.0" y="517" width="240.2" height="15.0" fill="rgb(224,90,21)" rx="2" ry="2" />
<text x="168.02" y="527.5" >pg_popcount</text>
</g>
<g >
<title>__hrtimer_run_queues (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="309" width="7.1" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="1185.46" y="319.5" ></text>
</g>
<g >
<title>event_function (44,807,491 samples, 0.37%)</title><rect x="63.2" y="277" width="4.3" height="15.0" fill="rgb(252,219,52)" rx="2" ry="2" />
<text x="66.22" y="287.5" ></text>
</g>
<g >
<title>exec_mmap (73,446,926 samples, 0.60%)</title><rect x="45.5" y="373" width="7.1" height="15.0" fill="rgb(221,77,18)" rx="2" ry="2" />
<text x="48.55" y="383.5" ></text>
</g>
<g >
<title>[unknown] (2,665,355,715 samples, 21.82%)</title><rect x="152.0" y="533" width="257.4" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="154.98" y="543.5" >[unknown]</text>
</g>
<g >
<title>handle_edge_irq (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="389" width="1.9" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="1133.35" y="399.5" ></text>
</g>
<g >
<title>ahci_handle_port_intr (22,118,264 samples, 0.18%)</title><rect x="1018.2" y="421" width="2.1" height="15.0" fill="rgb(250,207,49)" rx="2" ry="2" />
<text x="1021.15" y="431.5" ></text>
</g>
<g >
<title>alloc_cpumask_var_node (1,206,176 samples, 0.01%)</title><rect x="69.4" y="373" width="0.1" height="15.0" fill="rgb(208,17,4)" rx="2" ry="2" />
<text x="72.36" y="383.5" ></text>
</g>
<g >
<title>text_to_cstring (19,137,829 samples, 0.16%)</title><rect x="427.5" y="533" width="1.9" height="15.0" fill="rgb(221,73,17)" rx="2" ry="2" />
<text x="430.53" y="543.5" ></text>
</g>
<g >
<title>tts_buffer_heap_getsomeattrs (37,819,216 samples, 0.31%)</title><rect x="1094.1" y="549" width="3.7" height="15.0" fill="rgb(222,81,19)" rx="2" ry="2" />
<text x="1097.12" y="559.5" ></text>
</g>
<g >
<title>vsnprintf (67,058,175 samples, 0.55%)</title><rect x="10.0" y="357" width="6.5" height="15.0" fill="rgb(240,164,39)" rx="2" ry="2" />
<text x="13.00" y="367.5" ></text>
</g>
<g >
<title>__memcg_kmem_charge_page (18,724,499 samples, 0.15%)</title><rect x="605.5" y="373" width="1.8" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="608.53" y="383.5" ></text>
</g>
<g >
<title>rebalance_domains (71,808,717 samples, 0.59%)</title><rect x="1132.6" y="341" width="7.0" height="15.0" fill="rgb(248,202,48)" rx="2" ry="2" />
<text x="1135.63" y="351.5" ></text>
</g>
<g >
<title>worker_thread (64,385,252 samples, 0.53%)</title><rect x="16.5" y="517" width="6.2" height="15.0" fill="rgb(214,45,10)" rx="2" ry="2" />
<text x="19.48" y="527.5" ></text>
</g>
<g >
<title>skb_free_head (2,331,967 samples, 0.02%)</title><rect x="1119.2" y="261" width="0.2" height="15.0" fill="rgb(221,77,18)" rx="2" ry="2" />
<text x="1122.15" y="271.5" ></text>
</g>
<g >
<title>mutex_lock (84,409,880 samples, 0.69%)</title><rect x="96.5" y="405" width="8.2" height="15.0" fill="rgb(217,57,13)" rx="2" ry="2" />
<text x="99.53" y="415.5" ></text>
</g>
<g >
<title>blk_stat_add (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="357" width="2.1" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="1016.44" y="367.5" ></text>
</g>
<g >
<title>blk_finish_plug (81,184,559 samples, 0.66%)</title><rect x="577.5" y="325" width="7.8" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="580.49" y="335.5" ></text>
</g>
<g >
<title>x86_64_start_kernel (78,039,624 samples, 0.64%)</title><rect x="1182.5" y="533" width="7.5" height="15.0" fill="rgb(206,7,1)" rx="2" ry="2" />
<text x="1185.46" y="543.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (1,700,456,947 samples, 13.92%)</title><rect x="435.5" y="533" width="164.2" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="438.49" y="543.5" >entry_SYSCALL_64_afte..</text>
</g>
<g >
<title>sched_clock (1,411,083 samples, 0.01%)</title><rect x="1181.8" y="437" width="0.1" height="15.0" fill="rgb(212,33,7)" rx="2" ry="2" />
<text x="1184.79" y="447.5" ></text>
</g>
<g >
<title>common_interrupt (22,118,264 samples, 0.18%)</title><rect x="1018.2" y="517" width="2.1" height="15.0" fill="rgb(215,46,11)" rx="2" ry="2" />
<text x="1021.15" y="527.5" ></text>
</g>
<g >
<title>__set_cpus_allowed_ptr_locked (45,451,795 samples, 0.37%)</title><rect x="69.5" y="357" width="4.4" height="15.0" fill="rgb(215,49,11)" rx="2" ry="2" />
<text x="72.47" y="367.5" ></text>
</g>
<g >
<title>__send (39,667,816 samples, 0.32%)</title><rect x="605.5" y="549" width="3.9" height="15.0" fill="rgb(234,134,32)" rx="2" ry="2" />
<text x="608.53" y="559.5" ></text>
</g>
<g >
<title>__blk_mq_sched_dispatch_requests (64,385,252 samples, 0.53%)</title><rect x="16.5" y="389" width="6.2" height="15.0" fill="rgb(232,126,30)" rx="2" ry="2" />
<text x="19.48" y="399.5" ></text>
</g>
<g >
<title>generic_exec_single (65,900,767 samples, 0.54%)</title><rect x="61.2" y="309" width="6.3" height="15.0" fill="rgb(234,134,32)" rx="2" ry="2" />
<text x="64.18" y="319.5" ></text>
</g>
<g >
<title>copy_process (63,281,290 samples, 0.52%)</title><rect x="429.4" y="453" width="6.1" height="15.0" fill="rgb(233,130,31)" rx="2" ry="2" />
<text x="432.37" y="463.5" ></text>
</g>
<g >
<title>scsi_io_completion (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="405" width="2.1" height="15.0" fill="rgb(247,196,46)" rx="2" ry="2" />
<text x="1016.44" y="415.5" ></text>
</g>
<g >
<title>__handle_irq_event_percpu (19,168,502 samples, 0.16%)</title><rect x="1130.3" y="357" width="1.9" height="15.0" fill="rgb(236,147,35)" rx="2" ry="2" />
<text x="1133.35" y="367.5" ></text>
</g>
<g >
<title>wait_for_completion (37,186,050 samples, 0.30%)</title><rect x="70.3" y="325" width="3.6" height="15.0" fill="rgb(211,28,6)" rx="2" ry="2" />
<text x="73.27" y="335.5" ></text>
</g>
<g >
<title>sysvec_apic_timer_interrupt (19,744,926 samples, 0.16%)</title><rect x="403.3" y="485" width="1.9" height="15.0" fill="rgb(220,69,16)" rx="2" ry="2" />
<text x="406.30" y="495.5" ></text>
</g>
<g >
<title>show_interrupts (67,058,175 samples, 0.55%)</title><rect x="10.0" y="389" width="6.5" height="15.0" fill="rgb(214,42,10)" rx="2" ry="2" />
<text x="13.00" y="399.5" ></text>
</g>
<g >
<title>obj_cgroup_charge_pages (18,724,499 samples, 0.15%)</title><rect x="605.5" y="357" width="1.8" height="15.0" fill="rgb(246,192,46)" rx="2" ry="2" />
<text x="608.53" y="367.5" ></text>
</g>
<g >
<title>ret_from_fork (87,484,032 samples, 0.72%)</title><rect x="29.9" y="549" width="8.4" height="15.0" fill="rgb(227,105,25)" rx="2" ry="2" />
<text x="32.88" y="559.5" ></text>
</g>
<g >
<title>__submit_bio (22,302,984 samples, 0.18%)</title><rect x="589.6" y="261" width="2.2" height="15.0" fill="rgb(230,117,28)" rx="2" ry="2" />
<text x="592.62" y="271.5" ></text>
</g>
<g >
<title>unmap_vmas (73,446,926 samples, 0.60%)</title><rect x="45.5" y="325" width="7.1" height="15.0" fill="rgb(243,176,42)" rx="2" ry="2" />
<text x="48.55" y="335.5" ></text>
</g>
<g >
<title>tick_sched_handle (2,202,851 samples, 0.02%)</title><rect x="1120.5" y="325" width="0.2" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="1123.49" y="335.5" ></text>
</g>
<g >
<title>handle_mm_fault (5,289,518 samples, 0.04%)</title><rect x="1116.3" y="357" width="0.5" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1119.29" y="367.5" ></text>
</g>
<g >
<title>update_rq_clock (74,143,072 samples, 0.61%)</title><rect x="1182.5" y="229" width="7.1" height="15.0" fill="rgb(231,119,28)" rx="2" ry="2" />
<text x="1185.46" y="239.5" ></text>
</g>
<g >
<title>blk_complete_reqs (21,218,658 samples, 0.17%)</title><rect x="1013.4" y="453" width="2.1" height="15.0" fill="rgb(216,52,12)" rx="2" ry="2" />
<text x="1016.44" y="463.5" ></text>
</g>
<g >
<title>mpage_submit_page (74,661,841 samples, 0.61%)</title><rect x="38.3" y="325" width="7.2" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="41.33" y="335.5" ></text>
</g>
<g >
<title>handle_mm_fault (1,860,090 samples, 0.02%)</title><rect x="1127.4" y="485" width="0.2" height="15.0" fill="rgb(234,135,32)" rx="2" ry="2" />
<text x="1130.42" y="495.5" ></text>
</g>
<g >
<title>dequeue_task_fair (1,794,773 samples, 0.01%)</title><rect x="1122.6" y="261" width="0.2" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="1125.60" y="271.5" ></text>
</g>
<g >
<title>__schedule (25,681,033 samples, 0.21%)</title><rect x="1127.6" y="485" width="2.5" height="15.0" fill="rgb(227,103,24)" rx="2" ry="2" />
<text x="1130.64" y="495.5" ></text>
</g>
<g >
<title>tick_sched_handle (1,119,473 samples, 0.01%)</title><rect x="62.3" y="197" width="0.1" height="15.0" fill="rgb(219,68,16)" rx="2" ry="2" />
<text x="65.25" y="207.5" ></text>
</g>
<g >
<title>do_anonymous_page (17,088,464 samples, 0.14%)</title><rect x="1125.3" y="437" width="1.6" height="15.0" fill="rgb(238,155,37)" rx="2" ry="2" />
<text x="1128.26" y="447.5" ></text>
</g>
<g >
<title>__wait_for_common (37,186,050 samples, 0.30%)</title><rect x="70.3" y="309" width="3.6" height="15.0" fill="rgb(230,119,28)" rx="2" ry="2" />
<text x="73.27" y="319.5" ></text>
</g>
<g >
<title>flush_memcg_stats_dwork (87,484,032 samples, 0.72%)</title><rect x="29.9" y="485" width="8.4" height="15.0" fill="rgb(209,19,4)" rx="2" ry="2" />
<text x="32.88" y="495.5" ></text>
</g>
<g >
<title>refresh_cpu_vm_stats (1,784,924 samples, 0.01%)</title><rect x="1179.9" y="421" width="0.1" height="15.0" fill="rgb(224,91,21)" rx="2" ry="2" />
<text x="1182.85" y="431.5" ></text>
</g>
<g >
<title>_raw_spin_lock_irqsave (64,385,252 samples, 0.53%)</title><rect x="16.5" y="245" width="6.2" height="15.0" fill="rgb(247,195,46)" rx="2" ry="2" />
<text x="19.48" y="255.5" ></text>
</g>
<g >
<title>ata_qc_issue (20,981,727 samples, 0.17%)</title><rect x="581.3" y="53" width="2.1" height="15.0" fill="rgb(241,169,40)" rx="2" ry="2" />
<text x="584.34" y="63.5" ></text>
</g>
<g >
<title>rmqueue_bulk (92,259,596 samples, 0.76%)</title><rect x="497.6" y="261" width="8.9" height="15.0" fill="rgb(235,138,33)" rx="2" ry="2" />
<text x="500.60" y="271.5" ></text>
</g>
<g >
<title>hrtimer_interrupt (1,768,336 samples, 0.01%)</title><rect x="1132.2" y="389" width="0.2" height="15.0" fill="rgb(228,109,26)" rx="2" ry="2" />
<text x="1135.20" y="399.5" ></text>
</g>
<g >
<title>do_syscall_64 (1,681,736,351 samples, 13.77%)</title><rect x="437.3" y="517" width="162.4" height="15.0" fill="rgb(209,20,4)" rx="2" ry="2" />
<text x="440.29" y="527.5" >do_syscall_64</text>
</g>
<g >
<title>get_page_from_freelist (20,943,317 samples, 0.17%)</title><rect x="607.3" y="373" width="2.1" height="15.0" fill="rgb(252,218,52)" rx="2" ry="2" />
<text x="610.33" y="383.5" ></text>
</g>
<g >
<title>native_queued_spin_lock_slowpath.part.0 (64,385,252 samples, 0.53%)</title><rect x="16.5" y="213" width="6.2" height="15.0" fill="rgb(248,198,47)" rx="2" ry="2" />
<text x="19.48" y="223.5" ></text>
</g>
<g >
<title>kthread (26,186,933 samples, 0.21%)</title><rect x="1127.6" y="533" width="2.5" height="15.0" fill="rgb(239,159,38)" rx="2" ry="2" />
<text x="1130.60" y="543.5" ></text>
</g>
<g >
<title>asm_sysvec_apic_timer_interrupt (1,119,473 samples, 0.01%)</title><rect x="62.3" y="293" width="0.1" height="15.0" fill="rgb(232,127,30)" rx="2" ry="2" />
<text x="65.25" y="303.5" ></text>
</g>
<g >
<title>enlargeStringInfo (21,222,511 samples, 0.17%)</title><rect x="1009.7" y="549" width="2.1" height="15.0" fill="rgb(237,150,35)" rx="2" ry="2" />
<text x="1012.72" y="559.5" ></text>
</g>
<g >
<title>CheckForSerializableConflictOutNeeded (26,318,112 samples, 0.22%)</title><rect x="119.4" y="549" width="2.5" height="15.0" fill="rgb(213,38,9)" rx="2" ry="2" />
<text x="122.40" y="559.5" ></text>
</g>
<g >
<title>__memmove_evex_unaligned_erms (6,733,320 samples, 0.06%)</title><rect x="1120.7" y="437" width="0.7" height="15.0" fill="rgb(226,100,24)" rx="2" ry="2" />
<text x="1123.70" y="447.5" ></text>
</g>
<g >
<title>new_sync_read (67,058,175 samples, 0.55%)</title><rect x="10.0" y="437" width="6.5" height="15.0" fill="rgb(218,61,14)" rx="2" ry="2" />
<text x="13.00" y="447.5" ></text>
</g>
<g >
<title>[unknown] (1,547,083 samples, 0.01%)</title><rect x="1097.8" y="549" width="0.1" height="15.0" fill="rgb(210,24,5)" rx="2" ry="2" />
<text x="1100.78" y="559.5" ></text>
</g>
<g >
<title>tick_program_event (1,844,927 samples, 0.02%)</title><rect x="1182.1" y="405" width="0.2" height="15.0" fill="rgb(241,166,39)" rx="2" ry="2" />
<text x="1185.08" y="415.5" ></text>
</g>
<g >
<title>entry_SYSCALL_64_after_hwframe (67,058,175 samples, 0.55%)</title><rect x="10.0" y="517" width="6.5" height="15.0" fill="rgb(218,63,15)" rx="2" ry="2" />
<text x="13.00" y="527.5" ></text>
</g>
</g>
</svg>
Attachment: popcount_avx512.patch (application/octet-stream)
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..354ab636da 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -87,6 +87,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use some AVX-512 Capabilities (11/01/2023)
+pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512)
+pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_AVX512)
+pg_bitutils_srv.o:CFLAGS+=$(CFLAGS_AVX512)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..a0c91273ec 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -19,6 +19,10 @@
#include <intrin.h>
#endif
+#if defined(HAVE__HW_AVX512_POPCNT)
+#include <immintrin.h>
+#endif
+
#include "port/pg_bitutils.h"
@@ -298,6 +302,23 @@ pg_popcount(const char *buf, int bytes)
uint64 popcnt = 0;
#if SIZEOF_VOID_P >= 8
+#if defined(HAVE__HW_AVX512_POPCNT)
+ uint64 tmp[8] __attribute__((aligned(64)));
+ __m512i *pc_result = (__m512i *)tmp;
+ __m512i accumulator = _mm512_setzero_si512();
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+ _mm512_store_si512(pc_result, accumulator);
+ popcnt = _mm512_reduce_add_epi64(*pc_result);
+ bytes = bytes % 64;
+
+#else // HAVE__HW_AVX512_POPCNT
/* Process in 64-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(8, buf))
{
@@ -311,6 +332,7 @@ pg_popcount(const char *buf, int bytes)
buf = (const char *) words;
}
+#endif // HAVE__HW_AVX512_POPCNT
#else
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
On 2024-Jan-25, Shankaran, Akash wrote:
> With the updated patch, we observed significant improvements and
> handily beat the previous popcount algorithm performance. No
> regressions in any scenario are observed:
> Platform: Intel Xeon Platinum 8360Y (Icelake) for data sizes 1kb - 64kb.
> Microbenchmark: 2x - 3x gains presently vs 19% previously, on the same
> microbenchmark described initially in this thread.
These are great results.
However, it would be much better if the improved code were available for
all relevant builds and activated if a CPUID test determines that the
relevant instructions are available, instead of requiring a compile-time
flag -- which most builds are not going to use, thus wasting the
opportunity for running the optimized code.
I suppose this would require patching pg_popcount64_choose() to be more
specific. Looking at the existing code, I would also consider renaming
the "_fast" variants to something like pg_popcount32_asml/
pg_popcount64_asmq so that you can name the new one pg_popcount64_asmdq
or such. (Or maybe leave the 32-bit version alone as "fast/slow", since
there's no third option for that one -- or do I misread?)
I also think this needs to move the CFLAGS-decision-making elsewhere;
asking the user to get it right is too much of a burden. Is it workable
to simply verify compiler support for the additional flags needed, and
if so add them to a new CFLAGS_BITUTILS variable or such? We already
have the CFLAGS_CRC model that should be easy to follow. Should be easy
enough to mostly copy what's in configure.ac and meson.build, right?
Finally, the matter of using ifunc as proposed by Noah seems to be still
in the air, with no patches offered for the popcount family. Given that
Nathan reports [1] a performance decrease, maybe we should set that
thought aside for now and continue to use function pointers. It's worth
keeping in mind that popcount is already using function pointers (at
least in the case where we try to use POPCNT directly), so patching to
select between three options instead of between two wouldn't be a
regression.
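
To make the "choose" arrangement concrete, here is a minimal sketch of a function pointer that rebinds itself on first use. It is not PostgreSQL code: popcount_buf, popcount_buf_choose, cpu_has_fast_popcount and the two implementations are hypothetical names picked for illustration, and __builtin_popcount / __builtin_cpu_supports assume GCC or Clang. The builtin-based variant merely stands in for whatever faster kernel would be selected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only the shape mirrors the existing *_choose functions. */
typedef uint64_t (*popcount_buf_fn) (const unsigned char *buf, size_t len);

static uint64_t popcount_buf_choose(const unsigned char *buf, size_t len);

/* Starts out pointing at the chooser; the first call rebinds it. */
static popcount_buf_fn popcount_buf = popcount_buf_choose;

/* Portable fallback: clear the lowest set bit until the byte is zero. */
static uint64_t
popcount_buf_portable(const unsigned char *buf, size_t len)
{
    uint64_t    total = 0;

    for (size_t i = 0; i < len; i++)
        for (unsigned char b = buf[i]; b != 0; b &= (unsigned char) (b - 1))
            total++;
    return total;
}

/* Builtin-based variant, standing in for a POPCNT or AVX-512 kernel. */
static uint64_t
popcount_buf_builtin(const unsigned char *buf, size_t len)
{
    uint64_t    total = 0;

    for (size_t i = 0; i < len; i++)
        total += (uint64_t) __builtin_popcount(buf[i]);
    return total;
}

/* Placeholder for a CPUID-based test; returning 0 selects the portable path. */
static int
cpu_has_fast_popcount(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("popcnt") ? 1 : 0;
#else
    return 0;
#endif
}

/* Runs once: pick an implementation, rebind the pointer, forward the call. */
static uint64_t
popcount_buf_choose(const unsigned char *buf, size_t len)
{
    popcount_buf = cpu_has_fast_popcount() ? popcount_buf_builtin
        : popcount_buf_portable;
    return popcount_buf(buf, len);
}
```

Callers always go through popcount_buf(buf, len). Adding a third implementation only changes the body of the chooser, which is why selecting among three options costs no more per call than selecting between two.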
[1]: /messages/by-id/20231107201441.GA898662@nathanxps13
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre)
On 2024-Jan-25, Alvaro Herrera wrote:
> Finally, the matter of using ifunc as proposed by Noah seems to be still
> in the air, with no patches offered for the popcount family.
Oh, I just realized that the patch as currently proposed is placing the
optimized popcount code in the path that does not require going through
a function pointer. So the performance increase is probably coming from
both avoiding jumping through the pointer as well as from the improved
instruction.
This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.
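
The two call shapes being compared can be sketched as follows. This is an illustration only, with hypothetical names (word_popcount, count_per_word, count_whole_buffer) and GCC/Clang builtins assumed. It shows where the indirect call sits in each case and makes no claim about the measured cost on any particular CPU.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count the bits of one 64-bit word. */
static int
word_popcount(uint64_t word)
{
    return __builtin_popcountll(word);
}

/* Shape A: the pointer is dereferenced once per 8-byte word. */
static int  (*word_popcount_ptr) (uint64_t) = word_popcount;

static uint64_t
count_per_word(const unsigned char *buf, size_t len)
{
    uint64_t    total = 0;

    while (len >= 8)
    {
        uint64_t    word;

        memcpy(&word, buf, sizeof(word));   /* alignment-safe load */
        total += (uint64_t) word_popcount_ptr(word);
        buf += 8;
        len -= 8;
    }
    while (len--)
        total += (uint64_t) __builtin_popcount(*buf++);
    return total;
}

/* Shape B: the whole loop sits behind one pointer, called once per buffer. */
static uint64_t
count_whole_buffer(const unsigned char *buf, size_t len)
{
    uint64_t    total = 0;

    while (len >= 8)
    {
        uint64_t    word;

        memcpy(&word, buf, sizeof(word));
        total += (uint64_t) word_popcount(word);    /* direct, inlinable call */
        buf += 8;
        len -= 8;
    }
    while (len--)
        total += (uint64_t) __builtin_popcount(*buf++);
    return total;
}

static uint64_t (*count_buffer_ptr) (const unsigned char *, size_t) = count_whole_buffer;
```

Both shapes return the same count. In shape B the indirection is paid once per buffer rather than once per word, so a benchmark comparing the two measures dispatch overhead as well as the instruction used.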
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
/messages/by-id/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net
Hi All,
> However, it would be much better if the improved code were available for
> all relevant builds and activated if a CPUID test determines that the
> relevant instructions are available, instead of requiring a compile-time
> flag -- which most builds are not going to use, thus wasting the
> opportunity for running the optimized code.
This makes sense. I addressed the feedback and am attaching an updated patch. The patch also addresses your feedback on the autoconf configuration by adding CFLAGS support. I tested the runtime check for AVX512 on multiple processors, with and without AVX512, and in each case it reported the feature as present or absent as expected.
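
For reference, a self-contained sketch of a runtime-selected AVX512 popcount is shown below. It deliberately uses a different technique from the attached patch: the GCC/Clang target function attribute enables the instructions for one function, where the patch adds CFLAGS for the whole file, and __builtin_cpu_supports replaces the hand-written CPUID check. All function names and the HAVE_AVX512_POPCNT_SKETCH macro are hypothetical, and a compiler new enough to know avx512vpopcntdq is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX512_POPCNT_SKETCH 1     /* hypothetical macro for this sketch */

/*
 * The target attribute enables AVX512F and AVX512VPOPCNTDQ for this one
 * function only; everything else is compiled with the default flags.
 */
__attribute__((target("avx512f,avx512vpopcntdq")))
static uint64_t
popcount_avx512_sketch(const unsigned char *buf, size_t len)
{
    __m512i     acc = _mm512_setzero_si512();
    uint64_t    total;

    while (len >= 64)
    {
        /* Unaligned load, so no alignment requirement on buf. */
        __m512i     v = _mm512_loadu_si512((const void *) buf);

        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(v));
        buf += 64;
        len -= 64;
    }
    total = (uint64_t) _mm512_reduce_add_epi64(acc);

    /* Fewer than 64 bytes remain; count them one at a time. */
    while (len--)
        total += (uint64_t) __builtin_popcount(*buf++);
    return total;
}
#endif

/* Byte-at-a-time fallback used when the AVX512 path is unavailable. */
static uint64_t
popcount_scalar_sketch(const unsigned char *buf, size_t len)
{
    uint64_t    total = 0;

    while (len--)
        total += (uint64_t) __builtin_popcount(*buf++);
    return total;
}

/* Checks the CPU on every call for brevity; real code would cache the choice. */
static uint64_t
popcount_dispatch_sketch(const unsigned char *buf, size_t len)
{
#ifdef HAVE_AVX512_POPCNT_SKETCH
    if (__builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512vpopcntdq"))
        return popcount_avx512_sketch(buf, len);
#endif
    return popcount_scalar_sketch(buf, len);
}
```

On hardware without the instructions the dispatcher silently takes the scalar path, which is the behaviour I tested for with the patch itself.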
> Looking at the existing code, I would also consider renaming
> the "_fast" variants to something like pg_popcount32_asml/
> pg_popcount64_asmq so that you can name the new one pg_popcount64_asmdq
> or such.
I left out the renaming, as it made sense to keep the fast/slow naming for readability.
> Finally, the matter of using ifunc as proposed by Noah seems to be still
> in the air, with no patches offered for the popcount family. Given that
> Nathan reports [1] a performance decrease, maybe we should set that
> thought aside for now and continue to use function pointers.
Since there are improvements without it (results below), I agree with you to continue using function pointers.
I collected data on machines with and without AVX512 support, using a table with 1M rows and running SQL bit_count() on a char column holding values of 84 bytes, 4 KiB, 8 KiB, and 16 KiB.
* On non-AVX512 hardware: no regression or runtime impact between the patched server (built with AVX512 support in the binary) and the unpatched server.
* On AVX512 hardware: the maximum improvement I saw was 17%, but the average was closer to 6.5% on a bare-metal machine. The benefit is lower on smaller cloud VMs on AWS (1 - 3%).
If the patch looks good, please suggest next steps on committing it.
Paul
-----Original Message-----
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Sent: Thursday, January 25, 2024 1:49 AM
To: Shankaran, Akash <akash.shankaran@intel.com>
Cc: Nathan Bossart <nathandbossart@gmail.com>; Noah Misch <noah@leadboat.com>; Amonson, Paul D <paul.d.amonson@intel.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512
Attachments:
Official-PostgreSQL-AVX-512-POPCNT.patch (application/octet-stream)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5db02b2ab7..63af1331bf 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,38 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT instructions using
+# intrinsics used in CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>],
+ [
+ __m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 2a1ee251f2..96c8f39254 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -15209,7 +15210,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15255,7 +15256,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15279,7 +15280,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15324,7 +15325,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15348,7 +15349,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -17736,6 +17737,100 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+
+ __m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512f"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+
+ __m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq -mavx512f"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 52fd7af446..d5fe701c9c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2078,6 +2078,14 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..6a01a7d89a 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -87,6 +87,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_srv.o:CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..7db3cd44ce 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -19,6 +19,8 @@
#include <intrin.h>
#endif
+#include <immintrin.h>
+
#include "port/pg_bitutils.h"
@@ -110,12 +112,16 @@ static int pg_popcount64_slow(uint64 word);
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
+static int pg_popcount512_choose(const char* buf, int bytes);
static int pg_popcount32_fast(uint32 word);
static int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount512_fast(const char* buf, int bytes);
+static uint64 pg_popcount512_slow(const char* buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-#endif /* TRY_POPCNT_FAST */
+uint64 (*pg_popcount512) (const char* buf, int bytes) = pg_popcount512_choose;
+#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -138,6 +144,36 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is available. This is
+ * similar to the method above; see this URL:
+ *
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally we make sure the xgetbv result is consistent with the CPUID results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0) // Check for AVX512VPOPCNTDQ and AVX512F
+ {
+ uint64 xcr = 0;
+ uint32 high;
+ uint32 low;
+ __asm__ __volatile__("xgetbv\t\n" : "=a"(low), "=d"(high) : "c"(xcr));
+ return (low & 0xE0) != 0;
+ } /* POPCNT 512 */
+ return false;
+}
+
/*
* These functions get called on the first call to pg_popcount32 etc.
* They detect whether we can use the asm implementations, and replace
@@ -178,6 +214,19 @@ pg_popcount64_choose(uint64 word)
return pg_popcount64(word);
}
+static int
+pg_popcount512_choose(const char* buf, int bytes) {
+ if (pg_popcount512_available())
+ {
+ pg_popcount512 = pg_popcount512_fast;
+ }
+ else
+ {
+ pg_popcount512 = pg_popcount512_slow;
+ }
+ return pg_popcount512(buf, bytes);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -212,6 +261,30 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
#endif
}
+static uint64
+pg_popcount512_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ bytes = bytes % 64;
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char)*buf++];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -265,6 +338,29 @@ pg_popcount64_slow(uint64 word)
#endif /* HAVE__BUILTIN_POPCOUNT */
}
+static uint64
+pg_popcount512_slow(const char* buf, int bytes) {
+ uint64 popcnt = 0;
+ if (buf == (const char *)TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -286,6 +382,13 @@ pg_popcount64(uint64 word)
return pg_popcount64_slow(word);
}
+static uint64 pg_popcount512_slow(const char *buf, int bytes);
+inline uint64
+pg_popcount512(const char *buf, int bytes)
+{
+ return pg_popcount512_slow(buf, bytes);
+}
+
#endif /* !TRY_POPCNT_FAST */
/*
@@ -298,22 +401,10 @@ pg_popcount(const char *buf, int bytes)
uint64 popcnt = 0;
#if SIZEOF_VOID_P >= 8
- /* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
- {
- const uint64 *words = (const uint64 *) buf;
-
- while (bytes >= 8)
- {
- popcnt += pg_popcount64(*words++);
- bytes -= 8;
- }
-
- buf = (const char *) words;
- }
+ return pg_popcount512(buf, bytes);
#else
- /* Process in 32-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(4, buf))
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ if (buf == (const char *)TYPEALIGN(4, buf))
{
const uint32 *words = (const uint32 *) buf;
Hello,
This looks quite reasonable. On my machine, I get the compiler test to
pass so I get a "yes" in configure; but of course my CPU doesn't support
the instructions so I get the slow variant. So here's the patch again
with some minor artifacts fixed.
I have the following review notes:
1. We use __get_cpuid_count and __cpuidex, relying on the macros
HAVE__GET_CPUID and HAVE__CPUID respectively; but those macros are (in
the current Postgres source) only used and tested for __get_cpuid and
__cpuid respectively. So unless there's some reason to be certain that
__get_cpuid_count is always present when __get_cpuid is present, and
that __cpuidex is present when __cpuid is present, I think we need to
add new configure tests and new HAVE_ macros for these.
2. We rely on <immintrin.h> being present with no AC_CHECK_HEADER()
test. We currently don't use this header anywhere, so I suppose we need
a test for this one as well. (Also, I suppose if we don't have
immintrin.h we can skip the rest of it?)
3. We do the __get_cpuid_count/__cpuidex test and we also do an xgetbv
test. The comment there claims that this is to check the results for
consistency. But ... how would we know that the results are ever
inconsistent? As far as I understand, if they were, we would silently
become slower. Is this really what we want? I'm confused about this
coding. Maybe we do need both tests to succeed? In that case, just
reword the comment.
I think if both tests are considered reliable on their own, then we
could either choose one of them and stick with it, ignoring the other;
or we could use one as primary and then in a USE_ASSERT_CHECKING block
verify that the other matches and throw a WARNING if not (but what would
that tell us?). Or something like that ... not sure.
4. It needs meson support, which I suppose consists of copying the
c-compiler.m4 test into meson.build, mimicking what the tests for CRC
instructions do.
I started a CI run with this patch applied,
https://cirrus-ci.com/build/4912499619790848
but because Meson support is missing, the compile failed
immediately:
[10:08:48.825] ccache cc -Isrc/port/libpgport_srv.a.p -Isrc/include -I../src/include -Isrc/include/utils -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -g -fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -Wdeclaration-after-statement -Wno-format-truncation -Wno-stringop-truncation -fPIC -pthread -DBUILDING_DLL -MD -MQ src/port/libpgport_srv.a.p/pg_bitutils.c.o -MF src/port/libpgport_srv.a.p/pg_bitutils.c.o.d -o src/port/libpgport_srv.a.p/pg_bitutils.c.o -c ../src/port/pg_bitutils.c
[10:08:48.825] ../src/port/pg_bitutils.c: In function ‘pg_popcount512_fast’:
[10:08:48.825] ../src/port/pg_bitutils.c:270:11: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~
[10:08:48.825] In file included from /usr/lib/gcc/x86_64-linux-gnu/10/include/immintrin.h:55,
[10:08:48.825] from ../src/port/pg_bitutils.c:22:
[10:08:48.825] /usr/lib/gcc/x86_64-linux-gnu/10/include/avx512fintrin.h:339:1: error: inlining failed in call to ‘always_inline’ ‘_mm512_setzero_si512’: target specific option mismatch
[10:08:48.825] 339 | _mm512_setzero_si512 (void)
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~
[10:08:48.825] ../src/port/pg_bitutils.c:270:25: note: called from here
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~~~
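For context, GCC reports "target specific option mismatch" when an always_inline intrinsic such as _mm512_setzero_si512 is expanded inside a function compiled without the matching target options; here the Meson build does not pass the -mavx512vpopcntdq -mavx512f flags that src/port/Makefile adds. The sketch below is not what the patch does. It illustrates a different technique, a per-function target attribute (GCC and Clang only), under which the same intrinsics compile without file-wide flags. The names popcount_avx512, popcount_portable and popcount_dispatch are hypothetical, and __builtin_cpu_supports stands in for the patch's own CPUID/xgetbv probe.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable byte-at-a-time fallback (clears the lowest set bit each step). */
static uint64_t
popcount_portable(const unsigned char *buf, size_t bytes)
{
	uint64_t	cnt = 0;

	while (bytes--)
	{
		unsigned char b = *buf++;

		for (; b; b &= (unsigned char) (b - 1))
			cnt++;
	}
	return cnt;
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Only this one function is compiled with AVX-512 enabled. */
__attribute__((target("avx512f,avx512vpopcntdq")))
static uint64_t
popcount_avx512(const unsigned char *buf, size_t bytes)
{
	__m512i		acc = _mm512_setzero_si512();

	/* Accumulate eight 64-bit lane counts; reduce once after the loop. */
	for (; bytes >= 64; bytes -= 64, buf += 64)
		acc = _mm512_add_epi64(acc,
			_mm512_popcnt_epi64(_mm512_loadu_si512((const void *) buf)));

	return (uint64_t) _mm512_reduce_add_epi64(acc) +
		popcount_portable(buf, bytes);
}

/* Pick an implementation at run time (assumed probe, see lead-in). */
static uint64_t
popcount_dispatch(const unsigned char *buf, size_t bytes)
{
	if (__builtin_cpu_supports("avx512f") &&
		__builtin_cpu_supports("avx512vpopcntdq"))
		return popcount_avx512(buf, bytes);
	return popcount_portable(buf, bytes);
}
#else
/* Non-x86 or non-GNU compilers: always use the portable path. */
static uint64_t
popcount_dispatch(const unsigned char *buf, size_t bytes)
{
	return popcount_portable(buf, bytes);
}
#endif
```

Whether a target attribute or per-file CFLAGS is preferable is a project decision; the posted patch uses per-file CFLAGS, mirroring the existing CRC32C handling.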
Thanks
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
Attachments:
v3-0001-Add-support-for-AVX512-implemented-POPCNT.patch (text/x-diff; charset=utf-8)
From 188cd244ffcff20e1cf0bc655106d4db1a51b55b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue, 6 Feb 2024 19:36:59 +0100
Subject: [PATCH v3] Add support for AVX512-implemented POPCNT
Author: Paul D Amonson <paul.d.amonson@intel.com>
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A@BL1PR11MB5304.namprd11.prod.outlook.com
---
config/c-compiler.m4 | 33 +++++++++++
configure | 91 ++++++++++++++++++++++++++++++
configure.ac | 8 +++
src/Makefile.global.in | 1 +
src/port/Makefile | 5 ++
src/port/pg_bitutils.c | 122 ++++++++++++++++++++++++++++++++++++-----
6 files changed, 247 insertions(+), 13 deletions(-)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5db02b2ab7..a5a3246199 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,36 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT instructions using
+# intrinsics used in CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>],
+ [__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 2a1ee251f2..47d91384b9 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -17736,6 +17737,96 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512f"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq -mavx512f"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 52fd7af446..d5fe701c9c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2078,6 +2078,14 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..6a01a7d89a 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -87,6 +87,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..f8f029190f 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -19,6 +19,8 @@
#include <intrin.h>
#endif
+#include <immintrin.h>
+
#include "port/pg_bitutils.h"
@@ -110,11 +112,15 @@ static int pg_popcount64_slow(uint64 word);
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount512_choose(const char *buf, int bytes);
static int pg_popcount32_fast(uint32 word);
static int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount512_fast(const char *buf, int bytes);
+static uint64 pg_popcount512_slow(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount512) (const char *buf, int bytes) = pg_popcount512_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -138,6 +144,40 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is
+ * available. This is similar to the method above; see
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally, we make sure the xgetbv result is consistent with the CPUID
+ * results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* Check for AVX512VPOPCNTDQ and AVX512F */
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0)
+ {
+ uint64 xcr = 0;
+ uint32 high;
+ uint32 low;
+
+__asm__ __volatile__("xgetbv\t\n":"=a"(low), "=d"(high):"c"(xcr));
+ return (low & 0xE0) != 0;
+ } /* POPCNT 512 */
+ return false;
+}
+
/*
* These functions get called on the first call to pg_popcount32 etc.
* They detect whether we can use the asm implementations, and replace
@@ -178,6 +218,17 @@ pg_popcount64_choose(uint64 word)
return pg_popcount64(word);
}
+static uint64
+pg_popcount512_choose(const char *buf, int bytes)
+{
+ if (pg_popcount512_available())
+ pg_popcount512 = pg_popcount512_fast;
+ else
+ pg_popcount512 = pg_popcount512_slow;
+
+ return pg_popcount512(buf, bytes);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -212,6 +263,32 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
#endif
}
+static uint64
+pg_popcount512_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ bytes = bytes % 64;
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -265,6 +342,31 @@ pg_popcount64_slow(uint64 word)
#endif /* HAVE__BUILTIN_POPCOUNT */
}
+static uint64
+pg_popcount512_slow(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -286,6 +388,12 @@ pg_popcount64(uint64 word)
return pg_popcount64_slow(word);
}
+uint64
+pg_popcount512(const char *buf, int bytes)
+{
+ return pg_popcount512_slow(buf, bytes);
+}
+
#endif /* !TRY_POPCNT_FAST */
/*
@@ -298,19 +406,7 @@ pg_popcount(const char *buf, int bytes)
uint64 popcnt = 0;
#if SIZEOF_VOID_P >= 8
- /* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
- {
- const uint64 *words = (const uint64 *) buf;
-
- while (bytes >= 8)
- {
- popcnt += pg_popcount64(*words++);
- bytes -= 8;
- }
-
- buf = (const char *) words;
- }
+ return pg_popcount512(buf, bytes);
#else
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
--
2.39.2
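The patch routes pg_popcount512 through a function pointer that initially points at pg_popcount512_choose; the first call probes the CPU once and repoints the pointer at the fast or slow implementation. A stripped-down sketch of that pattern follows. All names here (popcount_impl, fast_path_available, chooser_calls) are hypothetical, the probe is a stub that always reports no fast path, and the "fast" function is only a placeholder.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t (*popcount_fn) (const char *buf, int bytes);

static uint64_t popcount_choose(const char *buf, int bytes);

/* Starts at the chooser; overwritten on first use. */
static popcount_fn popcount_impl = popcount_choose;

/* Counts chooser invocations, only to make the behavior observable. */
static int	chooser_calls = 0;

/* Stand-in for a real CPU feature probe (hypothetical stub). */
static bool
fast_path_available(void)
{
	return false;
}

/* Byte-at-a-time count; clears the lowest set bit on each step. */
static uint64_t
popcount_slow(const char *buf, int bytes)
{
	uint64_t	cnt = 0;

	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		for (; b; b &= (unsigned char) (b - 1))
			cnt++;
	}
	return cnt;
}

/* Placeholder for a SIMD implementation. */
static uint64_t
popcount_fast(const char *buf, int bytes)
{
	return popcount_slow(buf, bytes);
}

/* Runs once: selects an implementation, then forwards the call. */
static uint64_t
popcount_choose(const char *buf, int bytes)
{
	chooser_calls++;
	popcount_impl = fast_path_available() ? popcount_fast : popcount_slow;
	return popcount_impl(buf, bytes);
}
```

Callers always go through popcount_impl(buf, bytes); after the first call the probe cost is no longer paid, at the price of an indirect call each time.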
I happened to notice by chance that John Naylor had posted an extension
to measure performance of popcount here:
/messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com
This might be useful as a base for a new one to verify the results of
the proposed patch on machines with the relevant instruction support.
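As a rough illustration of what such a verification harness would measure, here is a standalone C sketch. It is not John Naylor's extension and not a PostgreSQL module; the names time_popcount, popcount_bytewise and popcount_wordwise are hypothetical, __builtin_popcountll assumes GCC or Clang, and clock() offers only coarse CPU-time resolution.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef uint64_t (*popcount_fn) (const unsigned char *buf, size_t bytes);

/* Reference implementation: one byte at a time. */
static uint64_t
popcount_bytewise(const unsigned char *buf, size_t bytes)
{
	uint64_t	cnt = 0;

	while (bytes--)
	{
		unsigned char b = *buf++;

		for (; b; b &= (unsigned char) (b - 1))
			cnt++;
	}
	return cnt;
}

/* Candidate implementation: 64-bit words, then the leftover bytes. */
static uint64_t
popcount_wordwise(const unsigned char *buf, size_t bytes)
{
	uint64_t	cnt = 0;

	for (; bytes >= 8; bytes -= 8, buf += 8)
	{
		uint64_t	w;

		memcpy(&w, buf, sizeof w);	/* alignment-safe load */
		cnt += (uint64_t) __builtin_popcountll(w);
	}
	return cnt + popcount_bytewise(buf, bytes);
}

/*
 * Call fn repeatedly and return the elapsed CPU seconds; the last result
 * is stored in *last so the two implementations can be cross-checked.
 */
static double
time_popcount(popcount_fn fn, const unsigned char *buf, size_t bytes,
			  int iterations, uint64_t *last)
{
	volatile uint64_t sink = 0;	/* keeps the calls from being elided */
	clock_t		start = clock();

	for (int i = 0; i < iterations; i++)
		sink = fn(buf, bytes);
	*last = sink;
	return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

A real comparison would use buffers large enough to dwarf timer resolution and would check that both implementations return the same count before trusting any timing.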
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"We're here to devour each other alive" (Hobbes)
Álvaro,
All feedback has now been addressed. I added the additional checks for the new APIs and a separate check for the header to autoconf.
Regarding the double check for AVX-512: I added a large comment explaining why both are needed. There are cases where the CPU's ZMM registers are not exposed by the OS or hypervisor even if the CPU supports AVX-512.
The big change is adding all old and new build support to meson. I am new to meson/ninja, so please review carefully.
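The double check described above can be sketched as follows. This is a minimal illustration rather than the patch's code: it assumes GCC or Clang on x86 (for <cpuid.h> and inline asm), and the function names are hypothetical. It is also stricter than the posted patch in two ways, both of which are assumptions of this sketch: it tests the OSXSAVE bit before executing xgetbv, and it requires all of XCR0 bits 5 to 7 (opmask, upper halves of ZMM0-15, ZMM16-31) together with the SSE and AVX state bits, rather than any one of them.

```c
#include <stdbool.h>
#include <stdint.h>

/* XCR0 bits 1,2 = SSE/AVX state; bits 5,6,7 = opmask and ZMM state. */
#define XCR0_AVX512_MASK UINT64_C(0xE6)

/* True if the OS saves and restores everything AVX-512 code touches. */
static bool
xcr0_has_avx512_state(uint64_t xcr0)
{
	return (xcr0 & XCR0_AVX512_MASK) == XCR0_AVX512_MASK;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>

/* Read XCR0; only valid after CPUID has reported OSXSAVE. */
static uint64_t
read_xcr0(void)
{
	uint32_t	lo;
	uint32_t	hi;

	__asm__ __volatile__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	return ((uint64_t) hi << 32) | lo;
}

static bool
avx512_popcnt_usable(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Leaf 1, ECX bit 27 (OSXSAVE): the OS has enabled xgetbv. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
		return false;

	/* OS or hypervisor side: is the ZMM register state exposed? */
	if (!xcr0_has_avx512_state(read_xcr0()))
		return false;

	/* CPU side: leaf 7, EBX bit 16 = AVX512F, ECX bit 14 = AVX512VPOPCNTDQ. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return (ebx & (1u << 16)) != 0 && (ecx & (1u << 14)) != 0;
}
#else
/* Other architectures or compilers: never report AVX-512 as usable. */
static bool
avx512_popcnt_usable(void)
{
	return false;
}
#endif
```

With this shape, a CPU that advertises the instructions but runs under an OS or hypervisor that hides the ZMM state simply takes the slow path.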
Thanks,
Paul
-----Original Message-----
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Sent: Wednesday, February 7, 2024 2:13 AM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Shankaran, Akash <akash.shankaran@intel.com>; Nathan Bossart <nathandbossart@gmail.com>; Noah Misch <noah@leadboat.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512
Attachments:
v4-0001-Add-support-for-AVX512-implemented-POPCNT.patch (application/octet-stream)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5db02b2ab7..a5a3246199 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,36 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT instructions using
+# intrinsics used in CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>],
+ [__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 2a1ee251f2..7fe669cda2 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -15209,7 +15210,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15255,7 +15256,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15279,7 +15280,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15324,7 +15325,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15348,7 +15349,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -17702,6 +17703,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17736,6 +17772,164 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuidex(exx[0], 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __immintrin" >&5
+$as_echo_n "checking for __immintrin... " >&6; }
+if ${pgac_cv__immintrin+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+/* Don't exclude code so added return. */
+ return 1701;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__immintrin="yes"
+else
+ pgac_cv__immintrin="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__immintrin" >&5
+$as_echo "$pgac_cv__immintrin" >&6; }
+if test x"$pgac_cv__immintrin" = x"yes"; then
+
+$as_echo "#define HAVE__IMMINTRIN 1" >>confdefs.h
+
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512f"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq -mavx512f"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 52fd7af446..ae110e9d59 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2067,6 +2067,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2078,6 +2090,36 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuidex(exx[0], 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+AC_CACHE_CHECK([for __immintrin], [pgac_cv__immintrin],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [[/* Don't exclude code so added return. */
+ return 1701;
+ ]])],
+ [pgac_cv__immintrin="yes"],
+ [pgac_cv__immintrin="no"])])
+if test x"$pgac_cv__immintrin" = x"yes"; then
+ AC_DEFINE(HAVE__IMMINTRIN, 1, [Define to 1 if you have immintrin.])
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 8ed51b6aae..1e7a4dc942 100644
--- a/meson.build
+++ b/meson.build
@@ -1773,6 +1773,45 @@ elif cc.links('''
endif
+# XXX: The configure.ac check for __cpuidex() is broken, we don't copy that
+# here. To prevent problems due to two detection methods working, stop
+# checking after one.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.links('''
+ #include <immintrin.h>
+ int main(int arg, char **argv)
+ {
+ return 1701;
+ }
+ ''', name: '__immintrin',
+ args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -2146,6 +2185,32 @@ elif host_cpu == 'ppc' or host_cpu == 'ppc64'
endif
endif
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ void main(void)
+ {
+ __m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ }'''
+ if cc.links(prog, name: '_mm512_setzero_si512, _mm512_popcnt_epi64, _mm512_store_si512, and _mm512_reduce_add_epi64 with -mavx512vpopcntdq -mavx512f',
+ args: test_c_args + ['-mavx512vpopcntdq', '-mavx512f'])
+ have_avx512_popcnt = true
+ cflags_avx512_popcnt = ['-mavx512vpopcntdq', '-mavx512f']
+ endif
+endif
###############################################################
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07e73567dc..20e14c6499 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -555,6 +555,12 @@
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..ee3647282e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,6 +100,7 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..6a01a7d89a 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -87,6 +87,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_srv.o:CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 69b30ab21b..1c48a3b07e 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -184,6 +184,7 @@ foreach name, opts : pgport_variants
link_with: cflag_libs,
c_pch: pch_c_h,
kwargs: opts + {
+ 'c_args': opts.get('c_args', []) + cflags_avx512_popcnt,
'dependencies': opts['dependencies'] + [ssl],
}
)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..cda41e7438 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -15,10 +15,15 @@
#ifdef HAVE__GET_CPUID
#include <cpuid.h>
#endif
+
#ifdef HAVE__CPUID
#include <intrin.h>
#endif
+#if defined(HAVE__IMMINTRIN)
+#include <immintrin.h>
+#endif
+
#include "port/pg_bitutils.h"
@@ -110,11 +115,15 @@ static int pg_popcount64_slow(uint64 word);
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount512_choose(const char *buf, int bytes);
static int pg_popcount32_fast(uint32 word);
static int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount512_fast(const char *buf, int bytes);
+static uint64 pg_popcount512_slow(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount512) (const char *buf, int bytes) = pg_popcount512_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -138,6 +147,50 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is
+ * available. This is similar to the method above; see
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally, we make sure the xgetbv result is consistent with the CPUID
+ * results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+#ifdef HAVE__IMMINTRIN
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* Check for AVX512VPOPCNTDQ and AVX512F */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error __get_cpuid_count or __cpuidex instruction not available
+#endif
+
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0)
+ {
+ /*
+ * CPUID succeeded, does the current running OS support the
+ * ZMM registers which are required for AVX512? This check is
+ * required to make sure an old OS on a new CPU is correctly
+ * checked or a VM hypervisor is not excluding AVX512 ZMM
+ * support in the VM; see "5.1.9 Detection of AVX Instructions"
+ * https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
+ */
+ uint64 xcr = 0;
+ uint32 high;
+ uint32 low;
+
+__asm__ __volatile__("xgetbv\t\n":"=a"(low), "=d"(high):"c"(xcr));
+ return (low & 0xE0) == 0xE0;
+ } /* POPCNT 512 */
+#endif /* HAVE__IMMINTRIN */
+ return false;
+}
+
/*
* These functions get called on the first call to pg_popcount32 etc.
* They detect whether we can use the asm implementations, and replace
@@ -178,6 +231,17 @@ pg_popcount64_choose(uint64 word)
return pg_popcount64(word);
}
+static uint64
+pg_popcount512_choose(const char *buf, int bytes)
+{
+ if (pg_popcount512_available())
+ pg_popcount512 = pg_popcount512_fast;
+ else
+ pg_popcount512 = pg_popcount512_slow;
+
+ return pg_popcount512(buf, bytes);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -212,6 +276,33 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
#endif
}
+static uint64
+pg_popcount512_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+#ifdef HAVE__IMMINTRIN
+ __m512i accumulator = _mm512_setzero_si512();
+
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ bytes = bytes % 64;
+#endif /* HAVE__IMMINTRIN */
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -265,6 +356,31 @@ pg_popcount64_slow(uint64 word)
#endif /* HAVE__BUILTIN_POPCOUNT */
}
+static uint64
+pg_popcount512_slow(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -286,6 +402,12 @@ pg_popcount64(uint64 word)
return pg_popcount64_slow(word);
}
+uint64
+pg_popcount512(const char *buf, int bytes)
+{
+ return pg_popcount512_slow(buf, bytes);
+}
+
#endif /* !TRY_POPCNT_FAST */
/*
@@ -298,19 +420,7 @@ pg_popcount(const char *buf, int bytes)
uint64 popcnt = 0;
#if SIZEOF_VOID_P >= 8
- /* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
- {
- const uint64 *words = (const uint64 *) buf;
-
- while (bytes >= 8)
- {
- popcnt += pg_popcount64(*words++);
- bytes -= 8;
- }
-
- buf = (const char *) words;
- }
+ return pg_popcount512(buf, bytes);
#else
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
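For reference, the two hardware checks that pg_popcount512_available() in the patch performs can be restated as pure bit tests. This is only a sketch: the macro and function names are made up for illustration, and requiring the SSE and AVX state bits (0x06) in addition to the three AVX-512 state bits (0xE0) is an assumption based on Intel's documented detection sequence, not something the patch itself does.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 7, subleaf 0 feature bits (same masks the patch uses). */
#define SKETCH_CPUID7_EBX_AVX512F         (UINT32_C(1) << 16)	/* 0x00010000 */
#define SKETCH_CPUID7_ECX_AVX512VPOPCNTDQ (UINT32_C(1) << 14)	/* 0x00004000 */

/* XCR0 state-component bits the OS has to enable before ZMM use is safe. */
#define SKETCH_XCR0_SSE       (UINT64_C(1) << 1)
#define SKETCH_XCR0_AVX       (UINT64_C(1) << 2)
#define SKETCH_XCR0_OPMASK    (UINT64_C(1) << 5)
#define SKETCH_XCR0_ZMM_HI256 (UINT64_C(1) << 6)
#define SKETCH_XCR0_HI16_ZMM  (UINT64_C(1) << 7)
#define SKETCH_XCR0_AVX512 \
	(SKETCH_XCR0_SSE | SKETCH_XCR0_AVX | SKETCH_XCR0_OPMASK | \
	 SKETCH_XCR0_ZMM_HI256 | SKETCH_XCR0_HI16_ZMM)

/* True if the CPU advertises both AVX512F and AVX512VPOPCNTDQ. */
static bool
sketch_cpu_reports_avx512_popcnt(uint32_t ebx, uint32_t ecx)
{
	return (ebx & SKETCH_CPUID7_EBX_AVX512F) != 0 &&
		(ecx & SKETCH_CPUID7_ECX_AVX512VPOPCNTDQ) != 0;
}

/* True only if every required state component is enabled, not just one. */
static bool
sketch_os_enables_zmm(uint64_t xcr0)
{
	return (xcr0 & SKETCH_XCR0_AVX512) == SKETCH_XCR0_AVX512;
}
```

The point of comparing against the full mask is that a value such as 0x20 (opmask state only) must be rejected; a test of the form "!= 0" would accept it.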
Hi,
On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:
This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.
Ifuncs are effectively implemented as a function call via a pointer, they're
not magic, unfortunately. The sole trick they provide is that you don't
manually have to use the function pointer.
Greetings,
Andres
Hi,
On 2024-02-09 17:39:46 +0000, Amonson, Paul D wrote:
diff --git a/meson.build b/meson.build
index 8ed51b6aae..1e7a4dc942 100644
--- a/meson.build
+++ b/meson.build
@@ -1773,6 +1773,45 @@ elif cc.links('''
endif
+# XXX: The configure.ac check for __cpuidex() is broken, we don't copy that
+# here. To prevent problems due to two detection methods working, stop
+# checking after one.
This seems like a bogus copy-paste.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.links('''
+ #include <immintrin.h>
+ int main(int arg, char **argv)
+ {
+ return 1701;
+ }
+ ''', name: '__immintrin',
+ args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1)
+endif
Do these all actually have to link? Invoking the linker is slow.
I think you might be able to just use cc.has_header_symbol().
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ void main(void)
+ {
+ __m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ }'''
Does this work with msvc?
+ if cc.links(prog, name: '_mm512_setzero_si512, _mm512_popcnt_epi64, _mm512_store_si512, and _mm512_reduce_add_epi64 with -mavx512vpopcntdq -mavx512f',
That's a very long line in the output, how about using the avx feature name or
something?
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..6a01a7d89a 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -87,6 +87,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_srv.o:CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 69b30ab21b..1c48a3b07e 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -184,6 +184,7 @@ foreach name, opts : pgport_variants
link_with: cflag_libs,
c_pch: pch_c_h,
kwargs: opts + {
+ 'c_args': opts.get('c_args', []) + cflags_avx512_popcnt,
'dependencies': opts['dependencies'] + [ssl],
}
)
This will build all of pgport with the avx flags, which wouldn't be correct, I
think? The compiler might inject automatic uses of avx512 in places, which
would cause problems, no?
While you don't do the same for make, isn't even just using the avx512 for all
of pg_bitutils.c broken for exactly that reason? That's why the existing code
builds the files for various crc variants as their own file.
Greetings,
Andres Freund
On Fri, Feb 09, 2024 at 10:24:32AM -0800, Andres Freund wrote:
On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:
This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.
Ifuncs are effectively implemented as a function call via a pointer, they're
not magic, unfortunately. The sole trick they provide is that you don't
manually have to use the function pointer.
The IFUNC creators introduced it so glibc could use arch-specific memcpy with
the instruction sequence of a non-pointer, extern function call, not the
instruction sequence of a function pointer call. I don't know why the
upthread ifunc_test.patch benchmark found ifunc performing worse than function
pointers. However, it would be odd if toolchains have replaced the original
IFUNC with something equivalent to or slower than function pointers.
Hi,
On 2024-02-09 15:27:57 -0800, Noah Misch wrote:
On Fri, Feb 09, 2024 at 10:24:32AM -0800, Andres Freund wrote:
On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:
This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.
Ifuncs are effectively implemented as a function call via a pointer, they're
not magic, unfortunately. The sole trick they provide is that you don't
manually have to use the function pointer.
The IFUNC creators introduced it so glibc could use arch-specific memcpy with
the instruction sequence of a non-pointer, extern function call, not the
instruction sequence of a function pointer call.
My understanding is that the ifunc mechanism just avoids the need for repeated
indirect calls/jumps to implement a single function call, not the use of
indirect function calls at all. Calls into shared libraries, like libc, are
indirected via the GOT / PLT, i.e. an indirect function call/jump. Without
ifuncs, the target of the function call would then have to dispatch to the
resolved function. Ifuncs allow to avoid this repeated dispatch by moving the
dispatch to the dynamic linker stage, modifying the contents of the GOT/PLT to
point to the right function. Thus ifuncs are an optimization when calling a
function in a shared library that's then dispatched depending on the cpu
capabilities.
However, in our case, where the code is in the same binary, function calls
implemented in the main binary directly (possibly via a static library) don't
go through GOT/PLT. In such a case, use of ifuncs turns a normal direct
function call into one going through the GOT/PLT, i.e. makes it indirect. The
same is true for calls within a shared library if either explicit symbol
visibility is used, or -symbolic, -Wl,-Bsymbolic or such is used. Therefore
there's no efficiency gain of ifuncs over a call via function pointer.
This isn't because ifunc is implemented badly or something - the reason for
this is that dynamic relocations aren't typically implemented by patching all
callsites (".text relocations"), which is what you would need to avoid the
need for an indirect call to something that fundamentally cannot be a constant
address at link time. The reason text relocations are disfavored is that
they can make program startup quite slow, that they require allowing
modifications to executable pages which are disliked due to the security
implications, and that they make the code non-shareable, as the in-memory
executable code has to differ from the on-disk code.
I actually think ifuncs within the same binary are a tad *slower* than plain
function pointer calls, unless -fno-plt is used. Without -fno-plt, an ifunc is
called by 1) a direct call into the PLT, 2) loading the target address from
the GOT, 3) making an indirect jump to that address. Whereas a "plain
indirect function call" is just 1) load target address from variable 2) making
an indirect jump to that address. With -fno-plt the callsites themselves load
the address from the GOT.
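To make the comparison concrete, here is a minimal sketch of the "plain function pointer" variant, i.e. the same self-replacing chooser pattern that pg_bitutils.c uses for pg_popcount32/pg_popcount64. All names are hypothetical, and the capability probe is a stub that is assumed to always select the portable path; a real chooser would test CPUID there.

```c
#include <stdbool.h>
#include <stdint.h>

/* Portable implementation: clear the lowest set bit until none remain. */
static int
sketch_popcount64_portable(uint64_t word)
{
	int			n = 0;

	while (word != 0)
	{
		word &= word - 1;
		n++;
	}
	return n;
}

/* Stand-in for a CPU-specific implementation (SWAR bit tricks here). */
static int
sketch_popcount64_alt(uint64_t word)
{
	word = word - ((word >> 1) & UINT64_C(0x5555555555555555));
	word = (word & UINT64_C(0x3333333333333333)) +
		((word >> 2) & UINT64_C(0x3333333333333333));
	word = (word + (word >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	return (int) ((word * UINT64_C(0x0101010101010101)) >> 56);
}

/* Hypothetical capability probe; stubbed out so the sketch is portable. */
static bool
sketch_cpu_has_fast_path(void)
{
	return false;
}

static int	sketch_popcount64_choose(uint64_t word);

/* Starts at the chooser; every later call loads the pointer and jumps. */
static int	(*sketch_popcount64) (uint64_t word) = sketch_popcount64_choose;

/* Runs once, rewrites the pointer, then forwards the first call. */
static int
sketch_popcount64_choose(uint64_t word)
{
	if (sketch_cpu_has_fast_path())
		sketch_popcount64 = sketch_popcount64_alt;
	else
		sketch_popcount64 = sketch_popcount64_portable;

	return sketch_popcount64(word);
}
```

Callers simply write sketch_popcount64(x). After the first call the cost per call is one load of the pointer variable plus one indirect jump, which is the two-step sequence described above.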
Greetings,
Andres Freund
On Fri, Feb 09, 2024 at 08:33:23PM -0800, Andres Freund wrote:
On 2024-02-09 15:27:57 -0800, Noah Misch wrote:
On Fri, Feb 09, 2024 at 10:24:32AM -0800, Andres Freund wrote:
On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:
This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.
Ifuncs are effectively implemented as a function call via a pointer, they're
not magic, unfortunately. The sole trick they provide is that you don't
manually have to use the function pointer.
The IFUNC creators introduced it so glibc could use arch-specific memcpy with
the instruction sequence of a non-pointer, extern function call, not the
instruction sequence of a function pointer call.
My understanding is that the ifunc mechanism just avoids the need for repeated
indirect calls/jumps to implement a single function call, not the use of
indirect function calls at all. Calls into shared libraries, like libc, are
indirected via the GOT / PLT, i.e. an indirect function call/jump. Without
ifuncs, the target of the function call would then have to dispatch to the
resolved function. Ifuncs allow to avoid this repeated dispatch by moving the
dispatch to the dynamic linker stage, modifying the contents of the GOT/PLT to
point to the right function. Thus ifuncs are an optimization when calling a
function in a shared library that's then dispatched depending on the cpu
capabilities.
However, in our case, where the code is in the same binary, function calls
implemented in the main binary directly (possibly via a static library) don't
go through GOT/PLT. In such a case, use of ifuncs turns a normal direct
function call into one going through the GOT/PLT, i.e. makes it indirect. The
same is true for calls within a shared library if either explicit symbol
visibility is used, or -symbolic, -Wl,-Bsymbolic or such is used. Therefore
there's no efficiency gain of ifuncs over a call via function pointer.
This isn't because ifunc is implemented badly or something - the reason for
this is that dynamic relocations aren't typically implemented by patching all
callsites (".text relocations"), which is what you would need to avoid the
need for an indirect call to something that fundamentally cannot be a constant
address at link time. The reason text relocations are disfavored is that
they can make program startup quite slow, that they require allowing
modifications to executable pages which are disliked due to the security
implications, and that they make the code non-shareable, as the in-memory
executable code has to differ from the on-disk code.
I actually think ifuncs within the same binary are a tad *slower* than plain
function pointer calls, unless -fno-plt is used. Without -fno-plt, an ifunc is
called by 1) a direct call into the PLT, 2) loading the target address from
the GOT, 3) making an indirect jump to that address. Whereas a "plain
indirect function call" is just 1) load target address from variable 2) making
an indirect jump to that address. With -fno-plt the callsites themselves load
the address from the GOT.
That sounds more accurate than what I wrote. Thanks.
My responses with questions,
+# XXX: The configure.ac check for __cpuidex() is broken, we don't copy that
+# here. To prevent problems due to two detection methods working, stop
+# checking after one.
This seems like a bogus copy-paste.
My bad. Will remove the offending comment. :)
+# Check for header immintrin.h
...
Do these all actually have to link? Invoking the linker is slow.
I think you might be able to just use cc.has_header_symbol().
I took this to mean the last of the 3 new blocks. I changed this one to the cc_has_header method. I think I do want the first 2 to check linking as well: if they don't link here, they won't link in the actual build.
Does this work with msvc?
I think it will work but I have no way to validate it. I propose we remove the AVX-512 popcount feature from MSVC builds. Sound ok?
That's a very long line in the output, how about using the avx feature name or something?
Agree, will fix.
This will build all of pgport with the avx flags, which wouldn't be correct, I think? The compiler might inject automatic uses of avx512 in places, which would cause problems, no?
This will take me some time to learn how to do this in meson. Any pointers here would be helpful.
While you don't do the same for make, isn't even just using the avx512 for all of pg_bitutils.c broken for exactly that reason? That's why the existing code builds the files for various crc variants as their own file.
I don't think it's broken: nothing else in pg_bitutils.c will make use of AVX-512, so I am not sure that dividing this up into multiple files will yield benefits beyond code readability, as they will all be needed at compile time. I prefer not to split if the community agrees.
If splitting still makes sense, I propose splitting into 3 files: pg_bitutils.c (entry point +sw popcnt implementation), pg_popcnt_choose.c (CPUID and xgetbv check) and pg_popcnt_x86_64_accel.c (64/512bit x86 implementations).
I'm not an expert in meson, but splitting might add complexity to meson.build.
Could you elaborate if there are other benefits to the split file approach?
Paul
Hi,
On 2024-02-12 20:14:06 +0000, Amonson, Paul D wrote:
+# Check for header immintrin.h
...
Do these all actually have to link? Invoking the linker is slow.
I think you might be able to just use cc.has_header_symbol().
I took this to mean the last of the 3 new blocks.
Yep.
Does this work with msvc?
I think it will work but I have no way to validate it. I propose we remove the AVX-512 popcount feature from MSVC builds. Sound ok?
CI [1] would be able to test at least building, including via cfbot, which
runs automatically for each commitfest entry - you can see prior runs at [2].
If you look at [3], you can see that currently it doesn't seem to be
considered supported at configure time:
...
[00:23:48.480] Checking if "__get_cpuid" : links: NO
[00:23:48.480] Checking if "__cpuid" : links: YES
...
[00:23:48.492] Checking if "x86_64: popcntq instruction" compiles: NO
...
Unfortunately CI currently is configured to not upload the build logs if the
build succeeds, so we don't have enough details to see why.
This will build all of pgport with the avx flags, which wouldn't be correct, I think? The compiler might inject automatic uses of avx512 in places, which would cause problems, no?
This will take me some time to learn how to do this in meson. Any pointers
here would be helpful.
Should be fairly simple, add it to the replace_funcs_pos and add the relevant
cflags to pgport_cflags, similar to how it's done for crc.
While you don't do the same for make, isn't even just using the avx512 for all of pg_bitutils.c broken for exactly that reason? That's why the existing code builds the files for various crc variants as their own file.
I don't think its broken, nothing else in pg_bitutils.c will make use of
AVX-512
You can't really guarantee that compiler auto-vectorization won't decide to do
so, no? I wouldn't call it likely, but it's also hard to be sure it won't
happen at some point.
If splitting still makes sense, I propose splitting into 3 files: pg_bitutils.c (entry point +sw popcnt implementation), pg_popcnt_choose.c (CPUID and xgetbv check) and pg_popcnt_x86_64_accel.c (64/512bit x86 implementations).
I'm not an expert in meson, but splitting might add complexity to meson.build.
Could you elaborate if there are other benefits to the split file approach?
It won't lead to SIGILLs ;)
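The failure mode is that flags like -mavx512f apply to every function in the file, so the compiler is free to emit AVX-512 instructions outside the runtime-guarded path. As a sketch of the kind of isolation being asked for: GCC and Clang also allow confining an ISA extension to a single function with a target attribute. Note that this is a different technique from the separate-file approach used for the CRC code, it is GCC/Clang-specific, and the function names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable fallback, compiled with the file's baseline flags. */
static uint64_t
sketch_popcount_portable(const unsigned char *buf, size_t len)
{
	uint64_t	total = 0;

	for (size_t i = 0; i < len; i++)
	{
		unsigned char b = buf[i];

		while (b != 0)
		{
			b &= (unsigned char) (b - 1);	/* clear lowest set bit */
			total++;
		}
	}
	return total;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

/* Only this function may contain AVX-512 instructions. */
__attribute__((target("avx512f,avx512vpopcntdq")))
static uint64_t
sketch_popcount_avx512(const unsigned char *buf, size_t len)
{
	__m512i		acc = _mm512_setzero_si512();

	while (len >= 64)
	{
		__m512i		v = _mm512_loadu_si512((const void *) buf);

		acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(v));
		buf += 64;
		len -= 64;
	}
	/* Fewer than 64 bytes remain; count them byte by byte. */
	return (uint64_t) _mm512_reduce_add_epi64(acc) +
		sketch_popcount_portable(buf, len);
}
#endif

/* Entry point: takes the AVX-512 path only when the running CPU has it. */
uint64_t
sketch_popcount_buffer(const unsigned char *buf, size_t len)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	if (__builtin_cpu_supports("avx512f") &&
		__builtin_cpu_supports("avx512vpopcntdq"))
		return sketch_popcount_avx512(buf, len);
#endif
	return sketch_popcount_portable(buf, len);
}
```

Because the rest of the file is still built without the AVX-512 flags, auto-vectorization elsewhere in it cannot pick those instructions, which is the property the separate-object approach provides for the CRC variants.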
Greetings,
Andres Freund
[1]: https://github.com/postgres/postgres/blob/master/src/tools/ci/README
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F47%2F4675
[3]: https://cirrus-ci.com/task/5645112189911040
On Sat, Feb 10, 2024 at 03:52:38PM -0800, Noah Misch wrote:
On Fri, Feb 09, 2024 at 08:33:23PM -0800, Andres Freund wrote:
My understanding is that the ifunc mechanism just avoids the need for repeated
indirect calls/jumps to implement a single function call, not the use of
indirect function calls at all. Calls into shared libraries, like libc, are
indirected via the GOT / PLT, i.e. an indirect function call/jump. Without
ifuncs, the target of the function call would then have to dispatch to the
resolved function. Ifuncs allow to avoid this repeated dispatch by moving the
dispatch to the dynamic linker stage, modifying the contents of the GOT/PLT to
point to the right function. Thus ifuncs are an optimization when calling a
function in a shared library that's then dispatched depending on the cpu
capabilities.However, in our case, where the code is in the same binary, function calls
implemented in the main binary directly (possibly via a static library) don't
go through GOT/PLT. In such a case, use of ifuncs turns a normal direct
function call into one going through the GOT/PLT, i.e. makes it indirect. The
same is true for calls within a shared library if either explicit symbol
visibility is used, or -symbolic, -Wl,-Bsymbolic or such is used. Therefore
there's no efficiency gain of ifuncs over a call via function pointer.This isn't because ifunc is implemented badly or something - the reason for
this is that dynamic relocations aren't typically implemented by patching all
callsites (".text relocations"), which is what you would need to avoid the
need for an indirect call to something that fundamentally cannot be a constant
address at link time. The reason text relocations are disfavored is that
they can make program startup quite slow, that they require allowing
modifications to executable pages which are disliked due to the security
implications, and that they make the code non-shareable, as the in-memory
executable code has to differ from the on-disk code.I actually think ifuncs within the same binary are a tad *slower* than plain
function pointer calls, unless -fno-plt is used. Without -fno-plt, an ifunc is
called by 1) a direct call into the PLT, 2) loading the target address from
the GOT, 3) making an an indirect jump to that address. Whereas a "plain
indirect function call" is just 1) load target address from variable 2) making
an indirect jump to that address. With -fno-plt the callsites themselves load
the address from the GOT.That sounds more accurate than what I wrote. Thanks.
+1, thanks for the detailed explanation, Andres.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
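The "plain indirect function call" described above can be sketched in a few lines of C. This is only an illustration with hypothetical names (popcount64_ptr, popcount64_choose, popcount64_slow), not the code in pg_bitutils.c: the pointer starts out aimed at a chooser, and the first call replaces it with the selected implementation, so later calls cost one load plus one indirect jump.

```c
#include <stdint.h>

static int	popcount64_slow(uint64_t word);
static int	popcount64_choose(uint64_t word);

/* Callers go through this pointer; it initially points at the chooser. */
int			(*popcount64_ptr) (uint64_t word) = popcount64_choose;

/* Portable fallback: clear the lowest set bit until none remain. */
static int
popcount64_slow(uint64_t word)
{
	int			count = 0;

	while (word != 0)
	{
		word &= word - 1;
		count++;
	}
	return count;
}

/*
 * Runs on the first call only.  A real chooser would test CPU features here
 * and possibly install a faster variant; this sketch always installs the
 * portable one.
 */
static int
popcount64_choose(uint64_t word)
{
	popcount64_ptr = popcount64_slow;
	return popcount64_ptr(word);
}
```

A caller simply writes popcount64_ptr(x). Within a single binary this needs no PLT entry, which is why an ifunc would not be any cheaper here.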
Hi,
I am encountering a problem that I don't fully understand. I cannot get the MSVC build to link in CI. I added 2 files to the build, but the linker complains that a symbol from the original pg_bitutils.c file is missing (specifically 'pg_popcount'). To my knowledge my changes did not affect linking for that file, and I see pg_bitutils.c being compiled for all 3 libs in the build. All other builds are compiling.
Any help on this issue would be greatly appreciated.
My fork is at https://github.com/paul-amonson/postgresql/tree/popcnt_patch and the CI build is at https://cirrus-ci.com/task/4927666021728256.
Thanks,
Paul
-----Original Message-----
From: Andres Freund <andres@anarazel.de>
Sent: Monday, February 12, 2024 12:37 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Alvaro Herrera <alvherre@alvh.no-ip.org>; Shankaran, Akash <akash.shankaran@intel.com>; Nathan Bossart <nathandbossart@gmail.com>; Noah Misch <noah@leadboat.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512
Hi,
On 2024-02-12 20:14:06 +0000, Amonson, Paul D wrote:
+# Check for header immintrin.h
...
Do these all actually have to link? Invoking the linker is slow.
I think you might be able to just use cc.has_header_symbol().
I took this to mean the last of the 3 new blocks.
Yep.
Does this work with msvc?
I think it will work but I have no way to validate it. I propose we remove the AVX-512 popcount feature from MSVC builds. Sound ok?
CI [1] should be able to test at least building, including via cfbot, which runs automatically for each commitfest entry - you can see prior runs at [2]. They run on Zen 3 EPYC instances, so unfortunately runtime won't be tested. If you look at [3], you can see that currently it doesn't seem to be considered supported at configure time:
...
[00:23:48.480] Checking if "__get_cpuid" : links: NO
[00:23:48.480] Checking if "__cpuid" : links: YES
...
[00:23:48.492] Checking if "x86_64: popcntq instruction" compiles: NO
...
Unfortunately CI currently is configured to not upload the build logs if the build succeeds, so we don't have enough details to see why.
This will build all of pgport with the avx flags, which wouldn't be correct, I think? The compiler might inject automatic uses of avx512 in places, which would cause problems, no?
This will take me some time to learn how to do this in meson. Any
pointers here would be helpful.
Should be fairly simple, add it to the replace_funcs_pos and add the relevant cflags to pgport_cflags, similar to how it's done for crc.
While you don't do the same for make, isn't even just using the avx512 for all of pg_bitutils.c broken for exactly that reason? That's why the existing code builds the files for various crc variants as their own file.
I don't think it's broken; nothing else in pg_bitutils.c will make use of AVX-512.
You can't really guarantee that compiler auto-vectorization won't decide to do so, no? I wouldn't call it likely, but it's also hard to be sure it won't happen at some point.
If splitting still makes sense, I propose splitting into 3 files: pg_bitutils.c (entry point +sw popcnt implementation), pg_popcnt_choose.c (CPUID and xgetbv check) and pg_popcnt_x86_64_accel.c (64/512bit x86 implementations).
I'm not an expert in meson, but splitting might add complexity to meson.build. Could you elaborate if there are other benefits to the split file approach?
It won't lead to SIGILLs ;)
Greetings,
Andres Freund
[1]: https://github.com/postgres/postgres/blob/master/src/tools/ci/README
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F47%2F4675
[3]: https://cirrus-ci.com/task/5645112189911040
Hello again,
This is now a blocking issue. I can find no reason for the failing behavior of the MSVC build. All other platforms build fine in CI, including the Mac. Since the master branch builds, I assume I changed something critical to linking, but I can't figure out what that would be. Can someone with Windows/MSVC experience help me?
* Code: https://github.com/paul-amonson/postgresql/tree/popcnt_patch
* CI build: https://cirrus-ci.com/task/4927666021728256
Thanks,
Paul
Andres,
After consulting some internal Intel experts on MSVC, the linking issue as it stood was still not resolved. Instead, I created an MSVC-only workaround. It adds one extra function call on Windows builds (the linker resolves a real function just fine, but not a function pointer of the same name). This extra latency does not exist on any of the other platforms. I also believe I addressed all issues raised in the previous reviews. The new pg_popcnt_x86_64_accel.c file is now the ONLY file compiled with the AVX512 compiler flags. I added support for the MSVC compiler flag as well. Both meson and autoconf are updated with the new refactor.
I am attaching the new patch.
Paul
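As a rough illustration of the kind of workaround described in this message (a real function that forwards to a function pointer), consider the sketch below. It is an assumption about the general shape only; the names popcount_bytes, popcount_bytes_impl and popcount_bytes_portable are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

/* Portable byte-array popcount used as the default implementation. */
static uint64_t
popcount_bytes_portable(const char *buf, int bytes)
{
	uint64_t	total = 0;

	for (int i = 0; i < bytes; i++)
	{
		unsigned char b = (unsigned char) buf[i];

		/* Clear the lowest set bit until the byte is zero. */
		while (b != 0)
		{
			b &= (unsigned char) (b - 1);
			total++;
		}
	}
	return total;
}

/* Internal pointer that a runtime chooser could redirect to a faster variant. */
static uint64_t (*popcount_bytes_impl) (const char *buf, int bytes) =
	popcount_bytes_portable;

/*
 * Exported entry point.  Other objects link against an ordinary function
 * symbol; the price is one extra call before the indirect jump.
 */
uint64_t
popcount_bytes(const char *buf, int bytes)
{
	return popcount_bytes_impl(buf, bytes);
}
```

On platforms where the exported symbol can itself be the function pointer, the forwarding function is unnecessary, which matches the remark that the extra call exists only on Windows builds.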
Attachments:
v5-0001-Add-support-for-AVX512-implemented-POPCNT.patch (application/octet-stream)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5db02b2ab7..a5a3246199 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,36 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT instructions using
+# intrinsics used in CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>],
+ [__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 6b87e5c9a8..0252dab6d5 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -15209,7 +15210,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15255,7 +15256,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15279,7 +15280,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15324,7 +15325,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15348,7 +15349,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -17708,6 +17709,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17742,6 +17778,164 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __immintrin" >&5
+$as_echo_n "checking for __immintrin... " >&6; }
+if ${pgac_cv__immintrin+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+/* Don't exclude code so added return. */
+ return 1701;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__immintrin="yes"
+else
+ pgac_cv__immintrin="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__immintrin" >&5
+$as_echo "$pgac_cv__immintrin" >&6; }
+if test x"$pgac_cv__immintrin" = x"yes"; then
+
+$as_echo "#define HAVE__IMMINTRIN 1" >>confdefs.h
+
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512f"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq -mavx512f"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 6e64ece11d..8fcf635b08 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2068,6 +2068,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2079,6 +2091,36 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+AC_CACHE_CHECK([for __immintrin], [pgac_cv__immintrin],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [[/* Don't exclude code so added return. */
+ return 1701;
+ ]])],
+ [pgac_cv__immintrin="yes"],
+ [pgac_cv__immintrin="no"])])
+if test x"$pgac_cv__immintrin" = x"yes"; then
+ AC_DEFINE(HAVE__IMMINTRIN, 1, [Define to 1 if you have immintrin.])
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 8ed51b6aae..f64c10c496 100644
--- a/meson.build
+++ b/meson.build
@@ -1773,6 +1773,37 @@ elif cc.links('''
endif
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.has_header('immintrin.h',
+ include_directories: postgres_inc, args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1,
+ description: 'Define to 1 if you have the immintrin.h header file.')
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -2146,6 +2177,38 @@ elif host_cpu == 'ppc' or host_cpu == 'ppc64'
endif
endif
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ int main(void)
+ {
+ __m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ }'''
+ if cc.links(prog, name: '_mm512_* methods with -mavx512vpopcntdq -mavx512f',
+ args: test_c_args + ['-mavx512vpopcntdq', '-mavx512f'])
+ have_avx512_popcnt = true
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt = ['-mavx512vpopcntdq', '-mavx512f']
+ else
+ have_avx512_popcnt = false
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', false)
+ cflags_avx512_popcnt = []
+ endif # compile/link test
+endif # host_cpu check
+
###############################################################
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07e73567dc..20e14c6499 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -555,6 +555,12 @@
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 799f70d052..438470348a 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,16 +303,18 @@ pg_ceil_log2_64(uint64 num)
extern int (*pg_popcount32) (uint32 word);
extern int (*pg_popcount64) (uint64 word);
+extern uint64 (*pg_popcount)(const char *buf, int bytes);
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
-#endif /* TRY_POPCNT_FAST */
-
/* Count the number of one-bits in a byte array */
extern uint64 pg_popcount(const char *buf, int bytes);
+#endif /* TRY_POPCNT_FAST */
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..ee3647282e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,6 +100,7 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..ef6c02a6bf 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -43,6 +43,8 @@ OBJS = \
inet_net_ntop.o \
noblock.o \
path.o \
+ pg_popcnt_choose.o \
+ pg_popcnt_x86_64_accel.o \
pg_bitutils.o \
pg_strong_random.o \
pgcheckdir.o \
@@ -87,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_popcnt_x86_64_accel.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcnt_x86_64_accel_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcnt_x86_64_accel_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..d7930672cb 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcnt_choose.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -84,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcnt_x86_64_accel', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..90cfa65333 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -12,16 +12,8 @@
*/
#include "c.h"
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
#include "port/pg_bitutils.h"
-
/*
* Array giving the position of the left-most set bit for each possible
* byte value. We count the right-most position as the 0th bit, and the
@@ -78,6 +70,7 @@ const uint8 pg_rightmost_one_pos[256] = {
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
};
+
/*
* Array giving the number of 1-bits in each possible byte value.
*
@@ -103,123 +96,27 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
+int pg_popcount32_slow(uint32 word);
+int pg_popcount64_slow(uint64 word);
+uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount32_fast(uint32 word);
-static int pg_popcount64_fast(uint64 word);
+extern int pg_popcount32_choose(uint32 word);
+extern int pg_popcount64_choose(uint64 word);
+extern uint64 pg_popcount_choose(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-#endif /* TRY_POPCNT_FAST */
-
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
- return (exx[2] & (1 << 23)) != 0; /* POPCNT */
-}
-
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount64(word);
-}
-
-/*
- * pg_popcount32_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount32_fast(uint32 word)
-{
-#ifdef _MSC_VER
- return __popcnt(word);
-#else
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-/*
- * pg_popcount64_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount64_fast(uint64 word)
-{
-#ifdef _MSC_VER
- return __popcnt64(word);
-#else
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-#endif /* TRY_POPCNT_FAST */
-
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+#else /* TRY_POPCNT_FAST */
+uint64 pg_popcount(const char *buf, int bytes);
+#endif /* TRY_POPCNT_FAST */
/*
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -241,7 +138,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -286,22 +183,29 @@ pg_popcount64(uint64 word)
return pg_popcount64_slow(word);
}
+uint64
+pg_popcount(const char *buf, int bytes)
+{
+ return pg_popcount_slow(buf, bytes);
+}
+
#endif /* !TRY_POPCNT_FAST */
/*
* pg_popcount
- * Returns the number of 1-bits in buf
+ * Returns the number of 1-bits in buf using either 32 or 64 bit loops
+ * or fallback to __builtin_* or pure software.
*/
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P == 8
/* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
+ if (buf == (const char *)TYPEALIGN(8, buf))
{
- const uint64 *words = (const uint64 *) buf;
+ const uint64 *words = (const uint64 *)buf;
while (bytes >= 8)
{
@@ -309,9 +213,9 @@ pg_popcount(const char *buf, int bytes)
bytes -= 8;
}
- buf = (const char *) words;
+ buf = (const char *)words;
}
-#else
+#elif SIZEOF_VOID_P == 4
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
{
Thanks for the new version of the patch. I didn't see a commitfest entry
for this one, and unfortunately I think it's too late to add it for the
March commitfest. I would encourage you to add it to July's commitfest [0]
so that we can get some routine cfbot coverage.
On Tue, Feb 27, 2024 at 08:46:06PM +0000, Amonson, Paul D wrote:
After consulting some Intel internal experts on MSVC, the linking issue as
it stood was not resolved. Instead, I created an MSVC-only workaround.
This adds one extra function call on the Windows builds (the linker
resolves a real function just fine, but not a function pointer of the same
name). This extra latency does not exist on any of the other platforms. I
also believe I addressed all issues raised in the previous reviews. The
new pg_popcnt_x86_64_accel.c file is now the ONLY file compiled with the
AVX512 compiler flags. I added support for the MSVC compiler flag as
well. Both meson and autoconf are updated with the new refactor.
I am attaching the new patch.
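The MSVC workaround described above can be pictured with a minimal sketch. The names demo_popcount and demo_popcount_indirect are hypothetical stand-ins (the v5 header declares pg_popcount and pg_popcount_indirect for the MSVC case), and the byte-wise loop is only a placeholder for whichever implementation the pointer ends up targeting:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable byte-at-a-time fallback; placeholder for the real slow path. */
uint64_t
demo_popcount_slow(const char *buf, int bytes)
{
	uint64_t	count = 0;

	assert(buf != NULL || bytes == 0);
	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		/* Clear the lowest set bit until none remain. */
		for (; b != 0; b &= (unsigned char) (b - 1))
			count++;
	}
	return count;
}

/* Function pointer that a run-time "choose" step would retarget. */
uint64_t	(*demo_popcount_indirect) (const char *buf, int bytes) =
	demo_popcount_slow;

/*
 * Real function exported under the public name.  Callers link against
 * this symbol; the cost is one extra call per invocation.
 */
uint64_t
demo_popcount(const char *buf, int bytes)
{
	return demo_popcount_indirect(buf, bytes);
}
```

On the other platforms the public name would be the function pointer itself, so the forwarding call disappears.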
I think this patch might be missing the new files.
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
IME this means that the autoconf you are using has been patched. A quick
search on the mailing lists seems to indicate that it might be specific to
Debian [1].
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
+int pg_popcount32_slow(uint32 word);
+int pg_popcount64_slow(uint64 word);
+uint64 pg_popcount_slow(const char *buf, int bytes);
This patch appears to do a lot of refactoring. Would it be possible to
break out the refactoring parts into a prerequisite patch that could be
reviewed and committed independently from the AVX512 stuff?
-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P == 8
/* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
+ if (buf == (const char *)TYPEALIGN(8, buf))
{
- const uint64 *words = (const uint64 *) buf;
+ const uint64 *words = (const uint64 *)buf;
while (bytes >= 8)
{
@@ -309,9 +213,9 @@ pg_popcount(const char *buf, int bytes)
bytes -= 8;
}
- buf = (const char *) words;
+ buf = (const char *)words;
}
-#else
+#elif SIZEOF_VOID_P == 4
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
{
Most, if not all, of these changes seem extraneous. Do we actually need to
more strictly check SIZEOF_VOID_P?
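To make the question concrete, here is a small sketch of the two preprocessor forms side by side. It assumes GCC or Clang, where __SIZEOF_POINTER__ can stand in for PostgreSQL's configure-generated SIZEOF_VOID_P; the function names are hypothetical and only report which chunk size each form would select:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: use the compiler's pointer size where configure has not run. */
#ifndef SIZEOF_VOID_P
#define SIZEOF_VOID_P __SIZEOF_POINTER__
#endif

/* Original form: every pointer size selects exactly one branch. */
int
chunk_bytes_original(void)
{
#if SIZEOF_VOID_P >= 8
	return 8;
#else
	return 4;
#endif
}

/* Patched form: sizes other than 8 or 4 select no chunked loop at all. */
int
chunk_bytes_patched(void)
{
#if SIZEOF_VOID_P == 8
	return 8;
#elif SIZEOF_VOID_P == 4
	return 4;
#else
	return 0;					/* neither chunked loop is compiled */
#endif
}
```

On ordinary 32-bit and 64-bit builds both forms pick the same loop, so the stricter checks do not change which code runs there.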
[0]: https://commitfest.postgresql.org/48/
[1]: /messages/by-id/20230211020042.uthdgj72kp3xlqam@awork3.anarazel.de
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
First, apologies for the incomplete patch. Please find the updated version re-attached.
Now I have some questions...
#1
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
IME this means that the autoconf you are using has been patched. A quick search on the mailing lists seems to indicate that it might be specific to Debian [1].
I am not sure what the ask is here. I made changes to configure.ac and ran autoconf2.69 to get builds to succeed. Do you have separate feedback here?
#2
As for the refactoring, this was done to satisfy previous review feedback about applying the AVX512 CFLAGS to the entire pg_bitutils.c file, mainly to avoid segfaults caused by the AVX512 flags. If it's OK, I would prefer to make a single commit, as the change is pretty small and straightforward.
#3
I am not sure I understand the comment about the SIZEOF_VOID_P checks. Aren't they necessary to choose which functions to call based on 32- or 64-bit architectures?
#4
Would this change qualify for Workflow A as described in [0], so that it can be picked up by a committer, given that it has been reviewed by multiple committers so far? The scope of the change is pretty contained as well.
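For readers following #2, the run-time selection that the refactoring isolates can be sketched roughly as below. This is an illustration under assumptions, not the patch's code: all demo_* names are hypothetical, the accelerated routine is a placeholder that would really live in a separate file built with the AVX512 flags, and a production check would also need to confirm that the OS saves ZMM register state (via XGETBV), which is omitted here.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define DEMO_HAVE_CPUID_COUNT 1
#endif

uint64_t	demo_popcount_choose(const char *buf, int bytes);

/* Starts at the chooser; the first call swaps in a concrete routine. */
uint64_t	(*demo_popcount) (const char *buf, int bytes) = demo_popcount_choose;

/* Report whether CPUID leaf 7 advertises AVX512F and AVX512VPOPCNTDQ. */
int
demo_avx512_popcnt_advertised(void)
{
#ifdef DEMO_HAVE_CPUID_COUNT
	unsigned int exx[4] = {0, 0, 0, 0};

	if (!__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]))
		return 0;
	return (exx[1] & (1u << 16)) != 0 &&	/* EBX bit 16: AVX512F */
		(exx[2] & (1u << 14)) != 0;			/* ECX bit 14: AVX512VPOPCNTDQ */
#else
	return 0;
#endif
}

/* Portable byte-wise count, safe on every CPU. */
uint64_t
demo_popcount_slow(const char *buf, int bytes)
{
	uint64_t	count = 0;

	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		for (; b != 0; b &= (unsigned char) (b - 1))
			count++;
	}
	return count;
}

/* Placeholder: the real routine would use the AVX512 intrinsics. */
uint64_t
demo_popcount_accel(const char *buf, int bytes)
{
	return demo_popcount_slow(buf, bytes);
}

/* First-call hook: pick an implementation, then forward the call. */
uint64_t
demo_popcount_choose(const char *buf, int bytes)
{
	demo_popcount = demo_avx512_popcnt_advertised()
		? demo_popcount_accel
		: demo_popcount_slow;
	return demo_popcount(buf, bytes);
}
```

Keeping the accelerated routine in its own translation unit means the compiler can only emit AVX512 instructions there, and that code is reached only after the CPUID check has passed.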
[0]: https://commitfest.postgresql.org/48/
Thanks,
Paul
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Friday, March 1, 2024 1:45 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Andres Freund <andres@anarazel.de>; Alvaro Herrera <alvherre@alvh.no-ip.org>; Shankaran, Akash <akash.shankaran@intel.com>; Noah Misch <noah@leadboat.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512
Attachments:
v5-0001-Add-support-for-AVX512-implemented-POPCNT.patch
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5db02b2ab7..a5a3246199 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,36 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT instructions using
+# intrinsics used in CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>],
+ [__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 6b87e5c9a8..0252dab6d5 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -15209,7 +15210,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15255,7 +15256,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15279,7 +15280,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15324,7 +15325,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -15348,7 +15349,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -17708,6 +17709,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17742,6 +17778,164 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __immintrin" >&5
+$as_echo_n "checking for __immintrin... " >&6; }
+if ${pgac_cv__immintrin+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+/* Don't exclude code so added return. */
+ return 1701;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__immintrin="yes"
+else
+ pgac_cv__immintrin="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__immintrin" >&5
+$as_echo "$pgac_cv__immintrin" >&6; }
+if test x"$pgac_cv__immintrin" = x"yes"; then
+
+$as_echo "#define HAVE__IMMINTRIN 1" >>confdefs.h
+
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512f"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq -mavx512f"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 6e64ece11d..8fcf635b08 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2068,6 +2068,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2079,6 +2091,36 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+AC_CACHE_CHECK([for __immintrin], [pgac_cv__immintrin],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [[/* Don't exclude code so added return. */
+ return 1701;
+ ]])],
+ [pgac_cv__immintrin="yes"],
+ [pgac_cv__immintrin="no"])])
+if test x"$pgac_cv__immintrin" = x"yes"; then
+ AC_DEFINE(HAVE__IMMINTRIN, 1, [Define to 1 if you have immintrin.])
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 8ed51b6aae..bd297d9fa9 100644
--- a/meson.build
+++ b/meson.build
@@ -1773,6 +1773,37 @@ elif cc.links('''
endif
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.has_header('immintrin.h',
+ include_directories: postgres_inc, args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1,
+ description: 'Define to 1 if you have the immintrin.h header file.')
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -2146,6 +2177,43 @@ elif host_cpu == 'ppc' or host_cpu == 'ppc64'
endif
endif
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ test_flags = ['-mavx512vpopcntdq', '-mavx512f']
+ if host_system == 'windows'
+ test_flags = ['/arch:AVX512']
+ endif
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ int main(void)
+ {
+ __m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ }'''
+ if cc.links(prog, name: '_mm512_* methods with -mavx512vpopcntdq -mavx512f',
+ args: test_c_args + test_flags)
+ have_avx512_popcnt = true
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cdata.set('HAVE__AVX512_POPCNT', 1)
+ cflags_avx512_popcnt = test_flags
+ else
+ have_avx512_popcnt = false
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt = []
+ endif # compile/link test
+endif # host_cpu check
+
###############################################################
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07e73567dc..20e14c6499 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -555,6 +555,12 @@
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 799f70d052..caca78d805 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,16 +303,23 @@ pg_ceil_log2_64(uint64 num)
extern int (*pg_popcount32) (uint32 word);
extern int (*pg_popcount64) (uint64 word);
+#if defined(_MSC_VER)
+extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 (*pg_popcount_indirect)(const char *buf, int bytes);
+#else
+extern uint64 (*pg_popcount)(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
-#endif /* TRY_POPCNT_FAST */
-
/* Count the number of one-bits in a byte array */
extern uint64 pg_popcount(const char *buf, int bytes);
+#endif /* TRY_POPCNT_FAST */
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..ee3647282e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,6 +100,7 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..ef6c02a6bf 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -43,6 +43,8 @@ OBJS = \
inet_net_ntop.o \
noblock.o \
path.o \
+ pg_popcnt_choose.o \
+ pg_popcnt_x86_64_accel.o \
pg_bitutils.o \
pg_strong_random.o \
pgcheckdir.o \
@@ -87,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_popcnt_x86_64_accel.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcnt_x86_64_accel_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcnt_x86_64_accel_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..d7930672cb 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcnt_choose.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -84,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcnt_x86_64_accel', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..942e396141 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -12,16 +12,8 @@
*/
#include "c.h"
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
#include "port/pg_bitutils.h"
-
/*
* Array giving the position of the left-most set bit for each possible
* byte value. We count the right-most position as the 0th bit, and the
@@ -78,6 +70,7 @@ const uint8 pg_rightmost_one_pos[256] = {
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
};
+
/*
* Array giving the number of 1-bits in each possible byte value.
*
@@ -103,123 +96,35 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
+int pg_popcount32_slow(uint32 word);
+int pg_popcount64_slow(uint64 word);
+uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount32_fast(uint32 word);
-static int pg_popcount64_fast(uint64 word);
+extern int pg_popcount32_choose(uint32 word);
+extern int pg_popcount64_choose(uint64 word);
+extern uint64 pg_popcount_choose(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-#endif /* TRY_POPCNT_FAST */
-
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
- return (exx[2] & (1 << 23)) != 0; /* POPCNT */
-}
-
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount64(word);
-}
-
-/*
- * pg_popcount32_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount32_fast(uint32 word)
+#if defined(_MSC_VER)
+uint64 (*pg_popcount_indirect)(const char *buf, int bytes) = pg_popcount_choose;
+uint64 pg_popcount(const char *buf, int bytes)
{
-#ifdef _MSC_VER
- return __popcnt(word);
-#else
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
+ return pg_popcount_indirect(buf, bytes);
}
-
-/*
- * pg_popcount64_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount64_fast(uint64 word)
-{
-#ifdef _MSC_VER
- return __popcnt64(word);
#else
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
#endif
-}
-
-#endif /* TRY_POPCNT_FAST */
-
+#else /* TRY_POPCNT_FAST */
+uint64 pg_popcount(const char *buf, int bytes);
+#endif /* TRY_POPCNT_FAST */
/*
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -241,7 +146,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -286,22 +191,29 @@ pg_popcount64(uint64 word)
return pg_popcount64_slow(word);
}
+uint64
+pg_popcount(const char *buf, int bytes)
+{
+ return pg_popcount_slow(buf, bytes);
+}
+
#endif /* !TRY_POPCNT_FAST */
/*
* pg_popcount
- * Returns the number of 1-bits in buf
+ * Returns the number of 1-bits in buf using either 32 or 64 bit loops
+ * or fallback to __builtin_* or pure software.
*/
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P == 8
/* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
+ if (buf == (const char *)TYPEALIGN(8, buf))
{
- const uint64 *words = (const uint64 *) buf;
+ const uint64 *words = (const uint64 *)buf;
while (bytes >= 8)
{
@@ -309,9 +221,9 @@ pg_popcount(const char *buf, int bytes)
bytes -= 8;
}
- buf = (const char *) words;
+ buf = (const char *)words;
}
-#else
+#elif SIZEOF_VOID_P == 4
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
{
diff --git a/src/port/pg_popcnt_choose.c b/src/port/pg_popcnt_choose.c
new file mode 100644
index 0000000000..e170e16ff9
--- /dev/null
+++ b/src/port/pg_popcnt_choose.c
@@ -0,0 +1,168 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcnt_choose.c
+ * Miscellaneous functions for bit-wise operations.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcnt_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+static bool pg_popcount_available(void);
+int pg_popcount32_choose(uint32 word);
+int pg_popcount64_choose(uint64 word);
+uint64 pg_popcount_choose(const char *buf, int bytes);
+
+extern int pg_popcount32_fast(uint32 word);
+extern int pg_popcount64_fast(uint64 word);
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+extern uint64 pg_popcount512_fast(const char *buf, int bytes);
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
+extern uint64 (*pg_popcount_indirect)(const char *buf, int bytes);
+
+extern int (*pg_popcount32)(uint32 word);
+extern int (*pg_popcount64)(uint64 word);
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+static bool
+pg_popcount_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (exx[2] & (1 << 23)) != 0; /* POPCNT */
+}
+
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is
+ * available. This is similar to the method above; see
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally, we make sure the xgetbv result is consistent with the CPUID
+ * results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* Check for AVX512VPOPCNTDQ and AVX512F */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0)
+ {
+ /*
+ * CPUID succeeded, does the current running OS support the
+ * ZMM registers which are required for AVX512? This check is
+ * required to make sure an old OS on a new CPU is correctly
+ * checked or a VM hypervisor is not excluding AVX512 ZMM
+ * support in the VM; see "5.1.9 Detection of AVX Instructions"
+ * https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
+ */
+ uint64 xcr = 0;
+#ifdef _MSC_VER
+ uint64 highlow = _xgetbv(xcr);
+
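+ /* All three AVX-512 state bits (opmask, ZMM_Hi256, Hi16_ZMM) must be set */
+ return (highlow & 0xE0) == 0xE0;
+#else
+ uint32 high;
+ uint32 low;
+
+ __asm__ __volatile__("xgetbv\t\n" : "=a"(low), "=d"(high) : "c"(xcr));
+ /* All three AVX-512 state bits (opmask, ZMM_Hi256, Hi16_ZMM) must be set */
+ return (low & 0xE0) == 0xE0;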
+#endif
+ } /* POPCNT 512 */
+ return false;
+}
+
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static void set_up_function_pointers()
+{
+ if (pg_popcount512_available())
+ {
+#if defined(_MSC_VER)
+ pg_popcount_indirect = pg_popcount512_fast;
+#else
+ pg_popcount = pg_popcount512_fast;
+#endif
+ }
+ else
+ {
+#if defined(_MSC_VER)
+ pg_popcount_indirect = pg_popcount_slow;
+#else
+ pg_popcount = pg_popcount_slow;
+#endif
+ }
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ }
+}
+
+int pg_popcount32_choose(uint32 word)
+{
+ set_up_function_pointers();
+ return pg_popcount32(word);
+}
+
+int
+pg_popcount64_choose(uint64 word)
+{
+ set_up_function_pointers();
+ return pg_popcount64(word);
+}
+
+uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ set_up_function_pointers();
+#if defined(_MSC_VER)
+ return pg_popcount_indirect(buf, bytes);
+#else
+ return pg_popcount(buf, bytes);
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcnt_x86_64_accel.c b/src/port/pg_popcnt_x86_64_accel.c
new file mode 100644
index 0000000000..aef32c1174
--- /dev/null
+++ b/src/port/pg_popcnt_x86_64_accel.c
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcnt_x86_64_accel.c
+ * Miscellaneous functions for bit-wise operations.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcnt_x86_64_accel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__IMMINTRIN)
+#include <immintrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+extern const uint8 pg_number_of_ones[256];
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
+uint64 pg_popcount512_fast(const char *buf, int bytes);
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+
+/*
+ * pg_popcount32_fast
+ * Return the number of 1 bits set in word
+ */
+int pg_popcount32_fast(uint32 word)
+{
+#ifdef _MSC_VER
+ return __popcnt(word);
+#else
+ uint32 res;
+
+ __asm__ __volatile__(" popcntl %1,%0\n" : "=q"(res) : "rm"(word) : "cc");
+ return (int)res;
+#endif
+}
+
+/*
+ * pg_popcount64_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount64_fast(uint64 word)
+{
+#ifdef _MSC_VER
+ return __popcnt64(word);
+#else
+ uint64 res;
+
+ __asm__ __volatile__(" popcntq %1,%0\n" : "=q"(res) : "rm"(word) : "cc");
+ return (int)res;
+#endif
+}
+
+/*
+ * Use AVX-512 Intrinsics for supported Intel CPUs or fall back to the software
+ * loop in pg_bitutils.c and use the best 32 or 64 bit fast methods. If no fast
+ * methods are used this will fall back to __builtin_* or pure software.
+ */
+uint64
+pg_popcount512_fast(const char *buf, int bytes)
+{
+#if defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1
+ uint64 popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char)*buf++];
+ return popcnt;
+#else
+ return pg_popcount_slow(buf, bytes);
+#endif /* HAVE__IMMINTRIN && HAVE__AVX512_POPCNT */
+}
+#endif /* TRY_POPCNT_FAST */
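For readers following the patch above: the first-call dispatch used in pg_popcnt_choose.c can be illustrated with a small portable sketch. This is not the patch's code. The names popcount_portable, popcount_accel, accel_available and popcount_ptr are hypothetical stand-ins, and the feature probe is hard-wired to false where a real one would consult CPUID and XGETBV.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Portable byte-at-a-time popcount; stands in for pg_popcount_slow(). */
static uint64_t
popcount_portable(const char *buf, size_t bytes)
{
	uint64_t	count = 0;

	while (bytes--)
	{
		unsigned char b = (unsigned char) *buf++;

		while (b)
		{
			b &= (unsigned char) (b - 1);	/* clear the lowest set bit */
			count++;
		}
	}
	return count;
}

/* Hypothetical accelerated variant; here it simply reuses the portable code. */
static uint64_t
popcount_accel(const char *buf, size_t bytes)
{
	return popcount_portable(buf, bytes);
}

/* Hypothetical feature probe; a real one would query CPUID and XGETBV. */
static bool
accel_available(void)
{
	return false;
}

static uint64_t popcount_choose(const char *buf, size_t bytes);

/* The pointer starts at the chooser, so the first call triggers detection. */
static uint64_t (*popcount_ptr) (const char *buf, size_t bytes) = popcount_choose;

/* Runs once: rebinds the pointer, then forwards the call that triggered it. */
static uint64_t
popcount_choose(const char *buf, size_t bytes)
{
	popcount_ptr = accel_available() ? popcount_accel : popcount_portable;
	return popcount_ptr(buf, bytes);
}
```

Callers always go through popcount_ptr. After the first call the chooser is no longer reachable, so detection cost is paid once per process.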
(Please don't top-post on the Postgres lists.)
On Mon, Mar 04, 2024 at 09:39:36PM +0000, Amonson, Paul D wrote:
First, apologies on the patch. Find re-attached updated version.
Thanks for the new version of the patch.
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
IME this means that the autoconf you are using has been patched. A
quick search on the mailing lists seems to indicate that it might be
specific to Debian [1].
I am not sure what the ask is here? I made changes to the configure.ac
and ran autoconf2.69 to get builds to succeed. Do you have a separate
feedback here?
These LARGE_OFF_T changes are unrelated to the patch at hand and should be
removed. This likely means that you are using a patched autoconf that is
making these extra changes.
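As a side note on why these hunks are pure noise: the two LARGE_OFF_T spellings evaluate to the same constant. A minimal sketch, assuming a 64-bit off_t (int64_t is used as a stand-in, and the demo_ and _OLD/_NEW names are made up for this example):

```c
#include <stdint.h>

/* int64_t stands in for a 64-bit off_t; this is an assumption of the sketch. */
typedef int64_t demo_off_t;

/* Spelling on the "-" side of the quoted hunk. */
#define LARGE_OFF_T_OLD \
	(((demo_off_t) 1 << 62) - 1 + ((demo_off_t) 1 << 62))

/* Spelling on the "+" side of the quoted hunk. */
#define LARGE_OFF_T_NEW \
	((((demo_off_t) 1 << 31) << 31) - 1 + (((demo_off_t) 1 << 31) << 31))

/* Returns 1 when both spellings yield 2^63 - 1, the largest 64-bit off_t. */
static int
large_off_t_forms_agree(void)
{
	return LARGE_OFF_T_OLD == LARGE_OFF_T_NEW &&
		LARGE_OFF_T_OLD == INT64_MAX;
}
```

Both forms compute (2^62 - 1) + 2^62 without overflowing, so the regenerated configure differs only textually in these lines.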
As for the refactoring, this was done to satisfy previous review feedback
about applying the AVX512 CFLAGS to the entire pg_bitutils.c file. Mainly
to avoid segfault due to the AVX512 flags. If it's ok, I would prefer to
make a single commit as the change is pretty small and straightforward.
Okay. The only reason I suggest this is to ease review. For example, if
there is some required refactoring that doesn't involve any functionality
changes, it can be advantageous to get that part reviewed and committed
first so that reviewers can better focus on the code for the new feature.
But, of course, that isn't necessary and/or isn't possible in all cases.
I am not sure I understand the comment about the SIZEOF_VOID_P checks.
Aren't they necessary to choose which functions to call based on 32 or 64
bit architectures?
Yes. My comment was that the patch appeared to make unnecessary changes to
this code. Perhaps I am misunderstanding something here.
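To make the point about the word-size checks concrete, here is a rough sketch of the existing shape: a two-way preprocessor choice in which every pointer width gets a chunked path, followed by a byte loop for the tail. It is an illustration only. UINTPTR_MAX stands in for PostgreSQL's SIZEOF_VOID_P, and chunk_t, popcount_chunk and popcount_buffer are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A plain if/else: any pointer width lands in exactly one branch. */
#if UINTPTR_MAX > 0xFFFFFFFFu
typedef uint64_t chunk_t;
#else
typedef uint32_t chunk_t;
#endif

/* Count set bits in one chunk by clearing the lowest set bit repeatedly. */
static int
popcount_chunk(chunk_t w)
{
	int			n = 0;

	while (w != 0)
	{
		w &= w - 1;
		n++;
	}
	return n;
}

/* Chunked loop when the buffer is aligned, then a byte loop for the rest. */
static uint64_t
popcount_buffer(const char *buf, size_t bytes)
{
	uint64_t	count = 0;

	if ((uintptr_t) buf % sizeof(chunk_t) == 0)
	{
		while (bytes >= sizeof(chunk_t))
		{
			chunk_t		w;

			memcpy(&w, buf, sizeof(w));
			count += (uint64_t) popcount_chunk(w);
			buf += sizeof(w);
			bytes -= sizeof(w);
		}
	}
	while (bytes-- > 0)
		count += (uint64_t) popcount_chunk((unsigned char) *buf++);
	return count;
}
```

With an "== 8" / "elif == 4" pair instead, a platform matching neither test would still get correct results from the byte loop, but it would lose the chunked path, which is why the rewrite of these checks looks unnecessary.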
Would this change qualify for Workflow A as described in [0] and can be
picked up by a committer, given it has been reviewed by multiple
committers so far? The scope of the change is pretty contained as well.
I think so. I would still encourage you to create an entry for this so
that it is automatically tested via cfbot [0].
[0]: http://commitfest.cputube.org/
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
I am not sure what "top-post" means, but I am not doing anything other than using "reply to all" in Outlook. Please enlighten me. 😊
This is the new patch with a hand edit to remove the offending lines from the patch file. I did a basic test to make sure the patch would apply and build. It succeeded.
Thanks,
Paul
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Monday, March 4, 2024 2:21 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Andres Freund <andres@anarazel.de>; Alvaro Herrera <alvherre@alvh.no-ip.org>; Shankaran, Akash <akash.shankaran@intel.com>; Noah Misch <noah@leadboat.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512
Attachments:
v6-0001-Add-support-for-AVX512-implemented-POPCNT.patch (application/octet-stream)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5db02b2ab7..a5a3246199 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,36 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT instructions using
+# intrinsics used in CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>],
+ [__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 6b87e5c9a8..0252dab6d5 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -17708,6 +17709,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17742,6 +17778,164 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __immintrin" >&5
+$as_echo_n "checking for __immintrin... " >&6; }
+if ${pgac_cv__immintrin+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+/* Don't exclude code so added return. */
+ return 1701;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__immintrin="yes"
+else
+ pgac_cv__immintrin="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__immintrin" >&5
+$as_echo "$pgac_cv__immintrin" >&6; }
+if test x"$pgac_cv__immintrin" = x"yes"; then
+
+$as_echo "#define HAVE__IMMINTRIN 1" >>confdefs.h
+
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512f... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512f"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+int
+main ()
+{
+__m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512f" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq -mavx512f"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 6e64ece11d..8fcf635b08 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2068,6 +2068,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2079,6 +2091,36 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+AC_CACHE_CHECK([for __immintrin], [pgac_cv__immintrin],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [[/* Don't exclude code so added return. */
+ return 1701;
+ ]])],
+ [pgac_cv__immintrin="yes"],
+ [pgac_cv__immintrin="no"])])
+if test x"$pgac_cv__immintrin" = x"yes"; then
+ AC_DEFINE(HAVE__IMMINTRIN, 1, [Define to 1 if you have immintrin.])
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 8ed51b6aae..bd297d9fa9 100644
--- a/meson.build
+++ b/meson.build
@@ -1773,6 +1773,37 @@ elif cc.links('''
endif
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.has_header('immintrin.h',
+ include_directories: postgres_inc, args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1,
+ description: 'Define to 1 if you have the immintrin.h header file.')
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -2146,6 +2177,43 @@ elif host_cpu == 'ppc' or host_cpu == 'ppc64'
endif
endif
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ test_flags = ['-mavx512vpopcntdq', '-mavx512f']
+ if host_system == 'windows'
+ test_flags = ['/arch:AVX512']
+ endif
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ int main(void)
+ {
+ __m512i tmp __attribute__((aligned(64)));
+ __m512i input = _mm512_setzero_si512();
+ __m512i output = _mm512_popcnt_epi64(input);
+ uint64_t cnt = 999;
+ _mm512_store_si512(&tmp, output);
+ cnt = _mm512_reduce_add_epi64(tmp);
+ /* return computed value, to prevent the above being optimized away */
+ return cnt == 0;
+ }'''
+ if cc.links(prog, name: '_mm512_* methods with -mavx512vpopcntdq -mavx512f',
+ args: test_c_args + test_flags)
+ have_avx512_popcnt = true
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cdata.set('HAVE__AVX512_POPCNT', 1)
+ cflags_avx512_popcnt = test_flags
+ else
+ have_avx512_popcnt = false
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt = []
+ endif # compile/link test
+endif # host_cpu check
+
###############################################################
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07e73567dc..20e14c6499 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -555,6 +555,12 @@
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 799f70d052..caca78d805 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,16 +303,23 @@ pg_ceil_log2_64(uint64 num)
extern int (*pg_popcount32) (uint32 word);
extern int (*pg_popcount64) (uint64 word);
+#if defined(_MSC_VER)
+extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 (*pg_popcount_indirect)(const char *buf, int bytes);
+#else
+extern uint64 (*pg_popcount)(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
-#endif /* TRY_POPCNT_FAST */
-
/* Count the number of one-bits in a byte array */
extern uint64 pg_popcount(const char *buf, int bytes);
+#endif /* TRY_POPCNT_FAST */
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..ee3647282e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,6 +100,7 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..ef6c02a6bf 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -43,6 +43,8 @@ OBJS = \
inet_net_ntop.o \
noblock.o \
path.o \
+ pg_popcnt_choose.o \
+ pg_popcnt_x86_64_accel.o \
pg_bitutils.o \
pg_strong_random.o \
pgcheckdir.o \
@@ -87,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_popcnt_x86_64_accel.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcnt_x86_64_accel_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcnt_x86_64_accel_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..d7930672cb 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcnt_choose.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -84,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcnt_x86_64_accel', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..942e396141 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -12,16 +12,8 @@
*/
#include "c.h"
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
#include "port/pg_bitutils.h"
-
/*
* Array giving the position of the left-most set bit for each possible
* byte value. We count the right-most position as the 0th bit, and the
@@ -78,6 +70,7 @@ const uint8 pg_rightmost_one_pos[256] = {
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
};
+
/*
* Array giving the number of 1-bits in each possible byte value.
*
@@ -103,123 +96,35 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
+int pg_popcount32_slow(uint32 word);
+int pg_popcount64_slow(uint64 word);
+uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount32_fast(uint32 word);
-static int pg_popcount64_fast(uint64 word);
+extern int pg_popcount32_choose(uint32 word);
+extern int pg_popcount64_choose(uint64 word);
+extern uint64 pg_popcount_choose(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-#endif /* TRY_POPCNT_FAST */
-
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
- return (exx[2] & (1 << 23)) != 0; /* POPCNT */
-}
-
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount64(word);
-}
-
-/*
- * pg_popcount32_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount32_fast(uint32 word)
+#if defined(_MSC_VER)
+uint64 (*pg_popcount_indirect)(const char *buf, int bytes) = pg_popcount_choose;
+uint64 pg_popcount(const char *buf, int bytes)
{
-#ifdef _MSC_VER
- return __popcnt(word);
-#else
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
+ return pg_popcount_indirect(buf, bytes);
}
-
-/*
- * pg_popcount64_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount64_fast(uint64 word)
-{
-#ifdef _MSC_VER
- return __popcnt64(word);
#else
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
#endif
-}
-
-#endif /* TRY_POPCNT_FAST */
-
+#else /* TRY_POPCNT_FAST */
+uint64 pg_popcount(const char *buf, int bytes);
+#endif /* TRY_POPCNT_FAST */
/*
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -241,7 +146,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -286,22 +191,29 @@ pg_popcount64(uint64 word)
return pg_popcount64_slow(word);
}
+uint64
+pg_popcount(const char *buf, int bytes)
+{
+ return pg_popcount_slow(buf, bytes);
+}
+
#endif /* !TRY_POPCNT_FAST */
/*
* pg_popcount
- * Returns the number of 1-bits in buf
+ * Returns the number of 1-bits in buf using either 32 or 64 bit loops
+ * or fallback to __builtin_* or pure software.
*/
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P == 8
/* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
+ if (buf == (const char *)TYPEALIGN(8, buf))
{
- const uint64 *words = (const uint64 *) buf;
+ const uint64 *words = (const uint64 *)buf;
while (bytes >= 8)
{
@@ -309,9 +221,9 @@ pg_popcount(const char *buf, int bytes)
bytes -= 8;
}
- buf = (const char *) words;
+ buf = (const char *)words;
}
-#else
+#elif SIZEOF_VOID_P == 4
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
{
diff --git a/src/port/pg_popcnt_choose.c b/src/port/pg_popcnt_choose.c
new file mode 100644
index 0000000000..e170e16ff9
--- /dev/null
+++ b/src/port/pg_popcnt_choose.c
@@ -0,0 +1,168 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcnt_choose.c
+ * Miscellaneous functions for bit-wise operations.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcnt_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+static bool pg_popcount_available(void);
+int pg_popcount32_choose(uint32 word);
+int pg_popcount64_choose(uint64 word);
+uint64 pg_popcount_choose(const char *buf, int bytes);
+
+extern int pg_popcount32_fast(uint32 word);
+extern int pg_popcount64_fast(uint64 word);
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+extern uint64 pg_popcount512_fast(const char *buf, int bytes);
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
+extern uint64 (*pg_popcount_indirect)(const char *buf, int bytes);
+
+extern int (*pg_popcount32)(uint32 word);
+extern int (*pg_popcount64)(uint64 word);
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+static bool
+pg_popcount_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (exx[2] & (1 << 23)) != 0; /* POPCNT */
+}
+
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is
+ * available. This is similar to the method above; see
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally, we make sure the xgetbv result is consistent with the CPUID
+ * results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* Check for AVX512VPOPCNTDQ and AVX512F */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0)
+ {
+ /*
+ * CPUID succeeded, does the current running OS support the
+ * ZMM registers which are required for AVX512? This check is
+ * required to make sure an old OS on a new CPU is correctly
+ * checked or a VM hypervisor is not excluding AVX512 ZMM
+ * support in the VM; see "5.1.9 Detection of AVX Instructions"
+ * https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
+ */
+ uint64 xcr = 0;
+#ifdef _MSC_VER
+ uint64 highlow = _xgetbv(xcr);
+
+ return (highlow & 0xE0) != 0;
+#else
+ uint32 high;
+ uint32 low;
+
+ __asm__ __volatile__("xgetbv\t\n" : "=a"(low), "=d"(high) : "c"(xcr));
+ return (low & 0xE0) != 0;
+#endif
+ } /* POPCNT 512 */
+ return false;
+}
+
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static void set_up_function_pointers()
+{
+ if (pg_popcount512_available())
+ {
+#if defined(_MSC_VER)
+ pg_popcount_indirect = pg_popcount512_fast;
+#else
+ pg_popcount = pg_popcount512_fast;
+#endif
+ }
+ else
+ {
+#if defined(_MSC_VER)
+ pg_popcount_indirect = pg_popcount_slow;
+#else
+ pg_popcount = pg_popcount_slow;
+#endif
+ }
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ }
+}
+
+int pg_popcount32_choose(uint32 word)
+{
+ set_up_function_pointers();
+ return pg_popcount32(word);
+}
+
+int
+pg_popcount64_choose(uint64 word)
+{
+ set_up_function_pointers();
+ return pg_popcount64(word);
+}
+
+uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ set_up_function_pointers();
+#if defined(_MSC_VER)
+ return pg_popcount_indirect(buf, bytes);
+#else
+ return pg_popcount(buf, bytes);
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcnt_x86_64_accel.c b/src/port/pg_popcnt_x86_64_accel.c
new file mode 100644
index 0000000000..aef32c1174
--- /dev/null
+++ b/src/port/pg_popcnt_x86_64_accel.c
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcnt_x86_64_accel.c
+ * Miscellaneous functions for bit-wise operations.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcnt_x86_64_accel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__IMMINTRIN)
+#include <immintrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+extern const uint8 pg_number_of_ones[256];
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
+uint64 pg_popcount512_fast(const char *buf, int bytes);
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+
+/*
+ * pg_popcount32_fast
+ * Return the number of 1 bits set in word
+ */
+int pg_popcount32_fast(uint32 word)
+{
+#ifdef _MSC_VER
+ return __popcnt(word);
+#else
+ uint32 res;
+
+ __asm__ __volatile__(" popcntl %1,%0\n" : "=q"(res) : "rm"(word) : "cc");
+ return (int)res;
+#endif
+}
+
+/*
+ * pg_popcount64_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount64_fast(uint64 word)
+{
+#ifdef _MSC_VER
+ return __popcnt64(word);
+#else
+ uint64 res;
+
+ __asm__ __volatile__(" popcntq %1,%0\n" : "=q"(res) : "rm"(word) : "cc");
+ return (int)res;
+#endif
+}
+
+/*
+ * Use AVX-512 intrinsics on supported Intel CPUs, or fall back to the software
+ * loop in pg_bitutils.c and use the best 32- or 64-bit fast methods. If no fast
+ * methods are used this will fall back to __builtin_* or pure software.
+ */
+uint64
+pg_popcount512_fast(const char *buf, int bytes)
+{
+#if defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1
+ uint64 popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char)*buf++];
+ return popcnt;
+#else
+ return pg_popcount_slow(buf, bytes);
+#endif /* USE_AVX512_CODE */
+}
+#endif /* TRY_POPCNT_FAST */
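The feature test in pg_popcount512_available() above comes down to a few bit masks. As a minimal sketch that is not part of the patch, the masks can be named and exercised on plain integers. The macro and function names below are hypothetical; in real code the register values would come from CPUID leaf 7 (subleaf 0) and from XGETBV. The patch accepts the OS check when any of XCR0 bits 5 to 7 is set. The second helper shows the stricter all-bits form for comparison, which is an assumption about what the referenced Intel guidance intends rather than something this thread has settled.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0) feature bits, matching the constants in the patch. */
#define CPUID7_EBX_AVX512F         (UINT32_C(1) << 16)	/* 0x00010000 */
#define CPUID7_ECX_AVX512VPOPCNTDQ (UINT32_C(1) << 14)	/* 0x00004000 */

/* XCR0 bits 5-7: opmask, upper ZMM halves, and ZMM16-31 state. */
#define XCR0_AVX512_STATE          UINT64_C(0xE0)

/* Hypothetical helper: true if both CPU feature bits are present. */
bool
cpu_has_avx512_popcnt(uint32_t ebx, uint32_t ecx)
{
	return (ebx & CPUID7_EBX_AVX512F) != 0 &&
		(ecx & CPUID7_ECX_AVX512VPOPCNTDQ) != 0;
}

/* Same test as the patch: passes if ANY of the three state bits is set. */
bool
os_enables_zmm_any(uint64_t xcr0)
{
	return (xcr0 & XCR0_AVX512_STATE) != 0;
}

/* Stricter variant: passes only if ALL three state bits are set. */
bool
os_enables_zmm_all(uint64_t xcr0)
{
	return (xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE;
}
```

The two OS checks differ only for partially enabled state, for example an XCR0 value of 0x20, which the first helper accepts and the second rejects.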
On Tue, Mar 05, 2024 at 04:31:15PM +0000, Amonson, Paul D wrote:
I am not sure what "top-post" means but I am not doing anything different
but using "reply to all" in Outlook. Please enlighten me. 😊
The following link provides some more information:
https://wiki.postgresql.org/wiki/Mailing_Lists#Email_etiquette_mechanics
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Tuesday, March 5, 2024 8:38 AM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Andres Freund <andres@anarazel.de>; Alvaro Herrera <alvherre@alvh.no-ip.org>; Shankaran, Akash <akash.shankaran@intel.com>; Noah Misch <noah@leadboat.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512
On Tue, Mar 05, 2024 at 04:31:15PM +0000, Amonson, Paul D wrote:
I am not sure what "top-post" means but I am not doing anything
different but using "reply to all" in Outlook. Please enlighten me. 😊
The following link provides some more information:
https://wiki.postgresql.org/wiki/Mailing_Lists#Email_etiquette_mechanics
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Ahhhh.....Ok... guess it's time to thank Microsoft then. ;) Noted I will try to do the "reduced" bottom-posting. I might slip up occasionally because it's an Intel habit. Is there a way to make Outlook do the leading ">" in a reply for the previous message?
BTW: Created the commit-fest submission.
Paul
On Tue, Mar 5, 2024 at 04:52:23PM +0000, Amonson, Paul D wrote:
Ahhhh.....Ok... guess it's time to thank Microsoft then. ;) Noted I will try to do the "reduced" bottom-posting. I might slip up occasionally because it's an Intel habit. Is there a way to make Outlook do the leading ">" in a reply for the previous message?
Here is a blog post about how complex email posting can be:
https://momjian.us/main/blogs/pgblog/2023.html#September_8_2023
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.
On Tue, Mar 05, 2024 at 04:52:23PM +0000, Amonson, Paul D wrote:
Noted I will try to do the "reduced" bottom-posting. I might slip up
occasionally because it's an Intel habit.
No worries.
Is there a way to make Outlook do the leading ">" in a reply for the
previous message?
I do not know, sorry. I personally use mutt for the lists.
BTW: Created the commit-fest submission.
Thanks. I intend to provide a more detailed review shortly, as I am aiming
to get this one committed for v17.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 2024-Mar-04, Amonson, Paul D wrote:
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
IME this means that the autoconf you are using has been patched. A
quick search on the mailing lists seems to indicate that it might be
specific to Debian [1].
I am not sure what the ask is here? I made changes to the
configure.ac and ran autoconf2.69 to get builds to succeed. Do you
have a separate feedback here?
So what happens here is that autoconf-2.69 as shipped by Debian contains
some patches on top of the one released by GNU. We use the latter, so
if you run Debian's, then the generated configure script will contain
the differences coming from Debian's version.
Really, I don't think this is very important as a review point, because
if the configure.ac file is changed in the patch, it's best for the
committer to run autoconf on their own, using a pristine GNU autoconf;
the configure file in the submitted patch is not relevant, only
configure.ac matters.
What committers do (or should do) is keep an install of autoconf-2.69
straight from GNU.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On Thu, Mar 07, 2024 at 06:53:12PM +0100, Alvaro Herrera wrote:
Really, I don't think this is very important as a review point, because
if the configure.ac file is changed in the patch, it's best for the
committer to run autoconf on their own, using a pristine GNU autoconf;
the configure file in the submitted patch is not relevant, only
configure.ac matters.
Agreed. I didn't intend for this to be a major review point, and I
apologize for the extra noise.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
As promised...
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my
machine, -mavx512vpopcntdq alone is enough to pass this test, so if there
are other instructions required that need -mavx512f, then we might need to
expand the test.
13 files changed, 657 insertions(+), 119 deletions(-)
I still think it's worth breaking this change into at least 2 patches. In
particular, I think there's an opportunity to do the refactoring into
pg_popcnt_choose.c and pg_popcnt_x86_64_accel.c prior to adding the AVX512
stuff. These changes are likely straightforward, and getting them out of
the way early would make it easier to focus on the more interesting
changes. IMHO there are a lot of moving parts in this patch.
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
Is this missing HAVE__CPUIDEX?
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P == 8
/* Process in 64-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(8, buf))
{
@@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes)
buf = (const char *) words;
}
-#else
+#elif SIZEOF_VOID_P == 4
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
{
Apologies for harping on this, but I'm still not seeing the need for these
SIZEOF_VOID_P changes. While it's unlikely that this makes any practical
difference, I see no reason to more strictly check SIZEOF_VOID_P here.
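To make the SIZEOF_VOID_P point concrete, here is a minimal standalone sketch, not code from the patch. It assumes a stand-in SIZEOF_VOID_P derived from UINTPTR_MAX (PostgreSQL gets the real value from its build system), and the names count_bits and popcount_buf are hypothetical. It also copies words with memcpy instead of testing alignment with TYPEALIGN, which is a simplification. With the original ">= 8 / #else" form, exactly one word-sized loop is always compiled in.

```c
#include <stdint.h>
#include <string.h>

/*
 * Stand-in for the SIZEOF_VOID_P value that PostgreSQL's build system
 * provides; derived from UINTPTR_MAX here so the sketch builds on its own.
 */
#ifndef SIZEOF_VOID_P
#if UINTPTR_MAX > 0xFFFFFFFFu
#define SIZEOF_VOID_P 8
#else
#define SIZEOF_VOID_P 4
#endif
#endif

/* Portable bit count for one word: clears the lowest set bit each pass. */
static int
count_bits(uint64_t w)
{
	int			n = 0;

	for (; w != 0; w &= w - 1)
		n++;
	return n;
}

uint64_t
popcount_buf(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

#if SIZEOF_VOID_P >= 8
	/* Pointer width of 8 or more: consume 64-bit chunks. */
	while (bytes >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));
		popcnt += count_bits(word);
		buf += 8;
		bytes -= 8;
	}
#else
	/* Every other pointer width: consume 32-bit chunks. */
	while (bytes >= 4)
	{
		uint32_t	word;

		memcpy(&word, buf, sizeof(word));
		popcnt += count_bits(word);
		buf += 4;
		bytes -= 4;
	}
#endif

	/* Leftover bytes are handled the same way on every platform. */
	while (bytes-- > 0)
		popcnt += count_bits((unsigned char) *buf++);

	return popcnt;
}
```

With the stricter "== 8 / #elif == 4" form, a platform with any other pointer size would compile neither word loop and would rely on the byte loop alone. The result would still be correct, only slower, which fits the remark that the change is unlikely to make a practical difference.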
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+ return popcnt;
+#else
+ return pg_popcount_slow(buf, bytes);
+#endif /* USE_AVX512_CODE */
nitpick: Could we call pg_popcount_slow() in a common section for these
"remaining bytes?"
+#if defined(_MSC_VER)
+ pg_popcount_indirect = pg_popcount512_fast;
+#else
+ pg_popcount = pg_popcount512_fast;
+#endif
These _MSC_VER sections are interesting. I'm assuming this is the
workaround for the MSVC linking issue you mentioned above. I haven't
looked too closely, but I wonder if the CRC32C code (see
src/include/port/pg_crc32c.h) is doing something different to avoid this
issue.
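For readers following the function-pointer discussion, the pattern in question can be sketched in portable C. This is an illustration under stated assumptions, not the patch's code: all names are hypothetical, the availability probe is a placeholder that always answers "no", and the "accelerated" routine simply reuses the portable loop. The pointer starts out aimed at a chooser, the chooser retargets it on the first call, and a plain wrapper function gives callers an ordinary symbol to link against, similar in spirit to the _MSC_VER branch quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t popcount_portable(const char *buf, int bytes);
static uint64_t popcount_accelerated(const char *buf, int bytes);
static uint64_t popcount_choose(const char *buf, int bytes);

/* The pointer initially targets the chooser. */
uint64_t	(*popcount_ptr) (const char *buf, int bytes) = popcount_choose;

/* Counts chooser invocations; present only to make the behavior visible. */
int			choose_calls = 0;

/* Placeholder for a CPUID/XGETBV probe; always reports "not available". */
static bool
accelerated_available(void)
{
	return false;
}

/* Bit-by-bit count over a byte buffer. */
static uint64_t
popcount_portable(const char *buf, int bytes)
{
	uint64_t	n = 0;

	while (bytes-- > 0)
	{
		unsigned char c = (unsigned char) *buf++;

		for (; c != 0; c >>= 1)
			n += c & 1;
	}
	return n;
}

/* Stand-in for a SIMD routine; it just reuses the portable loop here. */
static uint64_t
popcount_accelerated(const char *buf, int bytes)
{
	return popcount_portable(buf, bytes);
}

/* Runs on the first call only: picks a target, then forwards the call. */
static uint64_t
popcount_choose(const char *buf, int bytes)
{
	choose_calls++;
	popcount_ptr = accelerated_available() ?
		popcount_accelerated : popcount_portable;
	return popcount_ptr(buf, bytes);
}

/* Ordinary function wrapper, so callers never touch the pointer itself. */
uint64_t
popcount_entry(const char *buf, int bytes)
{
	return popcount_ptr(buf, bytes);
}
```

After the first call the chooser is never entered again, so the steady-state cost is one indirect call, plus one direct call when going through the wrapper.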
Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned
through this thread and didn't see any recent benchmark results for the
latest form of the patch. I think it's worth verifying that we are still
seeing the expected improvements.
[0]: /messages/by-id/202402071953.5c4z7t6kl7ts@alvherre.pgsql
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Thursday, March 7, 2024 1:36 PM
Subject: Re: Popcount optimization using AVX512
I will be splitting the request into 2 patches. I am attaching the first patch (refactoring only) and I updated the commitfest entry to match this patch. I have a question however:
Do I need to wait for the refactor patch to be merged before I post the AVX portion of this feature in this thread?
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my
machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are
other instructions required that need -mavx512f, then we might need to
expand the test.
First, nice catch on the required flags to build! When I changed my algorithm, the dependence on the -mavx512f flag went away. I will fix this in the second patch (AVX specific).
I still think it's worth breaking this change into at least 2 patches. In particular,
I think there's an opportunity to do the refactoring into pg_popcnt_choose.c
and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff. These
changes are likely straightforward, and getting them out of the way early
would make it easier to focus on the more interesting changes. IMHO there
are a lot of moving parts in this patch.
As stated above I am doing this in 2 patches. :)
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
Is this missing HAVE__CPUIDEX?
Yes, I missed it. I will include it in the second (AVX-specific) patch.
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P == 8
/* Process in 64-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(8, buf))
{
@@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes)
buf = (const char *) words;
}
-#else
+#elif SIZEOF_VOID_P == 4
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
{
Apologies for harping on this, but I'm still not seeing the need for these
SIZEOF_VOID_P changes. While it's unlikely that this makes any practical
difference, I see no reason to more strictly check SIZEOF_VOID_P here.
I got rid of the second occurrence, as I agree it is not needed. But unless you see something I don't, how do I know which function to call on a 32-bit versus a 64-bit architecture? Maybe I am missing something obvious? What exactly do you suggest here? I am happy to always call either pg_popcount32() or pg_popcount64() with the understanding that it may not be optimal, but I do need to know which to use.
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+ return popcnt;
+#else
+ return pg_popcount_slow(buf, bytes);
+#endif /* USE_AVX512_CODE */
nitpick: Could we call pg_popcount_slow() in a common section for these
"remaining bytes?"
Agreed, will fix in the second patch as well.
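The tail-handling change agreed to here could look roughly like the sketch below. It is a scalar stand-in, not the patch: the 64-byte block loop uses eight 64-bit words where the real code would use AVX-512 intrinsics, and the names popcount_slow and popcount_blocks are hypothetical. The point is only the shape, where the bulk loop hands whatever is left to the shared fallback instead of carrying its own byte loop.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical shared fallback, playing the role of pg_popcount_slow(). */
uint64_t
popcount_slow(const char *buf, int bytes)
{
	uint64_t	n = 0;

	while (bytes-- > 0)
	{
		unsigned char c = (unsigned char) *buf++;

		/* Clear the lowest set bit until none remain. */
		for (; c != 0; c &= (unsigned char) (c - 1))
			n++;
	}
	return n;
}

/* Bulk routine: 64-byte blocks first, then the shared fallback. */
uint64_t
popcount_blocks(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	while (bytes >= 64)
	{
		uint64_t	lanes[8];	/* scalar stand-in for one 512-bit load */

		memcpy(lanes, buf, sizeof(lanes));
		for (int i = 0; i < 8; i++)
		{
			for (uint64_t w = lanes[i]; w != 0; w &= w - 1)
				popcnt++;
		}
		buf += 64;
		bytes -= 64;
	}

	/* Common section: the remainder goes through the same slow path. */
	return popcnt + popcount_slow(buf, bytes);
}
```

One consequence of this shape is that buffers shorter than one block never enter the wide loop at all and cost the same as calling the fallback directly.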
+#if defined(_MSC_VER)
+ pg_popcount_indirect = pg_popcount512_fast;
+#else
+ pg_popcount = pg_popcount512_fast;
+#endif
These _MSC_VER sections are interesting. I'm assuming this is the
workaround for the MSVC linking issue you mentioned above. I haven't
looked too closely, but I wonder if the CRC32C code (see
src/include/port/pg_crc32c.h) is doing something different to avoid this issue.
Looking at the latest master branch, I see what changes are needed; I will implement them using the PGDLLIMPORT macro in the second patch.
Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned
through this thread and didn't see any recent benchmark results for the latest
form of the patch. I think it's worth verifying that we are still seeing the
expected improvements.
I will get new benchmarks using the same process I used before (from Akash) so I get apples to apples. These are pending completion of the second patch which is still in progress.
Just a reminder, I asked questions above about 1) multi-part dependent patches and 2) what specifically to do about the SIZEOF_VOID_P checks. :)
Thanks,
Paul
Attachments:
v7-0001-Refactor-POPCNT.patch
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..12c56b0ba7 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,8 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcnt_choose.o \
+ pg_popcnt_x86_64_accel.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..ed8828c739 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,8 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcnt_choose.c',
+ 'pg_popcnt_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..5ace0a5b13 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -12,13 +12,6 @@
*/
#include "c.h"
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
#include "port/pg_bitutils.h"
@@ -103,123 +96,14 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount32_fast(uint32 word);
-static int pg_popcount64_fast(uint64 word);
-
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-#endif /* TRY_POPCNT_FAST */
-
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
- return (exx[2] & (1 << 23)) != 0; /* POPCNT */
-}
-
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount64(word);
-}
-
-/*
- * pg_popcount32_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount32_fast(uint32 word)
-{
-#ifdef _MSC_VER
- return __popcnt(word);
-#else
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-/*
- * pg_popcount64_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount64_fast(uint64 word)
-{
-#ifdef _MSC_VER
- return __popcnt64(word);
-#else
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-#endif /* TRY_POPCNT_FAST */
-
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
/*
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -241,7 +125,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
diff --git a/src/port/pg_popcnt_choose.c b/src/port/pg_popcnt_choose.c
new file mode 100644
index 0000000000..70c70f5742
--- /dev/null
+++ b/src/port/pg_popcnt_choose.c
@@ -0,0 +1,97 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcnt_choose.c
+ * For FAST operations, these methods do runtime checks and set the
+ * appropriate function pointers.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcnt_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+
+/* In pg_bitutils.c file */
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+
+#ifdef TRY_POPCNT_FAST
+static bool pg_popcount_available(void);
+static int pg_popcount32_choose(uint32 word);
+static int pg_popcount64_choose(uint64 word);
+
+/* In pg_popcnt_*_accel source file. */
+extern int pg_popcount32_fast(uint32 word);
+extern int pg_popcount64_fast(uint64 word);
+
+int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+#endif /* TRY_POPCNT_FAST */
+
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+static bool
+pg_popcount_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (exx[2] & (1 << 23)) != 0; /* POPCNT */
+}
+
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static void setup_function_pointers()
+{
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ }
+}
+
+static int
+pg_popcount32_choose(uint32 word)
+{
+ setup_function_pointers();
+ return pg_popcount32(word);
+}
+
+static int
+pg_popcount64_choose(uint64 word)
+{
+ setup_function_pointers();
+ return pg_popcount64(word);
+}
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcnt_x86_64_accel.c b/src/port/pg_popcnt_x86_64_accel.c
new file mode 100644
index 0000000000..6e36e90e16
--- /dev/null
+++ b/src/port/pg_popcnt_x86_64_accel.c
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcnt_x86_64_accel.c
+ * Fast POPCNT methods for x86_64.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcnt_x86_64_accel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include "port/pg_bitutils.h"
+
+extern int pg_popcount32_fast(uint32 word);
+extern int pg_popcount64_fast(uint64 word);
+
+#ifdef TRY_POPCNT_FAST
+/*
+ * pg_popcount32_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount32_fast(uint32 word)
+{
+#ifdef _MSC_VER
+ return __popcnt(word);
+#else
+ uint32 res;
+
+__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount64_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount64_fast(uint64 word)
+{
+#ifdef _MSC_VER
+ return __popcnt64(word);
+#else
+ uint64 res;
+
+__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
On Mon, Mar 11, 2024 at 09:59:53PM +0000, Amonson, Paul D wrote:
I will be splitting the request into 2 patches. I am attaching the first
patch (refactoring only) and I updated the commitfest entry to match this
patch. I have a question however:
Do I need to wait for the refactor patch to be merged before I post the
AVX portion of this feature in this thread?
Thanks. There's no need to wait to post the AVX portion. I recommend
using "git format-patch" to construct the patch set for the lists.
Apologies for harping on this, but I'm still not seeing the need for these
SIZEOF_VOID_P changes. While it's unlikely that this makes any practical
difference, I see no reason to more strictly check SIZEOF_VOID_P here.
I got rid of the second occurrence as I agree it is not needed, but unless
you see something I don't, how do I know which function to call between a
32-bit and 64-bit architecture? Maybe I am missing something obvious?
What exactly do you suggest here? I am happy to always call either
pg_popcount32() or pg_popcount64() with the understanding that it may not
be optimal, but I do need to know which to use.
I'm recommending that we don't change any of the code in the pg_popcount()
function (which is renamed to pg_popcount_slow() in your v6 patch). If
pointers are 8 or more bytes, we'll try to process the buffer in 64-bit
chunks. Else, we'll try to process it in 32-bit chunks. Any remaining
bytes will be processed one-by-one.
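The chunking strategy described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in pg_bitutils.c: the names popcount_bytes, popcount32_sw and popcount64_sw are hypothetical, the word-level helpers use a portable bit-clearing loop, and memcpy is used here only to sidestep alignment concerns in the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable helper: count bits by clearing the lowest set bit. */
static int
popcount32_sw(uint32_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

/* Same idea for 64-bit words. */
static int
popcount64_sw(uint64_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

/* Count the set bits in a buffer of the given length in bytes. */
uint64_t
popcount_bytes(const char *buf, size_t bytes)
{
    uint64_t popcnt = 0;

    assert(buf != NULL || bytes == 0);

    if (sizeof(void *) >= 8)
    {
        /* Pointers are 8 or more bytes: consume 64-bit chunks. */
        while (bytes >= 8)
        {
            uint64_t w;

            memcpy(&w, buf, sizeof(w));
            popcnt += popcount64_sw(w);
            buf += 8;
            bytes -= 8;
        }
    }
    else
    {
        /* Otherwise consume 32-bit chunks. */
        while (bytes >= 4)
        {
            uint32_t w;

            memcpy(&w, buf, sizeof(w));
            popcnt += popcount32_sw(w);
            buf += 4;
            bytes -= 4;
        }
    }

    /* Whatever is left over is processed one byte at a time. */
    while (bytes > 0)
    {
        popcnt += popcount32_sw((unsigned char) *buf);
        buf++;
        bytes--;
    }

    return popcnt;
}
```

The point of the sketch is only that the pointer-size test picks the chunk width once, and the byte loop at the end handles any tail regardless of which branch ran.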
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
A couple of thoughts on v7-0001:
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+/* In pg_popcnt_*_accel source file. */
+extern int pg_popcount32_fast(uint32 word);
+extern int pg_popcount64_fast(uint64 word);
Can these prototypes be moved to a header file (maybe pg_bitutils.h)? It
looks like these are defined twice in the patch, and while I'm not positive
that it's against project policy to declare extern function prototypes in
.c files, it appears to be pretty rare.
+ 'pg_popcnt_choose.c',
+ 'pg_popcnt_x86_64_accel.c',
I think we want these to be architecture-specific, i.e., only built for
x86_64 if the compiler knows how to use the relevant instructions. There
is a good chance that we'll want to add similar support for other systems.
The CRC32C files are probably a good reference point for how to do this.
+#ifdef TRY_POPCNT_FAST
IIUC this macro can be set if either 1) the popcntq test in the
autoconf/meson scripts passes or 2) we're building with MSVC on x86_64. I
wonder if it would be better to move the MSVC/x86_64 check to the
autoconf/meson scripts so that we could avoid surrounding large portions of
the popcount code with this macro. This might even be a necessary step
towards building these files in an architecture-specific fashion.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Wednesday, March 13, 2024 9:39 AM
To: Amonson, Paul D <paul.d.amonson@intel.com>
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+/* In pg_popcnt_*_accel source file. */
+extern int pg_popcount32_fast(uint32 word);
+extern int pg_popcount64_fast(uint64 word);
Can these prototypes be moved to a header file (maybe pg_bitutils.h)? It
looks like these are defined twice in the patch, and while I'm not positive that
it's against project policy to declare extern function prototypes in .c files, it
appears to be pretty rare.
Originally, I intentionally did not put these in the header file because I wanted them to be private; they are not defined in this .c file, hence the "extern". I now realize the "extern" keyword is not needed to accomplish my goal. Will fix by removing the "extern" keyword.
+ 'pg_popcnt_choose.c',
+ 'pg_popcnt_x86_64_accel.c',
I think we want these to be architecture-specific, i.e., only built for
x86_64 if the compiler knows how to use the relevant instructions. There is a
good chance that we'll want to add similar support for other systems.
The CRC32C files are probably a good reference point for how to do this.
I will look at this for the 'pg_popcnt_x86_64_accel.c' file but the 'pg_popcnt_choose.c' file is intended to be for any platform that may need accelerators including a possible future ARM accelerator.
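For readers unfamiliar with the pattern that pg_popcnt_choose.c carries, here is a minimal sketch of a self-replacing function pointer. It is an illustration under stated assumptions, not the patch's code: all names (popcount32, popcount32_choose, popcount32_portable, popcount32_accel, accel_available) are hypothetical, and the "accelerated" path simply uses the GCC/Clang builtin as a stand-in for a platform-specific implementation whose availability a real build would detect at run time (for example via CPUID on x86).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

int popcount32_choose(uint32_t word);

/* Callers go through this pointer; it starts out aimed at the chooser. */
int (*popcount32) (uint32_t word) = popcount32_choose;

/* Portable fallback: clear the lowest set bit until none remain. */
static int
popcount32_portable(uint32_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

/* Stand-in for an accelerated, platform-specific implementation. */
static int
popcount32_accel(uint32_t word)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(word);
#else
    return popcount32_portable(word);
#endif
}

/* Assumption: a real build would perform a run-time CPU feature check here. */
static bool
accel_available(void)
{
#if defined(__GNUC__) || defined(__clang__)
    return true;
#else
    return false;
#endif
}

/*
 * Runs on the first call only: pick an implementation, repoint the
 * function pointer at it, then forward the original request.
 */
int
popcount32_choose(uint32_t word)
{
    popcount32 = accel_available() ? popcount32_accel : popcount32_portable;
    assert(popcount32 != popcount32_choose);
    return popcount32(word);
}
```

After the first call, every later call through popcount32 goes straight to the selected implementation without repeating the feature check. Adding another platform in this style means adding another candidate to the chooser, which is where the concern about #ifdef growth comes from. Note that the pointer update in this sketch is not synchronized.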
+#ifdef TRY_POPCNT_FAST
IIUC this macro can be set if either 1) the popcntq test in the autoconf/meson
scripts passes or 2) we're building with MSVC on x86_64. I wonder if it would
be better to move the MSVC/x86_64 check to the autoconf/meson scripts so
that we could avoid surrounding large portions of the popcount code with this
macro. This might even be a necessary step towards building these files in an
architecture-specific fashion.
I see the point here; however, this will take some time to get right especially since I don't have a Windows box to do compiles on. Should I attempt to do this in this patch?
Thanks,
Paul
On Wed, Mar 13, 2024 at 05:52:14PM +0000, Amonson, Paul D wrote:
I think we want these to be architecture-specific, i.e., only built for
x86_64 if the compiler knows how to use the relevant instructions. There is a
good chance that we'll want to add similar support for other systems.
The CRC32C files are probably a good reference point for how to do this.
I will look at this for the 'pg_popcnt_x86_64_accel.c' file but the
'pg_popcnt_choose.c' file is intended to be for any platform that may
need accelerators including a possible future ARM accelerator.
I worry that using the same file for *_choose.c for all architectures would
become rather #ifdef heavy. Since we are already separating out this code
into new files, IMO we might as well try to avoid too many #ifdefs, too.
But this is admittedly less important right now because there's almost no
chance of any new architecture support here for v17.
+#ifdef TRY_POPCNT_FAST
IIUC this macro can be set if either 1) the popcntq test in the autoconf/meson
scripts passes or 2) we're building with MSVC on x86_64. I wonder if it would
be better to move the MSVC/x86_64 check to the autoconf/meson scripts so
that we could avoid surrounding large portions of the popcount code with this
macro. This might even be a necessary step towards building these files in an
architecture-specific fashion.
I see the point here; however, this will take some time to get right
especially since I don't have a Windows box to do compiles on. Should I
attempt to do this in this patch?
This might also be less important given the absence of any imminent new
architecture support in this area. I'm okay with it, given we are just
maintaining the status quo.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Monday, March 11, 2024 6:35 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Thanks. There's no need to wait to post the AVX portion. I recommend using
"git format-patch" to construct the patch set for the lists.
After exploring the git format-patch command, I think I understand what you need. Attached.
What exactly do you suggest here? I am happy to always call either
pg_popcount32() or pg_popcount64() with the understanding that it may
not be optimal, but I do need to know which to use.
I'm recommending that we don't change any of the code in the pg_popcount()
function (which is renamed to pg_popcount_slow() in your v6 patch). If
pointers are 8 or more bytes, we'll try to process the buffer in 64-bit chunks.
Else, we'll try to process it in 32-bit chunks. Any remaining bytes will be
processed one-by-one.
Ok, we are on the same page now. :) It is already fixed that way in the refactor patch #1.
As for new performance numbers: I just ran a full suite like I did earlier in the process. My latest results, on the equivalent of a pgbench scale factor 10 DB with the target column having varying column widths and appropriate random data, show a 1.2% improvement with a 2.2% margin of error at a 98% confidence level. Still seeing improvement and no regressions.
As stated in the previous, separate chain, I updated the code to remove the extra "extern" keywords.
Thanks,
Paul
Attachments:
v8-0001-Refactor-POPCNT-code-refactored-for-future-accelerat.patch (application/octet-stream)
From 57256412ce11b006ad383fc689a0fd28716632e0 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Wed, 13 Mar 2024 11:56:44 -0700
Subject: [PATCH 1/2] [Refactor] POPCNT code refactored for future acceleration
of the function.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
src/port/Makefile | 2 +
src/port/meson.build | 2 +
src/port/pg_bitutils.c | 124 +-----------------------------
src/port/pg_popcnt_choose.c | 97 +++++++++++++++++++++++
src/port/pg_popcnt_x86_64_accel.c | 55 +++++++++++++
5 files changed, 160 insertions(+), 120 deletions(-)
create mode 100644 src/port/pg_popcnt_choose.c
create mode 100644 src/port/pg_popcnt_x86_64_accel.c
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..12c56b0ba7 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,8 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcnt_choose.o \
+ pg_popcnt_x86_64_accel.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..ed8828c739 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,8 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcnt_choose.c',
+ 'pg_popcnt_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..d8b045d0a4 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -12,13 +12,6 @@
*/
#include "c.h"
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
#include "port/pg_bitutils.h"
@@ -103,123 +96,14 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
-
-#ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static int pg_popcount32_fast(uint32 word);
-static int pg_popcount64_fast(uint64 word);
-
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-#endif /* TRY_POPCNT_FAST */
-
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
- return (exx[2] & (1 << 23)) != 0; /* POPCNT */
-}
-
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- }
-
- return pg_popcount64(word);
-}
-
-/*
- * pg_popcount32_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount32_fast(uint32 word)
-{
-#ifdef _MSC_VER
- return __popcnt(word);
-#else
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-/*
- * pg_popcount64_fast
- * Return the number of 1 bits set in word
- */
-static int
-pg_popcount64_fast(uint64 word)
-{
-#ifdef _MSC_VER
- return __popcnt64(word);
-#else
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-#endif /* TRY_POPCNT_FAST */
-
+int pg_popcount32_slow(uint32 word);
+int pg_popcount64_slow(uint64 word);
/*
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -241,7 +125,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static int
+int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
diff --git a/src/port/pg_popcnt_choose.c b/src/port/pg_popcnt_choose.c
new file mode 100644
index 0000000000..89fcf2609c
--- /dev/null
+++ b/src/port/pg_popcnt_choose.c
@@ -0,0 +1,97 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcnt_choose.c
+ * For FAST operations, these methods do runtime checks and set the
+ * appropriate function pointers.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcnt_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+
+/* In pg_bitutils.c file */
+int pg_popcount32_slow(uint32 word);
+int pg_popcount64_slow(uint64 word);
+
+#ifdef TRY_POPCNT_FAST
+static bool pg_popcount_available(void);
+static int pg_popcount32_choose(uint32 word);
+static int pg_popcount64_choose(uint64 word);
+
+/* In pg_popcnt_*_accel source file. */
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+
+int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+#endif /* TRY_POPCNT_FAST */
+
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+static bool
+pg_popcount_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (exx[2] & (1 << 23)) != 0; /* POPCNT */
+}
+
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static void setup_function_pointers()
+{
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ }
+}
+
+static int
+pg_popcount32_choose(uint32 word)
+{
+ setup_function_pointers();
+ return pg_popcount32(word);
+}
+
+static int
+pg_popcount64_choose(uint64 word)
+{
+ setup_function_pointers();
+ return pg_popcount64(word);
+}
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcnt_x86_64_accel.c b/src/port/pg_popcnt_x86_64_accel.c
new file mode 100644
index 0000000000..2e9b2ee774
--- /dev/null
+++ b/src/port/pg_popcnt_x86_64_accel.c
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcnt_x86_64_accel.c
+ * Fast POPCNT methods for x86_64.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcnt_x86_64_accel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include "port/pg_bitutils.h"
+
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+
+#ifdef TRY_POPCNT_FAST
+/*
+ * pg_popcount32_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount32_fast(uint32 word)
+{
+#ifdef _MSC_VER
+ return __popcnt(word);
+#else
+ uint32 res;
+
+__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount64_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount64_fast(uint64 word)
+{
+#ifdef _MSC_VER
+ return __popcnt64(word);
+#else
+ uint64 res;
+
+__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
--
2.34.1
v8-0002-Feat-Add-AVX-512-POPCNT-support-initial-checkin.patch (application/octet-stream)
From 67a887d8fe8be389c709fdc87b196cd2ad5d2bf7 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Wed, 13 Mar 2024 12:49:57 -0700
Subject: [PATCH 2/2] [Feat] Add AVX-512 POPCNT support initial checkin.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
config/c-compiler.m4 | 37 ++++++
configure | 205 ++++++++++++++++++++++++++++++
configure.ac | 44 +++++++
meson.build | 72 +++++++++++
src/Makefile.global.in | 1 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 10 +-
src/makefiles/meson.build | 1 +
src/port/Makefile | 5 +
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 9 +-
src/port/pg_popcnt_choose.c | 67 +++++++++-
src/port/pg_popcnt_x86_64_accel.c | 36 +++++-
13 files changed, 494 insertions(+), 11 deletions(-)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..54f7415e5a 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,40 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT instructions using
+# intrinsics used in CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>],
+ [const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
\ No newline at end of file
diff --git a/configure b/configure
index 36feeafbb2..0fbfc7c78f 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17474,175 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuidex(exx[0], 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __immintrin" >&5
+$as_echo_n "checking for __immintrin... " >&6; }
+if ${pgac_cv__immintrin+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+/* Don't exclude code so added return. */
+ return 1701;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__immintrin="yes"
+else
+ pgac_cv__immintrin="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__immintrin" >&5
+$as_echo "$pgac_cv__immintrin" >&6; }
+if test x"$pgac_cv__immintrin" = x"yes"; then
+
+$as_echo "#define HAVE__IMMINTRIN 1" >>confdefs.h
+
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+
+$as_echo "#define HAVE__AVX512_POPCNT 1" >>confdefs.h
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..3c741d457d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2075,38 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+AC_CACHE_CHECK([for __immintrin], [pgac_cv__immintrin],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [[/* Don't exclude code so added return. */
+ return 1701;
+ ]])],
+ [pgac_cv__immintrin="yes"],
+ [pgac_cv__immintrin="no"])])
+if test x"$pgac_cv__immintrin" = x"yes"; then
+ AC_DEFINE(HAVE__IMMINTRIN, 1, [Define to 1 if you have immintrin.])
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ AC_DEFINE(HAVE__AVX512_POPCNT, 1, [Define to 1 if you have cpu
+ support for AVX512 POPCNT.])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 85788f9dd8..39480b4251 100644
--- a/meson.build
+++ b/meson.build
@@ -1773,6 +1773,37 @@ elif cc.links('''
endif
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.has_header('immintrin.h',
+ include_directories: postgres_inc, args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1,
+ description: 'Define to 1 if you have the immintrin.h header file.')
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -2147,6 +2178,47 @@ elif host_cpu == 'ppc' or host_cpu == 'ppc64'
endif
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ test_flags = ['-mavx512vpopcntdq']
+ if host_system == 'windows'
+ test_flags = ['/arch:AVX512']
+ endif
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ #include <stdlib.h>
+ #include <string.h>
+ int main(void)
+ {
+ const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ }'''
+ if cc.links(prog, name: '_mm512_* methods with -mavx512vpopcntdq flag.',
+ args: test_c_args + test_flags)
+ have_avx512_popcnt = true
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cdata.set('HAVE__AVX512_POPCNT', 1)
+ cflags_avx512_popcnt = test_flags
+ else
+ have_avx512_popcnt = false
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt = []
+ endif # compile/link test
+endif # host_cpu check
+
###############################################################
# Library / OS tests
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..e4d56dee79 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -555,6 +555,18 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
+
+/* Define to 1 if you have AVX512. */
+#undef HAVE__AVX512_POPCNT
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..cc42ce49c9 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -300,19 +300,19 @@ pg_ceil_log2_64(uint64 num)
#ifdef TRY_POPCNT_FAST
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
-extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
-extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-
+extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
+extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
-#endif /* TRY_POPCNT_FAST */
-
/* Count the number of one-bits in a byte array */
extern uint64 pg_popcount(const char *buf, int bytes);
+#endif /* TRY_POPCNT_FAST */
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..ee3647282e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,6 +100,7 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index 12c56b0ba7..0b76926301 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -89,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_popcnt_x86_64_accel.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcnt_x86_64_accel_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcnt_x86_64_accel_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index ed8828c739..d7930672cb 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,7 +8,6 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_popcnt_choose.c',
- 'pg_popcnt_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -86,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcnt_x86_64_accel', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -100,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index d8b045d0a4..22c51c1679 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -98,6 +98,7 @@ const uint8 pg_number_of_ones[256] = {
int pg_popcount32_slow(uint32 word);
int pg_popcount64_slow(uint64 word);
+uint64 pg_popcount_slow(const char *buf, int bytes);
/*
* pg_popcount32_slow
@@ -170,6 +171,12 @@ pg_popcount64(uint64 word)
return pg_popcount64_slow(word);
}
+uint64
+pg_popcount(const char *buf, int bytes)
+{
+ return pg_popcount_slow(buf, bytes);
+}
+
#endif /* !TRY_POPCNT_FAST */
/*
@@ -177,7 +184,7 @@ pg_popcount64(uint64 word)
* Returns the number of 1-bits in buf
*/
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcnt_choose.c b/src/port/pg_popcnt_choose.c
index 89fcf2609c..ac1344415d 100644
--- a/src/port/pg_popcnt_choose.c
+++ b/src/port/pg_popcnt_choose.c
@@ -26,18 +26,23 @@
/* In pg_bitutils.c file */
int pg_popcount32_slow(uint32 word);
int pg_popcount64_slow(uint64 word);
+uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
+static bool pg_popcount512_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
/* In pg_popcnt_*_accel source file. */
int pg_popcount32_fast(uint32 word);
int pg_popcount64_fast(uint64 word);
+uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -61,6 +66,52 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is
+ * available. This is similar to the method above; see
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally, we make sure the xgetbv result is consistent with the CPUID
+ * results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* Check for AVX512VPOPCNTDQ and AVX512F */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0)
+ {
+ /*
+ * CPUID succeeded, does the current running OS support the
+ * ZMM registers which are required for AVX512? This check is
+ * required to make sure an old OS on a new CPU is correctly
+ * checked or a VM hypervisor is not excluding AVX512 ZMM
+ * support in the VM; see "5.1.9 Detection of AVX Instructions"
+ * https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
+ */
+ uint64 xcr = 0;
+#ifdef _MSC_VER
+ uint64 highlow = _xgetbv(xcr);
+
+ return (highlow & 0xE0) != 0;
+#else
+ uint32 high;
+ uint32 low;
+
+ __asm__ __volatile__("xgetbv\t\n" : "=a"(low), "=d"(high) : "c"(xcr));
+ return (low & 0xE0) != 0;
+#endif
+ } /* POPCNT 512 */
+ return false;
+}
+
/*
* These functions get called on the first call to pg_popcount32 etc.
* They detect whether we can use the asm implementations, and replace
@@ -69,15 +120,23 @@ pg_popcount_available(void)
*/
static void setup_function_pointers()
{
- if (pg_popcount_available())
+ if (pg_popcount512_available())
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_slow;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
}
}
@@ -94,4 +153,10 @@ pg_popcount64_choose(uint64 word)
setup_function_pointers();
return pg_popcount64(word);
}
+
+static uint64
+pg_popcount_choose(const char* buf, int bytes) {
+ setup_function_pointers();
+ return pg_popcount(buf, bytes);
+}
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcnt_x86_64_accel.c b/src/port/pg_popcnt_x86_64_accel.c
index 2e9b2ee774..ecc07afd37 100644
--- a/src/port/pg_popcnt_x86_64_accel.c
+++ b/src/port/pg_popcnt_x86_64_accel.c
@@ -14,8 +14,14 @@
#include "port/pg_bitutils.h"
+#if defined(HAVE__IMMINTRIN)
+#include <immintrin.h>
+#endif
+
int pg_popcount32_fast(uint32 word);
int pg_popcount64_fast(uint64 word);
+uint64 pg_popcount_fast(const char *buf, int bytes);
+uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
/*
@@ -52,4 +58,32 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
#endif
}
-#endif /* TRY_POPCNT_FAST */
+/*
+ * Use AVX-512 Intrinsics for supported Intel CPUs or fall back the the software
+ * loop in pg_bunutils.c and use the best 32 or 64 bit fast methods. If no fast
+ * methods are used this will fall back to __builtin_* or pure software.
+ */
+uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+ #if defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1
+ __m512i accumulator = _mm512_setzero_si512();
+
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+#endif /* defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1 */
+
+ /* Process any remaining bytes */
+ return popcnt + pg_popcount_slow(buf, bytes);
+}
+#endif /* TRY_POPCNT_FAST */
--
2.34.1
On Thu, Mar 14, 2024 at 07:50:46PM +0000, Amonson, Paul D wrote:
As for new performance numbers: I just ran a full suite like I did
earlier in the process. My latest results, on the equivalent of a pgbench
scale factor 10 DB with the target column having varying column widths
and appropriate random data, are a 1.2% improvement with a 2.2% margin of
error at a 98% confidence level. Still seeing improvement and no
regressions.
Which test suite did you run? Those numbers seem potentially
indistinguishable from noise, which probably isn't great for such a large
patch set.
I ran John Naylor's test_popcount module [0] with the following command on
an i7-1195G7:
time psql postgres -c 'select drive_popcount(10000000, 1024)'
Without your patches, this seems to take somewhere around 8.8 seconds.
With your patches, it takes 0.6 seconds. (I re-compiled and re-ran the
tests a couple of times because I had a difficult time believing the amount
of improvement.)
[0]: /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Friday, March 15, 2024 8:06 AM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Andres Freund <andres@anarazel.de>; Alvaro Herrera <alvherre@alvh.no-ip.org>;
Shankaran, Akash <akash.shankaran@intel.com>; Noah Misch <noah@leadboat.com>;
Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>;
pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512
Which test suite did you run? Those numbers seem potentially
indistinguishable from noise, which probably isn't great for such a large patch
set.
I ran...
psql -c "select bitcount(column) from table;"
...in a loop with "column" widths of 84, 4096, 8192, and 16384 containing random data. The DB has 1 million rows. In the loop, before calling the select, I have code to clear all system caches. If I omit the code to clear system caches, the margin of error remains the same but the improvement changes from 1.2% to 14.6% (much less I/O when cached data is available).
I ran John Naylor's test_popcount module [0] with the following command on
an i7-1195G7:
time psql postgres -c 'select drive_popcount(10000000, 1024)'
Without your patches, this seems to take somewhere around 8.8 seconds.
With your patches, it takes 0.6 seconds. (I re-compiled and re-ran the tests a
couple of times because I had a difficult time believing the amount of
improvement.)
When I tested the code outside Postgres in a micro benchmark I got 200-300% improvements. Your results are interesting, as they imply more than 300% improvement. Let me do some research on the benchmark you referenced. However, in all cases it seems that there is no regression, so should we move forward on merging while I run some more local tests?
Thanks,
Paul
-----Original Message-----
From: Amonson, Paul D <paul.d.amonson@intel.com>
Sent: Friday, March 15, 2024 8:31 AM
To: Nathan Bossart <nathandbossart@gmail.com>
...
When I tested the code outside Postgres in a micro benchmark I got
200-300% improvements. Your results are interesting, as they imply more than
300% improvement. Let me do some research on the benchmark you
referenced. However, in all cases it seems that there is no regression, so should
we move forward on merging while I run some more local tests?
When running a quick test with small buffers (1 to 32K) I see up to about a 740% improvement. This was using my stand-alone micro benchmark outside of PG. My original 200-300% numbers were averaged including sizes up to 512MB, and the code seems not to run as well on large buffers. I will try the referenced micro benchmark on Monday. None of my benchmark testing used the command line "time" command. For Postgres I set "\timing" before the run, and for the stand-alone benchmark I took timestamps in the code. In all cases I used -O2 for optimization.
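For readers who want to reproduce the "timestamps in the code" approach, a minimal sketch follows. It is not the benchmark used for the numbers above: the names popcount_bytes() and time_popcount() are hypothetical, the popcount loop is a plain scalar stand-in, and it assumes a POSIX system (clock_gettime) with GCC or Clang (the builtins and the empty asm barrier).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Scalar stand-in for the function under test (hypothetical name). */
static uint64_t
popcount_bytes(const unsigned char *buf, size_t len)
{
	uint64_t	cnt = 0;

	while (len >= 8)
	{
		uint64_t	w;

		memcpy(&w, buf, 8);		/* safe for unaligned input */
		cnt += (uint64_t) __builtin_popcountll(w);
		buf += 8;
		len -= 8;
	}
	while (len--)
		cnt += (uint64_t) __builtin_popcount(*buf++);
	return cnt;
}

/*
 * Call popcount_bytes() "runs" times and return the average wall-clock
 * seconds per call.  The bit count of the last call is stored in *bits_out
 * so the caller can verify the result.
 */
double
time_popcount(const unsigned char *buf, size_t len, int runs, uint64_t *bits_out)
{
	struct timespec t0;
	struct timespec t1;
	uint64_t	bits = 0;

	assert(runs > 0 && bits_out != NULL);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < runs; i++)
	{
		/* Compiler barrier: keeps the call from being hoisted out of the loop. */
		__asm__ __volatile__("" : : "r"(buf) : "memory");
		bits = popcount_bytes(buf, len);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	*bits_out = bits;
	return ((double) (t1.tv_sec - t0.tv_sec) +
			(double) (t1.tv_nsec - t0.tv_nsec) / 1e9) / runs;
}
```

Averaging over many runs and using a monotonic clock avoids including process start-up and connection time, which "time psql" does include.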
Thanks,
Paul
On Sat, 16 Mar 2024 at 04:06, Nathan Bossart <nathandbossart@gmail.com> wrote:
I ran John Naylor's test_popcount module [0] with the following command on
an i7-1195G7:
time psql postgres -c 'select drive_popcount(10000000, 1024)'
Without your patches, this seems to take somewhere around 8.8 seconds.
With your patches, it takes 0.6 seconds. (I re-compiled and re-ran the
tests a couple of times because I had a difficult time believing the amount
of improvement.)
[0] /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com
I think most of that will come from getting rid of the indirect
function call that currently exists in pg_popcount().
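The point about the indirect call can be illustrated with a small stand-alone sketch. This is not PostgreSQL code: the function names are hypothetical, and it assumes GCC or Clang for __builtin_popcountll. The first loop makes one call through a function pointer per 8-byte word; the second does the same work with the word-level operation inlined into the loop body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word-level popcount that is only reached through a function pointer. */
static int
popcount64_impl(uint64_t word)
{
	return __builtin_popcountll(word);
}

/* volatile keeps the compiler from resolving the target at compile time. */
static int	(*volatile popcount64_ptr) (uint64_t) = popcount64_impl;

/* One indirect call per 8-byte word. */
uint64_t
popcount_words_indirect(const uint64_t *words, size_t nwords)
{
	uint64_t	cnt = 0;

	assert(words != NULL || nwords == 0);
	for (size_t i = 0; i < nwords; i++)
		cnt += (uint64_t) popcount64_ptr(words[i]);
	return cnt;
}

/* Same loop, but nothing is called through a pointer inside it. */
uint64_t
popcount_words_inline(const uint64_t *words, size_t nwords)
{
	uint64_t	cnt = 0;

	assert(words != NULL || nwords == 0);
	for (size_t i = 0; i < nwords; i++)
		cnt += (uint64_t) __builtin_popcountll(words[i]);
	return cnt;
}
```

Both functions return the same count; only the per-word call overhead differs, which is the cost the quick hack removes.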
Using the attached quick hack, the performance using John's test
module goes from:
-- master
postgres=# select drive_popcount(10000000, 1024);
Time: 9832.845 ms (00:09.833)
Time: 9844.460 ms (00:09.844)
Time: 9858.608 ms (00:09.859)
-- with attached hacky and untested patch
postgres=# select drive_popcount(10000000, 1024);
Time: 2539.029 ms (00:02.539)
Time: 2598.223 ms (00:02.598)
Time: 2611.435 ms (00:02.611)
--- and with the avx512 patch on an AMD 7945HX CPU:
postgres=# select drive_popcount(10000000, 1024);
Time: 564.982 ms
Time: 556.540 ms
Time: 554.032 ms
The following comment seems like it could do with some improvements.
* Use AVX-512 Intrinsics for supported Intel CPUs or fall back the the software
* loop in pg_bunutils.c and use the best 32 or 64 bit fast methods. If no fast
* methods are used this will fall back to __builtin_* or pure software.
There's nothing much specific to Intel here. AMD Zen4 has AVX512.
Plus "pg_bunutils.c" should be "pg_bitutils.c" and "the the" should be "the".
How about just:
* Use AVX-512 Intrinsics on supported CPUs. Fall back to the software loop in
* pg_popcount_slow() when AVX-512 is unavailable.
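As a rough stand-alone illustration of what that comment describes (use AVX-512 when the CPU supports it, otherwise fall back to a software loop), here is a sketch. It is not the patch's code: popcount_buf(), popcount_scalar() and popcount_avx512() are hypothetical names, and it assumes GCC or Clang, using the target function attribute and __builtin_cpu_supports() in place of separate compiler flags and a hand-written CPUID check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_HAVE_AVX512 1
#endif

/* Software loop: 8 bytes at a time, then a byte-wise tail. */
static uint64_t
popcount_scalar(const unsigned char *buf, size_t len)
{
	uint64_t	cnt = 0;

	while (len >= 8)
	{
		uint64_t	w;

		memcpy(&w, buf, 8);		/* safe for unaligned input */
		cnt += (uint64_t) __builtin_popcountll(w);
		buf += 8;
		len -= 8;
	}
	while (len--)
		cnt += (uint64_t) __builtin_popcount(*buf++);
	return cnt;
}

#ifdef SKETCH_HAVE_AVX512
/* 64 bytes per iteration; only this function is compiled for AVX-512. */
__attribute__((target("avx512f,avx512vpopcntdq")))
static uint64_t
popcount_avx512(const unsigned char *buf, size_t len)
{
	__m512i		acc = _mm512_setzero_si512();

	while (len >= 64)
	{
		const __m512i v = _mm512_loadu_si512((const void *) buf);

		acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(v));
		buf += 64;
		len -= 64;
	}
	/* Sum the eight 64-bit lanes, then handle the remaining bytes. */
	return (uint64_t) _mm512_reduce_add_epi64(acc) + popcount_scalar(buf, len);
}
#endif

/* Pick the AVX-512 loop only when the running CPU and OS support it. */
uint64_t
popcount_buf(const void *buf, size_t len)
{
	assert(buf != NULL || len == 0);
#ifdef SKETCH_HAVE_AVX512
	if (__builtin_cpu_supports("avx512f") &&
		__builtin_cpu_supports("avx512vpopcntdq"))
		return popcount_avx512(buf, len);
#endif
	return popcount_scalar(buf, len);
}
```

Checking support on every call keeps the sketch short; a real implementation would more likely decide once and cache the result in a function pointer.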
Maybe it's worth exploring something along the lines of the attached
before doing the AVX512 stuff. It seems like a pretty good speed-up
and will apply for CPUs without AVX512 support.
David
Attachments:
remove_indirect_func_call_in_pg_popcount.patch.txt (text/plain; charset=US-ASCII)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..85e45cee9b 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -305,7 +305,18 @@ pg_popcount(const char *buf, int bytes)
while (bytes >= 8)
{
- popcnt += pg_popcount64(*words++);
+#ifdef _MSC_VER
+ popcnt += __popcnt64(*words++);
+#else
+ uint64 res;
+
+ __asm__ __volatile__(" popcntq %1,%0\n"
+ : "=q"(res)
+ : "rm"(*words)
+ : "cc");
+ popcnt += (int) res;
+ words++;
+#endif
bytes -= 8;
}
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index f1d18a1b29..ae880db64c 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -26,6 +26,7 @@ subdir('test_misc')
subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
+subdir('test_popcount')
subdir('test_predtest')
subdir('test_radixtree')
subdir('test_rbtree')
On Mon, Mar 18, 2024 at 09:56:32AM +1300, David Rowley wrote:
Maybe it's worth exploring something along the lines of the attached
before doing the AVX512 stuff. It seems like a pretty good speed-up
and will apply for CPUs without AVX512 support.
+1
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Won't I still need the runtime checks? If I compile with a compiler supporting the HW "feature" but run on HW without that feature, I will want to avoid faults due to illegal operations. Won't that also affect performance?
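For context, the kind of runtime check in question can be sketched as below. This is an illustration rather than the patch's pg_popcount512_available(): the names avx512_popcnt_usable() and read_xcr0() are hypothetical, and it assumes GCC or Clang on x86-64 (cpuid.h and inline asm). Unlike the check in the patch above, it tests OSXSAVE before executing XGETBV and requires all of the relevant XCR0 state bits to be set, not just any one of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <cpuid.h>

/* Read XCR0.  The caller must already have confirmed OSXSAVE via CPUID. */
static uint64_t
read_xcr0(bool osxsave_enabled)
{
	uint32_t	lo;
	uint32_t	hi;

	assert(osxsave_enabled);	/* XGETBV raises #UD otherwise */
	__asm__ __volatile__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	return ((uint64_t) hi << 32) | lo;
}
#endif

/* True if both the CPU and the OS allow AVX512F + AVX512VPOPCNTDQ code. */
bool
avx512_popcnt_usable(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
	unsigned int eax;
	unsigned int ebx;
	unsigned int ecx;
	unsigned int edx;

	/* Leaf 1: ECX bit 27 (OSXSAVE) means the OS has enabled XGETBV. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || (ecx & (1u << 27)) == 0)
		return false;

	/* XCR0 bits 1-2 (SSE, AVX) and 5-7 (opmask, ZMM state) must all be set. */
	if ((read_xcr0(true) & 0xE6) != 0xE6)
		return false;

	/* Leaf 7, subleaf 0: EBX bit 16 = AVX512F, ECX bit 14 = AVX512VPOPCNTDQ. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return (ebx & (1u << 16)) != 0 && (ecx & (1u << 14)) != 0;
#else
	return false;				/* not x86-64, or an unsupported compiler */
#endif
}
```

The check only needs to run once per process, so its cost is negligible as long as the result is cached rather than re-evaluated for every buffer.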
Paul
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Monday, March 18, 2024 8:29 AM
To: David Rowley <dgrowleyml@gmail.com>
Cc: Amonson, Paul D <paul.d.amonson@intel.com>; Andres Freund
<andres@anarazel.de>; Alvaro Herrera <alvherre@alvh.no-ip.org>; Shankaran,
Akash <akash.shankaran@intel.com>; Noah Misch <noah@leadboat.com>;
Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent
<boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512
On Mon, Mar 18, 2024 at 09:56:32AM +1300, David Rowley wrote:
Maybe it's worth exploring something along the lines of the attached
before doing the AVX512 stuff. It seems like a pretty good speed-up
and will apply for CPUs without AVX512 support.
+1
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 18, 2024 at 04:07:40PM +0000, Amonson, Paul D wrote:
Won't I still need the runtime checks? If I compile with a compiler
supporting the HW "feature" but run on HW without that feature, I will
want to avoid faults due to illegal operations. Won't that also affect
performance?
I don't think David was suggesting that we need to remove the runtime
checks for AVX512. IIUC he was pointing out that most of the performance
gain is from removing the function call overhead, which your v8-0002 patch
already does for the proposed AVX512 code. We can apply a similar
optimization for systems without AVX512 by inlining the code for
pg_popcount64() and pg_popcount32().
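To make the distinction concrete, here is a minimal sketch of the pattern being discussed: the runtime check stays, but it selects a whole-buffer function once, and each selected function keeps its word-level work inline. All names are hypothetical, and it assumes GCC or Clang; __builtin_cpu_supports("popcnt") stands in for a CPUID-based check, and __builtin_popcountll stands in for the POPCNT-based loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t (*popcount_fn) (const unsigned char *buf, size_t len);

static uint64_t popcount_choose(const unsigned char *buf, size_t len);

/* Starts out pointing at the chooser; replaced on the first call. */
static popcount_fn popcount_impl = popcount_choose;

/* Portable loop: clear the lowest set bit until the byte is zero. */
static uint64_t
popcount_portable(const unsigned char *buf, size_t len)
{
	uint64_t	cnt = 0;

	while (len--)
	{
		unsigned char b = *buf++;

		for (; b != 0; b &= (unsigned char) (b - 1))
			cnt++;
	}
	return cnt;
}

/* Word-at-a-time loop with the word-level operation inlined. */
static uint64_t
popcount_wordwise(const unsigned char *buf, size_t len)
{
	uint64_t	cnt = 0;

	while (len >= 8)
	{
		uint64_t	w;

		memcpy(&w, buf, 8);
		cnt += (uint64_t) __builtin_popcountll(w);
		buf += 8;
		len -= 8;
	}
	return cnt + popcount_portable(buf, len);
}

/* Hypothetical capability probe; stands in for a CPUID-based check. */
static int
fast_path_available(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
	return __builtin_cpu_supports("popcnt");
#else
	return 0;
#endif
}

/* Runs once: pick an implementation, remember it, then forward the call. */
static uint64_t
popcount_choose(const unsigned char *buf, size_t len)
{
	popcount_impl = fast_path_available() ? popcount_wordwise : popcount_portable;
	return popcount_impl(buf, len);
}

/* Public entry point: one indirect call per buffer, not per word. */
uint64_t
popcount_dispatch(const void *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	return popcount_impl((const unsigned char *) buf, len);
}
```

With this shape the indirect call is paid once per buffer, so its cost is amortized over however many words the buffer holds.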
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Monday, March 18, 2024 9:20 AM
...
I don't think David was suggesting that we need to remove the runtime checks
for AVX512. IIUC he was pointing out that most of the performance gain is
from removing the function call overhead, which your v8-0002 patch already
does for the proposed AVX512 code. We can apply a similar optimization for
systems without AVX512 by inlining the code for
pg_popcount64() and pg_popcount32().
Ok, got you.
Question: I applied the patch for the drive_popcount* functions and rebuilt. The resultant server complains that the function is missing. What is the trick to make this work?
Another Question: Is there a reason "time psql" is used over the Postgres "\timing" command?
Thanks,
Paul
On Mon, Mar 18, 2024 at 11:20:18AM -0500, Nathan Bossart wrote:
I don't think David was suggesting that we need to remove the runtime
checks for AVX512. IIUC he was pointing out that most of the performance
gain is from removing the function call overhead, which your v8-0002 patch
already does for the proposed AVX512 code. We can apply a similar
optimization for systems without AVX512 by inlining the code for
pg_popcount64() and pg_popcount32().
Here is a more fleshed-out version of what I believe David is proposing.
On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the
test_popcount benchmark). I assume this is because this patch turns
pg_popcount() into a function pointer, which is what the AVX512 patches do,
too. I left out the 32-bit section from pg_popcount_fast(), but I'll admit
that I'm not yet 100% sure that we can assume we're on a 64-bit system
there.
IMHO this work is arguably a prerequisite for the AVX512 work, as turning
pg_popcount() into a function pointer will likely regress performance for
folks on systems without AVX512 otherwise.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
0001-inline-function-calls-in-pg_popcount-when-possible.patch (text/x-diff; charset=us-ascii)
From 1d33c803feb7428f798b13fd643a16c73628f8a9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 18 Mar 2024 12:18:15 -0500
Subject: [PATCH 1/1] inline function calls in pg_popcount() when possible
---
src/include/port/pg_bitutils.h | 5 +-
src/port/pg_bitutils.c | 123 +++++++++++++++++++++++++--------
2 files changed, 97 insertions(+), 31 deletions(-)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..53e5239717 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,17 +302,16 @@ pg_ceil_log2_64(uint64 num)
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
+extern uint64 pg_popcount(const char *buf, int bytes);
#endif /* TRY_POPCNT_FAST */
-/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
-
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..e374e753d7 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -105,16 +105,20 @@ const uint8 pg_number_of_ones[256] = {
static int pg_popcount32_slow(uint32 word);
static int pg_popcount64_slow(uint64 word);
+static uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
static int pg_popcount32_fast(uint32 word);
-static int pg_popcount64_fast(uint64 word);
+static inline int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -151,11 +155,13 @@ pg_popcount32_choose(uint32 word)
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
}
return pg_popcount32(word);
@@ -168,16 +174,37 @@ pg_popcount64_choose(uint64 word)
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
}
return pg_popcount64(word);
}
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
+ }
+
+ return pg_popcount(buf, bytes);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -199,7 +226,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount64_fast
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount64_fast(uint64 word)
{
#ifdef _MSC_VER
@@ -212,6 +239,36 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
#endif
}
+/*
+ * pg_popcount_fast
+ * Returns the number of 1-bits in buf
+ */
+static uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+ /* Process in 64-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -265,35 +322,12 @@ pg_popcount64_slow(uint64 word)
#endif /* HAVE__BUILTIN_POPCOUNT */
}
-#ifndef TRY_POPCNT_FAST
-
/*
- * When the POPCNT instruction is not available, there's no point in using
- * function pointers to vary the implementation between the fast and slow
- * method. We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
- * the slow versions here.
- */
-int
-pg_popcount32(uint32 word)
-{
- return pg_popcount32_slow(word);
-}
-
-int
-pg_popcount64(uint64 word)
-{
- return pg_popcount64_slow(word);
-}
-
-#endif /* !TRY_POPCNT_FAST */
-
-/*
- * pg_popcount
+ * pg_popcount_slow
* Returns the number of 1-bits in buf
*/
-uint64
-pg_popcount(const char *buf, int bytes)
+static uint64
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -333,3 +367,36 @@ pg_popcount(const char *buf, int bytes)
return popcnt;
}
+
+#ifndef TRY_POPCNT_FAST
+
+/*
+ * When the POPCNT instruction is not available, there's no point in using
+ * function pointers to vary the implementation between the fast and slow
+ * method. We instead just make these actual external functions when
+ * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
+ * the slow versions here.
+ */
+int
+pg_popcount32(uint32 word)
+{
+ return pg_popcount32_slow(word);
+}
+
+int
+pg_popcount64(uint64 word)
+{
+ return pg_popcount64_slow(word);
+}
+
+/*
+ * pg_popcount
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount(const char *buf, int bytes)
+{
+ return pg_popcount_slow(buf, bytes);
+}
+
+#endif /* !TRY_POPCNT_FAST */
--
2.25.1
On Mon, Mar 18, 2024 at 05:28:32PM +0000, Amonson, Paul D wrote:
Question: I applied the patch for the drive_popcount* functions and
rebuilt. The resultant server complains that the function is missing.
What is the trick to make this work?
You probably need to install the test_popcount extension and run "CREATE
EXTENSION test_popcount;".
Another Question: Is there a reason "time psql" is used over the Postgres
"\timing" command?
I don't think there's any strong reason. I've used both.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 18, 2024 at 12:30:04PM -0500, Nathan Bossart wrote:
Here is a more fleshed-out version of what I believe David is proposing.
On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the
test_popcount benchmark). I assume this is because this patch turns
pg_popcount() into a function pointer, which is what the AVX512 patches do,
too. I left out the 32-bit section from pg_popcount_fast(), but I'll admit
that I'm not yet 100% sure that we can assume we're on a 64-bit system
there.
IMHO this work is arguably a prerequisite for the AVX512 work, as turning
pg_popcount() into a function pointer will likely regress performance for
folks on systems without AVX512 otherwise.
Apologies for the noise. I noticed that we could (and probably should)
inline the pg_popcount32/64 calls in the "slow" version, too.
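
For readers following along, the trade-off above can be sketched in a standalone form. This is only an illustration, not the PostgreSQL source: the names popcount_buf, popcount_buf_choose, popcount_buf_impl and word_popcount are hypothetical, and __builtin_popcountll assumes GCC or Clang. The point is that the indirect call happens once per buffer, while the per-word helper is a static inline function the compiler is free to inline into the loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-word helper; static inline so the loop below can inline it. */
static inline int
word_popcount(uint64_t word)
{
	return __builtin_popcountll(word);	/* GCC/Clang builtin (assumption) */
}

/* Whole-buffer implementation: one indirect call per buffer, not per word. */
static uint64_t
popcount_buf_impl(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	assert(bytes >= 0);

	/* Process 8 bytes at a time. */
	while (bytes >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* safe for unaligned buffers */
		popcnt += word_popcount(word);
		buf += 8;
		bytes -= 8;
	}

	/* Process any remaining bytes. */
	while (bytes-- > 0)
		popcnt += word_popcount((unsigned char) *buf++);

	return popcnt;
}

static uint64_t popcount_buf_choose(const char *buf, int bytes);

/* Starts out pointing at the "choose" stub, which replaces itself on first use. */
uint64_t	(*popcount_buf) (const char *buf, int bytes) = popcount_buf_choose;

static uint64_t
popcount_buf_choose(const char *buf, int bytes)
{
	/* A real chooser would test CPU features here before picking. */
	popcount_buf = popcount_buf_impl;
	return popcount_buf(buf, bytes);
}
```

Callers would simply write popcount_buf(data, len); after the first call the pointer goes straight to the chosen implementation.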
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
0001-inline-function-calls-in-pg_popcount-when-possible.patch (text/x-diff; charset=us-ascii)
From 3047674f0950435b7fa30746be7f8e5cc7249e6d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 18 Mar 2024 12:18:15 -0500
Subject: [PATCH 1/1] inline function calls in pg_popcount() when possible
---
src/include/port/pg_bitutils.h | 5 +-
src/port/pg_bitutils.c | 135 ++++++++++++++++++++++++---------
2 files changed, 103 insertions(+), 37 deletions(-)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..53e5239717 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,17 +302,16 @@ pg_ceil_log2_64(uint64 num)
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
+extern uint64 pg_popcount(const char *buf, int bytes);
#endif /* TRY_POPCNT_FAST */
-/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
-
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..d0c93dafcb 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,18 +103,22 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
+static inline int pg_popcount32_slow(uint32 word);
+static inline int pg_popcount64_slow(uint64 word);
+static uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
static int pg_popcount32_fast(uint32 word);
-static int pg_popcount64_fast(uint64 word);
+static inline int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -151,11 +155,13 @@ pg_popcount32_choose(uint32 word)
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
}
return pg_popcount32(word);
@@ -168,16 +174,37 @@ pg_popcount64_choose(uint64 word)
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
}
return pg_popcount64(word);
}
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
+ }
+
+ return pg_popcount(buf, bytes);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -199,7 +226,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount64_fast
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount64_fast(uint64 word)
{
#ifdef _MSC_VER
@@ -212,6 +239,36 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
#endif
}
+/*
+ * pg_popcount_fast
+ * Returns the number of 1-bits in buf
+ */
+static uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+ /* Process in 64-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -219,7 +276,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -241,7 +298,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -265,35 +322,12 @@ pg_popcount64_slow(uint64 word)
#endif /* HAVE__BUILTIN_POPCOUNT */
}
-#ifndef TRY_POPCNT_FAST
-
/*
- * When the POPCNT instruction is not available, there's no point in using
- * function pointers to vary the implementation between the fast and slow
- * method. We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
- * the slow versions here.
- */
-int
-pg_popcount32(uint32 word)
-{
- return pg_popcount32_slow(word);
-}
-
-int
-pg_popcount64(uint64 word)
-{
- return pg_popcount64_slow(word);
-}
-
-#endif /* !TRY_POPCNT_FAST */
-
-/*
- * pg_popcount
+ * pg_popcount_slow
* Returns the number of 1-bits in buf
*/
-uint64
-pg_popcount(const char *buf, int bytes)
+static uint64
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -305,7 +339,7 @@ pg_popcount(const char *buf, int bytes)
while (bytes >= 8)
{
- popcnt += pg_popcount64(*words++);
+ popcnt += pg_popcount64_slow(*words++);
bytes -= 8;
}
@@ -319,7 +353,7 @@ pg_popcount(const char *buf, int bytes)
while (bytes >= 4)
{
- popcnt += pg_popcount32(*words++);
+ popcnt += pg_popcount32_slow(*words++);
bytes -= 4;
}
@@ -333,3 +367,36 @@ pg_popcount(const char *buf, int bytes)
return popcnt;
}
+
+#ifndef TRY_POPCNT_FAST
+
+/*
+ * When the POPCNT instruction is not available, there's no point in using
+ * function pointers to vary the implementation between the fast and slow
+ * method. We instead just make these actual external functions when
+ * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
+ * the slow versions here.
+ */
+int
+pg_popcount32(uint32 word)
+{
+ return pg_popcount32_slow(word);
+}
+
+int
+pg_popcount64(uint64 word)
+{
+ return pg_popcount64_slow(word);
+}
+
+/*
+ * pg_popcount
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount(const char *buf, int bytes)
+{
+ return pg_popcount_slow(buf, bytes);
+}
+
+#endif /* !TRY_POPCNT_FAST */
--
2.25.1
On Tue, 19 Mar 2024 at 06:30, Nathan Bossart <nathandbossart@gmail.com> wrote:
Here is a more fleshed-out version of what I believe David is proposing.
On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the
test_popcount benchmark). I assume this is because this patch turns
pg_popcount() into a function pointer, which is what the AVX512 patches do,
too. I left out the 32-bit section from pg_popcount_fast(), but I'll admit
that I'm not yet 100% sure that we can assume we're on a 64-bit system
there.
I looked at your latest patch and tried out the performance on a Zen4
running Windows and a Zen2 running Linux. Results follow:
AMD 3990x:
master:
postgres=# select drive_popcount(10000000, 1024);
Time: 11904.078 ms (00:11.904)
Time: 11907.176 ms (00:11.907)
Time: 11927.983 ms (00:11.928)
patched:
postgres=# select drive_popcount(10000000, 1024);
Time: 3641.271 ms (00:03.641)
Time: 3610.934 ms (00:03.611)
Time: 3663.423 ms (00:03.663)
AMD 7945HX Windows
master:
postgres=# select drive_popcount(10000000, 1024);
Time: 9832.845 ms (00:09.833)
Time: 9844.460 ms (00:09.844)
Time: 9858.608 ms (00:09.859)
patched:
postgres=# select drive_popcount(10000000, 1024);
Time: 3427.942 ms (00:03.428)
Time: 3364.262 ms (00:03.364)
Time: 3413.407 ms (00:03.413)
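
As a rough aid for reading these timings, the sketch below averages the three runs per configuration and divides master by patched. The helper names (mean_ms, speedup) and the array names are made up for illustration; the numbers are copied from the results above. By this arithmetic the patched build comes out at about 3.3x faster on the 3990x and about 2.9x faster on the 7945HX.

```c
#include <stddef.h>

/* Mean of n timing samples, in milliseconds. */
static double
mean_ms(const double *samples, size_t n)
{
	double		sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += samples[i];

	return n > 0 ? sum / (double) n : 0.0;
}

/* Ratio of mean master time to mean patched time (higher means faster). */
static double
speedup(const double *master, const double *patched, size_t n)
{
	return mean_ms(master, n) / mean_ms(patched, n);
}

/* Timings from the message above: drive_popcount(10000000, 1024). */
static const double amd3990x_master[] = {11904.078, 11907.176, 11927.983};
static const double amd3990x_patched[] = {3641.271, 3610.934, 3663.423};
static const double amd7945hx_master[] = {9832.845, 9844.460, 9858.608};
static const double amd7945hx_patched[] = {3427.942, 3364.262, 3413.407};
```

For example, speedup(amd3990x_master, amd3990x_patched, 3) gives the Linux figure.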
The only thing I'd question in the patch is in pg_popcount_fast(). It
looks like you've opted to not do the 32-bit processing on 32-bit
machines. I think that's likely still worth coding in a similar way to
how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
Probably one day we'll remove that code, but it seems strange to have
pg_popcount_slow() do it and not pg_popcount_fast().
IMHO this work is arguably a prerequisite for the AVX512 work, as turning
pg_popcount() into a function pointer will likely regress performance for
folks on systems without AVX512 otherwise.
I think so too.
David
On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote:
I looked at your latest patch and tried out the performance on a Zen4
running Windows and a Zen2 running Linux. Results follow:
Thanks for taking a look.
The only thing I'd question in the patch is in pg_popcount_fast(). It
looks like you've opted to not do the 32-bit processing on 32-bit
machines. I think that's likely still worth coding in a similar way to
how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
Probably one day we'll remove that code, but it seems strange to have
pg_popcount_slow() do it and not pg_popcount_fast().
The only reason I left it out was because I couldn't convince myself that
it wasn't dead code, given we assume that popcntq is available in
pg_popcount64_fast() today. But I don't see any harm in adding that just
in case.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Monday, March 18, 2024 2:08 PM
To: David Rowley <dgrowleyml@gmail.com>
Cc: Amonson, Paul D <paul.d.amonson@intel.com>; Andres Freund
...
The only reason I left it out was because I couldn't convince myself that it
wasn't dead code, given we assume that popcntq is available in
pg_popcount64_fast() today. But I don't see any harm in adding that just in
case.
I am not sure how to read this. Does this mean that for popcount32_fast and popcount64_fast I can assume that the x86(_64) instructions exist and stop doing the runtime checks for instruction availability?
Thanks,
Paul
On Mon, Mar 18, 2024 at 09:22:43PM +0000, Amonson, Paul D wrote:
The only reason I left it out was because I couldn't convince myself that it
wasn't dead code, given we assume that popcntq is available in
pg_popcount64_fast() today. But I don't see any harm in adding that just in
case.
I am not sure how to read this. Does this mean that for popcount32_fast
and popcount64_fast I can assume that the x86(_64) instructions exist
and stop doing the runtime checks for instruction availability?
I think my question boils down to "if pg_popcount_available() returns true,
can I safely assume I'm on a 64-bit machine?"
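
To make the question concrete, here is a hedged sketch of the two separate facts involved. It is not the pg_popcount_available() source; popcnt_supported and built_for_64bit are hypothetical names, and <cpuid.h> with __get_cpuid assumes GCC or Clang on x86. The CPUID bit describes what the CPU can do, while the pointer width describes how the binary was built, so a 32-bit build running on a POPCNT-capable CPU would see the first check succeed and the second fail.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Runtime property: does the CPU advertise the POPCNT instruction? */
static bool
popcnt_supported(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax,
				ebx,
				ecx,
				edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;

	return (ecx & (1u << 23)) != 0;	/* CPUID.01H:ECX bit 23 = POPCNT */
#else
	return false;				/* not x86: no CPUID to ask */
#endif
}

/* Build-time property: was this binary compiled with 64-bit pointers? */
static bool
built_for_64bit(void)
{
	return sizeof(void *) >= 8;
}
```

Only when both are true is the 64-bit form of the instruction usable, which is one way to read the suggestion to keep a compile-time pointer-size check next to the runtime check.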
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, 19 Mar 2024 at 10:08, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote:
The only thing I'd question in the patch is in pg_popcount_fast(). It
looks like you've opted to not do the 32-bit processing on 32-bit
machines. I think that's likely still worth coding in a similar way to
how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
Probably one day we'll remove that code, but it seems strange to have
pg_popcount_slow() do it and not pg_popcount_fast().
The only reason I left it out was because I couldn't convince myself that
it wasn't dead code, given we assume that popcntq is available in
pg_popcount64_fast() today. But I don't see any harm in adding that just
in case.
It's probably more of a case of using native instructions rather than
ones that might be implemented only via microcode. For the record, I
don't know if that would be the case for popcntq on x86 32-bit and I
don't have the hardware to test it. It just seems less risky just to
do it.
David
On Tue, Mar 19, 2024 at 10:27:58AM +1300, David Rowley wrote:
On Tue, 19 Mar 2024 at 10:08, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote:
The only thing I'd question in the patch is in pg_popcount_fast(). It
looks like you've opted to not do the 32-bit processing on 32-bit
machines. I think that's likely still worth coding in a similar way to
how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
Probably one day we'll remove that code, but it seems strange to have
pg_popcount_slow() do it and not pg_popcount_fast().
The only reason I left it out was because I couldn't convince myself that
it wasn't dead code, given we assume that popcntq is available in
pg_popcount64_fast() today. But I don't see any harm in adding that just
in case.
It's probably more of a case of using native instructions rather than
ones that might be implemented only via microcode. For the record, I
don't know if that would be the case for popcntq on x86 32-bit and I
don't have the hardware to test it. It just seems less risky just to
do it.
Agreed. Will send an updated patch shortly.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 18, 2024 at 04:29:19PM -0500, Nathan Bossart wrote:
Agreed. Will send an updated patch shortly.
As promised...
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0001-inline-function-calls-in-pg_popcount-when-possibl.patch (text/x-diff; charset=us-ascii)
From b673663b1d1344549cbd0912220f96ba1712afc6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 18 Mar 2024 12:18:15 -0500
Subject: [PATCH v4 1/1] inline function calls in pg_popcount() when possible
---
src/include/port/pg_bitutils.h | 5 +-
src/port/pg_bitutils.c | 155 +++++++++++++++++++++++++--------
2 files changed, 121 insertions(+), 39 deletions(-)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..53e5239717 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,17 +302,16 @@ pg_ceil_log2_64(uint64 num)
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
+extern uint64 pg_popcount(const char *buf, int bytes);
#endif /* TRY_POPCNT_FAST */
-/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
-
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..1197696e97 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,18 +103,22 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
+static inline int pg_popcount32_slow(uint32 word);
+static inline int pg_popcount64_slow(uint64 word);
+static uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
-static int pg_popcount32_fast(uint32 word);
-static int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+static inline int pg_popcount32_fast(uint32 word);
+static inline int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -151,11 +155,13 @@ pg_popcount32_choose(uint32 word)
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
}
return pg_popcount32(word);
@@ -168,21 +174,42 @@ pg_popcount64_choose(uint64 word)
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
}
return pg_popcount64(word);
}
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
+ }
+
+ return pg_popcount(buf, bytes);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount32_fast(uint32 word)
{
#ifdef _MSC_VER
@@ -199,7 +226,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount64_fast
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount64_fast(uint64 word)
{
#ifdef _MSC_VER
@@ -212,6 +239,52 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
#endif
}
+/*
+ * pg_popcount_fast
+ * Returns the number of 1-bits in buf
+ */
+static uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -219,7 +292,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -241,7 +314,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static int
+static inline int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -265,35 +338,12 @@ pg_popcount64_slow(uint64 word)
#endif /* HAVE__BUILTIN_POPCOUNT */
}
-#ifndef TRY_POPCNT_FAST
-
/*
- * When the POPCNT instruction is not available, there's no point in using
- * function pointers to vary the implementation between the fast and slow
- * method. We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
- * the slow versions here.
- */
-int
-pg_popcount32(uint32 word)
-{
- return pg_popcount32_slow(word);
-}
-
-int
-pg_popcount64(uint64 word)
-{
- return pg_popcount64_slow(word);
-}
-
-#endif /* !TRY_POPCNT_FAST */
-
-/*
- * pg_popcount
+ * pg_popcount_slow
* Returns the number of 1-bits in buf
*/
-uint64
-pg_popcount(const char *buf, int bytes)
+static uint64
+pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -305,7 +355,7 @@ pg_popcount(const char *buf, int bytes)
while (bytes >= 8)
{
- popcnt += pg_popcount64(*words++);
+ popcnt += pg_popcount64_slow(*words++);
bytes -= 8;
}
@@ -319,7 +369,7 @@ pg_popcount(const char *buf, int bytes)
while (bytes >= 4)
{
- popcnt += pg_popcount32(*words++);
+ popcnt += pg_popcount32_slow(*words++);
bytes -= 4;
}
@@ -333,3 +383,36 @@ pg_popcount(const char *buf, int bytes)
return popcnt;
}
+
+#ifndef TRY_POPCNT_FAST
+
+/*
+ * When the POPCNT instruction is not available, there's no point in using
+ * function pointers to vary the implementation between the fast and slow
+ * method. We instead just make these actual external functions when
+ * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
+ * the slow versions here.
+ */
+int
+pg_popcount32(uint32 word)
+{
+ return pg_popcount32_slow(word);
+}
+
+int
+pg_popcount64(uint64 word)
+{
+ return pg_popcount64_slow(word);
+}
+
+/*
+ * pg_popcount
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount(const char *buf, int bytes)
+{
+ return pg_popcount_slow(buf, bytes);
+}
+
+#endif /* !TRY_POPCNT_FAST */
--
2.25.1
On Tue, 19 Mar 2024 at 11:08, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Mar 18, 2024 at 04:29:19PM -0500, Nathan Bossart wrote:
Agreed. Will send an updated patch shortly.
As promised...
Looks good.
David
On Tue, Mar 19, 2024 at 12:30:50PM +1300, David Rowley wrote:
Looks good.
Committed. Thanks for the suggestion and for reviewing!
Paul, I suspect your patches will need to be rebased after commit cc4826d.
Would you mind doing so?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Committed. Thanks for the suggestion and for reviewing!
Paul, I suspect your patches will need to be rebased after commit cc4826d.
Would you mind doing so?
Changed in this patch set.
* Rebased.
* Direct *slow* calls via macros as shown in example patch.
* Changed the choose filename to be platform specific as suggested.
* Falls back to intermediate "Fast" methods if AVX512 is not available at runtime.
* inline used where it makes sense; remember that using "extern" negates "inline".
* Fixed comment issues pointed out in review.
I tested building with and without TRY_POPCNT_FAST, for both configure and meson build systems, and ran in CI.
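
The macro approach in the list above can be sketched in isolation as follows. This is an illustration only, not the attached patch: DEMO_TRY_FAST stands in for TRY_POPCNT_FAST, DEMO_POPCOUNT64 stands in for PG_POPCOUNT64, and the function names are hypothetical. With the flag defined, the macro goes through a self-replacing function pointer; without it, the macro names the portable function directly, so the compiler can inline the call.

```c
#include <stdint.h>

/* Portable fallback, always compiled: clears the lowest set bit each pass. */
static inline int
popcount64_portable(uint64_t word)
{
	int			count = 0;

	while (word)
	{
		word &= word - 1;
		count++;
	}
	return count;
}

#ifdef DEMO_TRY_FAST			/* hypothetical stand-in for TRY_POPCNT_FAST */

static int	popcount64_dispatch_init(uint64_t word);

/* Function pointer that is resolved on the first call. */
static int	(*popcount64_dispatch) (uint64_t word) = popcount64_dispatch_init;

static int
popcount64_dispatch_init(uint64_t word)
{
	/* Real code would pick a faster variant here after a CPU check. */
	popcount64_dispatch = popcount64_portable;
	return popcount64_dispatch(word);
}

#define DEMO_POPCOUNT64(x) popcount64_dispatch(x)

#else

/* No runtime choice to make: call the portable version directly. */
#define DEMO_POPCOUNT64(x) popcount64_portable(x)

#endif
```

Call sites then use DEMO_POPCOUNT64(value) in both configurations and never need to know which path was compiled in.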
Thanks,
Paul
Attachments:
v10-0001-Refactor-inlining-and-direct-calls-for-_slow-functio.patch (application/octet-stream)
From e9483d2354bd58210805b17fa78091ecc007c612 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 19 Mar 2024 13:35:53 -0700
Subject: [PATCH 1/3] [Refactor] inlining and direct calls for *_slow
functions.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
src/backend/access/heap/visibilitymap.c | 6 +-
src/include/nodes/bitmapset.h | 4 +-
src/include/port/pg_bitutils.h | 17 +++--
src/port/pg_bitutils.c | 90 +++++--------------------
4 files changed, 34 insertions(+), 83 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..28dc497b79 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -419,14 +419,14 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
if (all_frozen == NULL)
{
for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
+ nvisible += PG_POPCOUNT64(map[i] & VISIBLE_MASK64);
}
else
{
for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
{
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
+ nvisible += PG_POPCOUNT64(map[i] & VISIBLE_MASK64);
+ nfrozen += PG_POPCOUNT64(map[i] & FROZEN_MASK64);
}
}
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 283bea5ea9..b5631d153e 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -77,11 +77,11 @@ typedef enum
#if BITS_PER_BITMAPWORD == 32
#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
+#define bmw_popcount(w) PG_POPCOUNT32(w)
#elif BITS_PER_BITMAPWORD == 64
#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
+#define bmw_popcount(w) PG_POPCOUNT64(w)
#else
#error "invalid BITS_PER_BITMAPWORD"
#endif
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..477e00e0da 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -300,16 +300,23 @@ pg_ceil_log2_64(uint64 num)
#ifdef TRY_POPCNT_FAST
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
-
+#define PG_POPCOUNT32(x) pg_popcount32(x)
+#define PG_POPCOUNT64(x) pg_popcount64(x)
+#define PG_POPCOUNT(x,y) pg_popcount(x,y)
#else
/* Use a portable implementation -- no need for a function pointer. */
-extern int pg_popcount32(uint32 word);
-extern int pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
-
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
+#define PG_POPCOUNT32(x) pg_popcount32_slow(x)
+#define PG_POPCOUNT64(x) pg_popcount64_slow(x)
+#define PG_POPCOUNT(x,y) pg_popcount_slow(x,y)
#endif /* TRY_POPCNT_FAST */
/*
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..e629969035 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,9 +103,9 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static inline int pg_popcount32_slow(uint32 word);
-static inline int pg_popcount64_slow(uint64 word);
-static uint64 pg_popcount_slow(const char *buf, int bytes);
+// static inline int pg_popcount32_slow(uint32 word);
+// static inline int pg_popcount64_slow(uint64 word);
+// static uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
@@ -119,9 +119,6 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
-#endif /* TRY_POPCNT_FAST */
-
-#ifdef TRY_POPCNT_FAST
/*
* Return true if CPUID indicates that the POPCNT instruction is available.
@@ -148,8 +145,7 @@ pg_popcount_available(void)
* the function pointers so that subsequent calls are routed directly to
* the chosen implementation.
*/
-static int
-pg_popcount32_choose(uint32 word)
+static inline void set_function_pointers()
{
if (pg_popcount_available())
{
@@ -163,45 +159,26 @@ pg_popcount32_choose(uint32 word)
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
}
+}
+static int
+pg_popcount32_choose(uint32 word)
+{
+ set_function_pointers();
return pg_popcount32(word);
}
static int
pg_popcount64_choose(uint64 word)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ set_function_pointers();
return pg_popcount64(word);
}
static uint64
pg_popcount_choose(const char *buf, int bytes)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ set_function_pointers();
return pg_popcount(buf, bytes);
}
@@ -243,7 +220,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+static inline uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -256,7 +233,7 @@ pg_popcount_fast(const char *buf, int bytes)
while (bytes >= 8)
{
- popcnt += pg_popcount64_fast(*words++);
+ popcnt += PG_POPCOUNT64(*words++);
bytes -= 8;
}
@@ -270,7 +247,7 @@ pg_popcount_fast(const char *buf, int bytes)
while (bytes >= 4)
{
- popcnt += pg_popcount32_fast(*words++);
+ popcnt += PG_POPCOUNT32(*words++);
bytes -= 4;
}
@@ -292,7 +269,7 @@ pg_popcount_fast(const char *buf, int bytes)
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static inline int
+inline int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -314,7 +291,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static inline int
+inline int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -342,7 +319,7 @@ pg_popcount64_slow(uint64 word)
* pg_popcount_slow
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -383,36 +360,3 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
-
-#ifndef TRY_POPCNT_FAST
-
-/*
- * When the POPCNT instruction is not available, there's no point in using
- * function pointers to vary the implementation between the fast and slow
- * method. We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
- * the slow versions here.
- */
-int
-pg_popcount32(uint32 word)
-{
- return pg_popcount32_slow(word);
-}
-
-int
-pg_popcount64(uint64 word)
-{
- return pg_popcount64_slow(word);
-}
-
-/*
- * pg_popcount
- * Returns the number of 1-bits in buf
- */
-uint64
-pg_popcount(const char *buf, int bytes)
-{
- return pg_popcount_slow(buf, bytes);
-}
-
-#endif /* !TRY_POPCNT_FAST */
--
2.34.1
v10-0002-Refactor-Seperated-slow-fast-and-choose-functionalit.patch (application/octet-stream)
From e5e4cac323b913e9fcdbd17d6b07316a21f7ff5c Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 19 Mar 2024 13:37:31 -0700
Subject: [PATCH 2/3] [Refactor] Separated slow, fast, and choose functionality
into files.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
contrib/intarray/_intbig_gist.c | 2 +-
contrib/ltree/_ltree_gist.c | 2 +-
contrib/pageinspect/heapfuncs.c | 4 +-
contrib/pg_trgm/trgm_gist.c | 2 +-
contrib/pg_walinspect/pg_walinspect.c | 2 +-
src/backend/lib/bloomfilter.c | 2 +-
src/backend/postmaster/syslogger.c | 2 +-
src/backend/utils/adt/tsgistidx.c | 2 +-
src/backend/utils/adt/varbit.c | 2 +-
src/backend/utils/adt/varlena.c | 2 +-
src/port/Makefile | 2 +
src/port/meson.build | 2 +
src/port/pg_bitutils.c | 171 +-------------------------
src/port/pg_popcount_x86_64_accel.c | 101 +++++++++++++++
src/port/pg_popcount_x86_64_choose.c | 98 +++++++++++++++
15 files changed, 215 insertions(+), 181 deletions(-)
create mode 100644 src/port/pg_popcount_x86_64_accel.c
create mode 100644 src/port/pg_popcount_x86_64_choose.c
diff --git a/contrib/intarray/_intbig_gist.c b/contrib/intarray/_intbig_gist.c
index 9699fbf3b4..a12ea7ed9b 100644
--- a/contrib/intarray/_intbig_gist.c
+++ b/contrib/intarray/_intbig_gist.c
@@ -210,7 +210,7 @@ g_intbig_compress(PG_FUNCTION_ARGS)
static int32
sizebitvec(BITVECP sign, int siglen)
{
- return pg_popcount(sign, siglen);
+ return PG_POPCOUNT(sign, siglen);
}
static int
diff --git a/contrib/ltree/_ltree_gist.c b/contrib/ltree/_ltree_gist.c
index e89a39a5b5..bd66ec2e65 100644
--- a/contrib/ltree/_ltree_gist.c
+++ b/contrib/ltree/_ltree_gist.c
@@ -180,7 +180,7 @@ _ltree_union(PG_FUNCTION_ARGS)
static int32
sizebitvec(BITVECP sign, int siglen)
{
- return pg_popcount((const char *) sign, siglen);
+ return PG_POPCOUNT((const char *) sign, siglen);
}
static int
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 3faeabc711..089842962f 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -527,8 +527,8 @@ heap_tuple_infomask_flags(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- bitcnt = pg_popcount((const char *) &t_infomask, sizeof(uint16)) +
- pg_popcount((const char *) &t_infomask2, sizeof(uint16));
+ bitcnt = PG_POPCOUNT((const char *) &t_infomask, sizeof(uint16)) +
+ PG_POPCOUNT((const char *) &t_infomask2, sizeof(uint16));
/* If no flags, return a set of empty arrays */
if (bitcnt <= 0)
diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index 9ef2e38560..850316196f 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -648,7 +648,7 @@ gtrgm_same(PG_FUNCTION_ARGS)
static int32
sizebitvec(BITVECP sign, int siglen)
{
- return pg_popcount(sign, siglen);
+ return PG_POPCOUNT(sign, siglen);
}
static int
diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index ee2918726d..93a7b4842a 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -303,7 +303,7 @@ GetWALBlockInfo(FunctionCallInfo fcinfo, XLogReaderState *record,
block_fpi_len = blk->bimg_len;
/* Construct and save block_fpi_info */
- bitcnt = pg_popcount((const char *) &blk->bimg_info,
+ bitcnt = PG_POPCOUNT((const char *) &blk->bimg_info,
sizeof(uint8));
flags = (Datum *) palloc0(sizeof(Datum) * bitcnt);
if ((blk->bimg_info & BKPIMAGE_HAS_HOLE) != 0)
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
index 360d21ca45..c01b069c01 100644
--- a/src/backend/lib/bloomfilter.c
+++ b/src/backend/lib/bloomfilter.c
@@ -187,7 +187,7 @@ double
bloom_prop_bits_set(bloom_filter *filter)
{
int bitset_bytes = filter->m / BITS_PER_BYTE;
- uint64 bits_set = pg_popcount((char *) filter->bitset, bitset_bytes);
+ uint64 bits_set = PG_POPCOUNT((char *) filter->bitset, bitset_bytes);
return bits_set / (double) filter->m;
}
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 08efe74cc9..85c57b3154 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -898,7 +898,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
if (p.nuls[0] == '\0' && p.nuls[1] == '\0' &&
p.len > 0 && p.len <= PIPE_MAX_PAYLOAD &&
p.pid != 0 &&
- pg_popcount((char *) &dest_flags, 1) == 1)
+ PG_POPCOUNT((char *) &dest_flags, 1) == 1)
{
List *buffer_list;
ListCell *cell;
diff --git a/src/backend/utils/adt/tsgistidx.c b/src/backend/utils/adt/tsgistidx.c
index 5698ee5502..d7a76faf31 100644
--- a/src/backend/utils/adt/tsgistidx.c
+++ b/src/backend/utils/adt/tsgistidx.c
@@ -489,7 +489,7 @@ gtsvector_same(PG_FUNCTION_ARGS)
static int32
sizebitvec(BITVECP sign, int siglen)
{
- return pg_popcount(sign, siglen);
+ return PG_POPCOUNT(sign, siglen);
}
static int
diff --git a/src/backend/utils/adt/varbit.c b/src/backend/utils/adt/varbit.c
index 8fcf3fb731..3f287cd54d 100644
--- a/src/backend/utils/adt/varbit.c
+++ b/src/backend/utils/adt/varbit.c
@@ -1212,7 +1212,7 @@ bit_bit_count(PG_FUNCTION_ARGS)
{
VarBit *arg = PG_GETARG_VARBIT_P(0);
- PG_RETURN_INT64(pg_popcount((char *) VARBITS(arg), VARBITBYTES(arg)));
+ PG_RETURN_INT64(PG_POPCOUNT((char *) VARBITS(arg), VARBITBYTES(arg)));
}
/*
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 8d28dd42ce..809e6a59ab 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -3151,7 +3151,7 @@ bytea_bit_count(PG_FUNCTION_ARGS)
{
bytea *t1 = PG_GETARG_BYTEA_PP(0);
- PG_RETURN_INT64(pg_popcount(VARDATA_ANY(t1), VARSIZE_ANY_EXHDR(t1)));
+ PG_RETURN_INT64(PG_POPCOUNT(VARDATA_ANY(t1), VARSIZE_ANY_EXHDR(t1)));
}
/*
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..1499985dfc 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,8 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_x86_64_choose.o \
+ pg_popcount_x86_64_accel.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..cf6e9fa06c 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,8 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_x86_64_choose.c',
+ 'pg_popcount_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index e629969035..f08820b35b 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -11,14 +11,6 @@
*-------------------------------------------------------------------------
*/
#include "c.h"
-
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
#include "port/pg_bitutils.h"
@@ -103,167 +95,6 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-// static inline int pg_popcount32_slow(uint32 word);
-// static inline int pg_popcount64_slow(uint64 word);
-// static uint64 pg_popcount_slow(const char *buf, int bytes);
-
-#ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static uint64 pg_popcount_choose(const char *buf, int bytes);
-static inline int pg_popcount32_fast(uint32 word);
-static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
-
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
-
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
- return (exx[2] & (1 << 23)) != 0; /* POPCNT */
-}
-
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static inline void set_function_pointers()
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-}
-
-static int
-pg_popcount32_choose(uint32 word)
-{
- set_function_pointers();
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- set_function_pointers();
- return pg_popcount64(word);
-}
-
-static uint64
-pg_popcount_choose(const char *buf, int bytes)
-{
- set_function_pointers();
- return pg_popcount(buf, bytes);
-}
-
-/*
- * pg_popcount32_fast
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount32_fast(uint32 word)
-{
-#ifdef _MSC_VER
- return __popcnt(word);
-#else
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-/*
- * pg_popcount64_fast
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount64_fast(uint64 word)
-{
-#ifdef _MSC_VER
- return __popcnt64(word);
-#else
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-/*
- * pg_popcount_fast
- * Returns the number of 1-bits in buf
- */
-static inline uint64
-pg_popcount_fast(const char *buf, int bytes)
-{
- uint64 popcnt = 0;
-
-#if SIZEOF_VOID_P >= 8
- /* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
- {
- const uint64 *words = (const uint64 *) buf;
-
- while (bytes >= 8)
- {
- popcnt += PG_POPCOUNT64(*words++);
- bytes -= 8;
- }
-
- buf = (const char *) words;
- }
-#else
- /* Process in 32-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(4, buf))
- {
- const uint32 *words = (const uint32 *) buf;
-
- while (bytes >= 4)
- {
- popcnt += PG_POPCOUNT32(*words++);
- bytes -= 4;
- }
-
- buf = (const char *) words;
- }
-#endif
-
- /* Process any remaining bytes */
- while (bytes--)
- popcnt += pg_number_of_ones[(unsigned char) *buf++];
-
- return popcnt;
-}
-
-#endif /* TRY_POPCNT_FAST */
-
/*
* pg_popcount32_slow
@@ -319,7 +150,7 @@ pg_popcount64_slow(uint64 word)
* pg_popcount_slow
* Returns the number of 1-bits in buf
*/
-uint64
+inline uint64
pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcount_x86_64_accel.c b/src/port/pg_popcount_x86_64_accel.c
new file mode 100644
index 0000000000..d63e8aa30f
--- /dev/null
+++ b/src/port/pg_popcount_x86_64_accel.c
@@ -0,0 +1,101 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_x86_64_accel.c
+ * POPCNT-accelerated popcount functions for x86_64.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_x86_64_accel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * pg_popcount32_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount32_fast(uint32 word)
+{
+#ifdef _MSC_VER
+ return __popcnt(word);
+#else
+ uint32 res;
+
+__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount64_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount64_fast(uint64 word)
+{
+#ifdef _MSC_VER
+ return __popcnt64(word);
+#else
+ uint64 res;
+
+__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount_fast
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += PG_POPCOUNT64(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += PG_POPCOUNT32(*words++);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_x86_64_choose.c b/src/port/pg_popcount_x86_64_choose.c
new file mode 100644
index 0000000000..1a0022a0b3
--- /dev/null
+++ b/src/port/pg_popcount_x86_64_choose.c
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_x86_64_choose.c
+ * Runtime selection of the popcount implementation on x86_64.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_x86_64_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+uint64 pg_popcount_fast(const char *buf, int bytes);
+
+static int pg_popcount32_choose(uint32 word);
+static int pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+
+int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+static bool
+pg_popcount_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (exx[2] & (1 << 23)) != 0; /* POPCNT */
+}
+
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static inline void set_function_pointers()
+{
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
+ }
+}
+
+static int
+pg_popcount32_choose(uint32 word)
+{
+ set_function_pointers();
+ return pg_popcount32(word);
+}
+
+static int
+pg_popcount64_choose(uint64 word)
+{
+ set_function_pointers();
+ return pg_popcount64(word);
+}
+
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ set_function_pointers();
+ return pg_popcount(buf, bytes);
+}
+#endif /* TRY_POPCNT_FAST */
--
2.34.1
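The self-replacing function pointer used by pg_popcount_x86_64_choose.c can be hard to follow inside a diff, so here is a minimal standalone sketch of the same idea. It is an assumption-laden illustration rather than the patch's code: every sketch_* name and the SKETCH_HAVE_CPUID macro are hypothetical, it covers only the 64-bit variant, and it assumes GCC or Clang on x86_64 for the CPUID and POPCNT path (elsewhere it simply selects the portable fallback).

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1		/* hypothetical macro, not from the patch */
#endif

static int	sketch_popcount64_choose(uint64_t word);

/* Starts at the chooser; repointed at a real implementation on first call. */
static int	(*sketch_popcount64) (uint64_t word) = sketch_popcount64_choose;

/* Portable fallback: clear the lowest set bit until none remain. */
static int
sketch_popcount64_slow(uint64_t word)
{
	int			n = 0;

	while (word)
	{
		word &= word - 1;
		n++;
	}
	return n;
}

#ifdef SKETCH_HAVE_CPUID
/* Uses the POPCNT instruction directly; only safe after the CPUID check. */
static int
sketch_popcount64_fast(uint64_t word)
{
	uint64_t	res;

	__asm__ __volatile__("popcntq %1,%0" : "=r"(res) : "rm"(word) : "cc");
	return (int) res;
}

/* CPUID leaf 1, ECX bit 23 reports POPCNT support. */
static int
sketch_popcnt_available(void)
{
	unsigned int exx[4] = {0, 0, 0, 0};

	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
	return (exx[2] & (1u << 23)) != 0;
}
#endif

/* First call lands here, picks an implementation, then forwards the call. */
static int
sketch_popcount64_choose(uint64_t word)
{
#ifdef SKETCH_HAVE_CPUID
	if (sketch_popcnt_available())
		sketch_popcount64 = sketch_popcount64_fast;
	else
#endif
		sketch_popcount64 = sketch_popcount64_slow;

	return sketch_popcount64(word);
}
```

The point of the pattern is that the CPUID query runs once; every later call through sketch_popcount64 goes straight to the selected implementation at the cost of one indirect call.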
Attachment: v10-0003-Feature-Add-POPCNT512-accelerated-functionality-for-.patch (application/octet-stream)
From b586e558d8eec0f5d5c45dfabec317ad5300e8d6 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 19 Mar 2024 13:40:04 -0700
Subject: [PATCH 3/3] [Feature] Add POPCNT512 accelerated functionality for
x86_64.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
config/c-compiler.m4 | 37 +++++
configure | 205 +++++++++++++++++++++++++++
configure.ac | 44 ++++++
meson.build | 72 ++++++++++
src/Makefile.global.in | 1 +
src/backend/commands/user.c | 2 +-
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 6 +-
src/makefiles/meson.build | 1 +
src/port/Makefile | 5 +
src/port/meson.build | 6 +-
src/port/pg_popcount_x86_64_accel.c | 33 +++++
src/port/pg_popcount_x86_64_choose.c | 68 ++++++++-
13 files changed, 479 insertions(+), 13 deletions(-)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..94e3e713aa 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,40 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT intrinsics, which
+# require the AVX512F and AVX512VPOPCNTDQ CPUID features.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>],
+ [const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..0fbfc7c78f 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17474,175 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __immintrin" >&5
+$as_echo_n "checking for __immintrin... " >&6; }
+if ${pgac_cv__immintrin+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+/* Don't exclude code so added return. */
+ return 1701;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__immintrin="yes"
+else
+ pgac_cv__immintrin="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__immintrin" >&5
+$as_echo "$pgac_cv__immintrin" >&6; }
+if test x"$pgac_cv__immintrin" = x"yes"; then
+
+$as_echo "#define HAVE__IMMINTRIN 1" >>confdefs.h
+
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+
+$as_echo "#define HAVE__AVX512_POPCNT 1" >>confdefs.h
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..3c741d457d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2075,38 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+AC_CACHE_CHECK([for __immintrin], [pgac_cv__immintrin],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [[/* Don't exclude code so added return. */
+ return 1701;
+ ]])],
+ [pgac_cv__immintrin="yes"],
+ [pgac_cv__immintrin="no"])])
+if test x"$pgac_cv__immintrin" = x"yes"; then
+ AC_DEFINE(HAVE__IMMINTRIN, 1, [Define to 1 if you have immintrin.])
+fi
+
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ AC_DEFINE(HAVE__AVX512_POPCNT, 1, [Define to 1 if you have cpu
+ support for AVX512 POPCNT.])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index c8fdfeb0ec..d661405fee 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,37 @@ elif cc.links('''
endif
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.has_header('immintrin.h',
+ include_directories: postgres_inc, args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1,
+ description: 'Define to 1 if you have the immintrin.h header file.')
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -2157,6 +2188,47 @@ elif host_cpu == 'ppc' or host_cpu == 'ppc64'
endif
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ test_flags = ['-mavx512vpopcntdq']
+ if host_system == 'windows'
+ test_flags = ['/arch:AVX512']
+ endif
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ #include <stdlib.h>
+ #include <string.h>
+ int main(void)
+ {
+ const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ }'''
+ if cc.links(prog, name: '_mm512_* methods with -mavx512vpopcntdq flag.',
+ args: test_c_args + test_flags)
+ have_avx512_popcnt = true
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cdata.set('HAVE__AVX512_POPCNT', 1)
+ cflags_avx512_popcnt = test_flags
+ else
+ have_avx512_popcnt = false
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt = []
+ endif # compile/link test
+endif # host_cpu check
+
###############################################################
# Library / OS tests
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index c75cde2e8e..77d72daa87 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -2330,7 +2330,7 @@ plan_single_revoke(CatCList *memlist, RevokeRoleGrantAction *actions,
* wouldn't work properly if such syntax were added, so assert that our
* caller isn't trying to do that.
*/
- Assert(pg_popcount32(popt->specified) <= 1);
+ Assert(PG_POPCOUNT32(popt->specified) <= 1);
for (i = 0; i < memlist->n_members; ++i)
{
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..33a831e768 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -558,6 +558,18 @@
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
+
+/* Define to 1 if you have AVX512. */
+#undef HAVE__AVX512_POPCNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 477e00e0da..ead4f79def 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -300,9 +300,9 @@ pg_ceil_log2_64(uint64 num)
#ifdef TRY_POPCNT_FAST
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
-extern int pg_popcount32_slow(uint32 word);
-extern int pg_popcount64_slow(uint64 word);
-extern uint64 pg_popcount_slow(const char *buf, int bytes);
+extern inline int pg_popcount32_slow(uint32 word);
+extern inline int pg_popcount64_slow(uint64 word);
+extern inline uint64 pg_popcount_slow(const char *buf, int bytes);
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..ee3647282e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,6 +100,7 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index 1499985dfc..66ef151565 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -89,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_popcount_x86_64_accel.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_x86_64_accel_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_x86_64_accel_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index cf6e9fa06c..0647e7a4f7 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,7 +8,6 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_popcount_x86_64_choose.c',
- 'pg_popcount_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -86,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_x86_64_accel', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -100,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_popcount_x86_64_accel.c b/src/port/pg_popcount_x86_64_accel.c
index d63e8aa30f..f65184ddce 100644
--- a/src/port/pg_popcount_x86_64_accel.c
+++ b/src/port/pg_popcount_x86_64_accel.c
@@ -13,10 +13,15 @@
#include "c.h"
#include "port/pg_bitutils.h"
+#if defined(HAVE__IMMINTRIN)
+#include <immintrin.h>
+#endif
+
#ifdef TRY_POPCNT_FAST
int pg_popcount32_fast(uint32 word);
int pg_popcount64_fast(uint64 word);
uint64 pg_popcount_fast(const char *buf, int bytes);
+uint64 pg_popcount512_fast(const char *buf, int bytes);
/*
* pg_popcount32_fast
@@ -98,4 +103,32 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * Use AVX-512 intrinsics on supported CPUs, or fall back to the non-512 fast
+ * implementation and use the best 64-bit fast methods. If no fast
+ * methods are used this will fall back to __builtin_* or pure software.
+ */
+uint64
+pg_popcount512_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+ #if defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1
+ __m512i accumulator = _mm512_setzero_si512();
+
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+#endif /* defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1 */
+
+ /* Process any remaining bytes */
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_x86_64_choose.c b/src/port/pg_popcount_x86_64_choose.c
index 1a0022a0b3..d25d251e9b 100644
--- a/src/port/pg_popcount_x86_64_choose.c
+++ b/src/port/pg_popcount_x86_64_choose.c
@@ -25,6 +25,7 @@
int pg_popcount32_fast(uint32 word);
int pg_popcount64_fast(uint64 word);
uint64 pg_popcount_fast(const char *buf, int bytes);
+uint64 pg_popcount512_fast(const char *buf, int bytes);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
@@ -53,6 +54,52 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is
+ * available. This is similar to the method above; see
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally, we make sure the xgetbv result is consistent with the CPUID
+ * results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* Check for AVX512VPOPCNTDQ and AVX512F */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0)
+ {
+ /*
+ * CPUID succeeded, does the current running OS support the
+ * ZMM registers which are required for AVX512? This check is
+ * required to make sure an old OS on a new CPU is correctly
+ * checked or a VM hypervisor is not excluding AVX512 ZMM
+ * support in the VM; see "5.1.9 Detection of AVX Instructions"
+ * https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
+ */
+ uint64 xcr = 0;
+#ifdef _MSC_VER
+ uint64 highlow = _xgetbv(xcr);
+
+ return (highlow & 0xE0) != 0;
+#else
+ uint32 high;
+ uint32 low;
+
+ __asm__ __volatile__("xgetbv\t\n" : "=a"(low), "=d"(high) : "c"(xcr));
+ return (low & 0xE0) != 0;
+#endif
+ } /* POPCNT 512 */
+ return false;
+}
+
/*
* These functions get called on the first call to pg_popcount32 etc.
* They detect whether we can use the asm implementations, and replace
@@ -61,17 +108,26 @@ pg_popcount_available(void)
*/
static inline void set_function_pointers()
{
- if (pg_popcount_available())
- {
+ if (pg_popcount512_available())
+ { /* If POPCNT512 is available, it is assumed that POPCNTQ is too. */
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
+ pg_popcount = pg_popcount512_fast;
}
else
{
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
+ }
}
}
--
2.34.1
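For readers following the detection logic in pg_popcount512_available(), the bit tests involved can be sketched in portable C. This is not the patch's code. The function name avx512_popcnt_usable and the idea of passing the register values in as parameters are assumptions made for illustration, so the logic can be exercised without executing CPUID or XGETBV. The sketch is also stricter than the patch as posted: it checks OSXSAVE first and requires all of the SSE, AVX, opmask and ZMM state bits in XCR0 (mask 0xE6), rather than accepting any one of the bits in 0xE0.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, ECX bit 27 (OSXSAVE): the OS enabled XSAVE, so XGETBV works. */
#define LEAF1_ECX_OSXSAVE          (UINT32_C(1) << 27)
/* CPUID leaf 7 subleaf 0, EBX bit 16: AVX512F (0x00010000 in the patch). */
#define LEAF7_EBX_AVX512F          (UINT32_C(1) << 16)
/* CPUID leaf 7 subleaf 0, ECX bit 14: AVX512VPOPCNTDQ (0x00004000). */
#define LEAF7_ECX_AVX512VPOPCNTDQ  (UINT32_C(1) << 14)
/* XCR0 bits 1-2 (SSE, AVX state) and 5-7 (opmask, ZMM_Hi256, Hi16_ZMM). */
#define XCR0_AVX512_STATE          UINT64_C(0xE6)

/*
 * Hypothetical helper: decide from already-fetched register values whether
 * the AVX-512 popcount path may be used.  The caller is assumed to have
 * obtained leaf1_ecx, leaf7_ebx and leaf7_ecx from CPUID and xcr0 from
 * XGETBV with ECX = 0.
 */
bool
avx512_popcnt_usable(uint32_t leaf1_ecx, uint32_t leaf7_ebx,
                     uint32_t leaf7_ecx, uint64_t xcr0)
{
    /* Without OSXSAVE, XGETBV would fault, so xcr0 cannot be trusted. */
    if ((leaf1_ecx & LEAF1_ECX_OSXSAVE) == 0)
        return false;
    if ((leaf7_ebx & LEAF7_EBX_AVX512F) == 0)
        return false;
    if ((leaf7_ecx & LEAF7_ECX_AVX512VPOPCNTDQ) == 0)
        return false;
    /* Every required state component must be enabled by the OS. */
    return (xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE;
}
```

Requiring the full mask matters for the "old OS on a new CPU" and hypervisor cases named in the patch comment: an XCR0 value with only some of bits 5 to 7 set passes a "!= 0" test but is rejected here.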
On Wed, 20 Mar 2024 at 11:56, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
Changed in this patch set.
Thanks for rebasing.
I don't think there's any need to mention Intel in each of the
following comments:
+# Check for Intel AVX512 intrinsics to do POPCNT calculations.
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
AMD's Zen4 also has AVX512, so it's misleading to indicate it's an
Intel only instruction. Also, writing the date isn't necessary as we
have "git blame"
David
-----Original Message-----
From: David Rowley <dgrowleyml@gmail.com>
Sent: Tuesday, March 19, 2024 9:26 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>
AMD's Zen4 also has AVX512, so it's misleading to indicate it's an Intel only
instruction. Also, writing the date isn't necessary as we have "git blame"
Fixed.
Thanks,
Paul
Attachments:
v11-0001-Refactor-inlining-and-direct-calls-for-_slow-functio.patch (application/octet-stream)
From e9483d2354bd58210805b17fa78091ecc007c612 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 19 Mar 2024 13:35:53 -0700
Subject: [PATCH 1/3] [Refactor] inlining and direct calls for *_slow
functions.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
src/backend/access/heap/visibilitymap.c | 6 +-
src/include/nodes/bitmapset.h | 4 +-
src/include/port/pg_bitutils.h | 17 +++--
src/port/pg_bitutils.c | 90 +++++--------------------
4 files changed, 34 insertions(+), 83 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..28dc497b79 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -419,14 +419,14 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
if (all_frozen == NULL)
{
for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
+ nvisible += PG_POPCOUNT64(map[i] & VISIBLE_MASK64);
}
else
{
for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
{
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
+ nvisible += PG_POPCOUNT64(map[i] & VISIBLE_MASK64);
+ nfrozen += PG_POPCOUNT64(map[i] & FROZEN_MASK64);
}
}
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 283bea5ea9..b5631d153e 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -77,11 +77,11 @@ typedef enum
#if BITS_PER_BITMAPWORD == 32
#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
+#define bmw_popcount(w) PG_POPCOUNT32(w)
#elif BITS_PER_BITMAPWORD == 64
#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
+#define bmw_popcount(w) PG_POPCOUNT64(w)
#else
#error "invalid BITS_PER_BITMAPWORD"
#endif
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..477e00e0da 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -300,16 +300,23 @@ pg_ceil_log2_64(uint64 num)
#ifdef TRY_POPCNT_FAST
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
-
+#define PG_POPCOUNT32(x) pg_popcount32(x)
+#define PG_POPCOUNT64(x) pg_popcount64(x)
+#define PG_POPCOUNT(x,y) pg_popcount(x,y)
#else
/* Use a portable implementation -- no need for a function pointer. */
-extern int pg_popcount32(uint32 word);
-extern int pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
-
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
+#define PG_POPCOUNT32(x) pg_popcount32_slow(x)
+#define PG_POPCOUNT64(x) pg_popcount64_slow(x)
+#define PG_POPCOUNT(x,y) pg_popcount_slow(x,y)
#endif /* TRY_POPCNT_FAST */
/*
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..e629969035 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,9 +103,9 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static inline int pg_popcount32_slow(uint32 word);
-static inline int pg_popcount64_slow(uint64 word);
-static uint64 pg_popcount_slow(const char *buf, int bytes);
+// static inline int pg_popcount32_slow(uint32 word);
+// static inline int pg_popcount64_slow(uint64 word);
+// static uint64 pg_popcount_slow(const char *buf, int bytes);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
@@ -119,9 +119,6 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
-#endif /* TRY_POPCNT_FAST */
-
-#ifdef TRY_POPCNT_FAST
/*
* Return true if CPUID indicates that the POPCNT instruction is available.
@@ -148,8 +145,7 @@ pg_popcount_available(void)
* the function pointers so that subsequent calls are routed directly to
* the chosen implementation.
*/
-static int
-pg_popcount32_choose(uint32 word)
+static inline void set_function_pointers()
{
if (pg_popcount_available())
{
@@ -163,45 +159,26 @@ pg_popcount32_choose(uint32 word)
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
}
+}
+static int
+pg_popcount32_choose(uint32 word)
+{
+ set_function_pointers();
return pg_popcount32(word);
}
static int
pg_popcount64_choose(uint64 word)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ set_function_pointers();
return pg_popcount64(word);
}
static uint64
pg_popcount_choose(const char *buf, int bytes)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ set_function_pointers();
return pg_popcount(buf, bytes);
}
@@ -243,7 +220,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+static inline uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -256,7 +233,7 @@ pg_popcount_fast(const char *buf, int bytes)
while (bytes >= 8)
{
- popcnt += pg_popcount64_fast(*words++);
+ popcnt += PG_POPCOUNT64(*words++);
bytes -= 8;
}
@@ -270,7 +247,7 @@ pg_popcount_fast(const char *buf, int bytes)
while (bytes >= 4)
{
- popcnt += pg_popcount32_fast(*words++);
+ popcnt += PG_POPCOUNT32(*words++);
bytes -= 4;
}
@@ -292,7 +269,7 @@ pg_popcount_fast(const char *buf, int bytes)
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static inline int
+inline int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -314,7 +291,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static inline int
+inline int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -342,7 +319,7 @@ pg_popcount64_slow(uint64 word)
* pg_popcount_slow
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -383,36 +360,3 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
-
-#ifndef TRY_POPCNT_FAST
-
-/*
- * When the POPCNT instruction is not available, there's no point in using
- * function pointers to vary the implementation between the fast and slow
- * method. We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
- * the slow versions here.
- */
-int
-pg_popcount32(uint32 word)
-{
- return pg_popcount32_slow(word);
-}
-
-int
-pg_popcount64(uint64 word)
-{
- return pg_popcount64_slow(word);
-}
-
-/*
- * pg_popcount
- * Returns the number of 1-bits in buf
- */
-uint64
-pg_popcount(const char *buf, int bytes)
-{
- return pg_popcount_slow(buf, bytes);
-}
-
-#endif /* !TRY_POPCNT_FAST */
--
2.34.1
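The set_function_pointers() refactoring in this patch relies on a lazy-dispatch pattern: the public function pointer initially targets a "choose" routine, which picks an implementation on first use and overwrites the pointer. A minimal, self-contained sketch of that pattern follows. All names in it (popcount32_ptr, fast_path_available, popcount32_nibble, popcount32_bitloop) are hypothetical stand-ins and not PostgreSQL symbols, and the probe is hard-wired rather than querying CPUID.

```c
#include <stdint.h>

int popcount32_bitloop(uint32_t word);
int popcount32_nibble(uint32_t word);
int popcount32_choose(uint32_t word);

/* Callers always go through this pointer; it starts at the chooser. */
int (*popcount32_ptr)(uint32_t word) = popcount32_choose;

/* Hypothetical probe; a real one would inspect CPUID results. */
int
fast_path_available(void)
{
    return 1;
}

/* Stand-in for the "slow" routine: clear the lowest set bit until zero. */
int
popcount32_bitloop(uint32_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

/* Stand-in for the "fast" routine: 4-bit table lookup. */
int
popcount32_nibble(uint32_t word)
{
    static const int ones[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                 1, 2, 2, 3, 2, 3, 3, 4};
    int count = 0;

    for (; word != 0; word >>= 4)
        count += ones[word & 0xF];
    return count;
}

/* Runs once: installs the chosen routine, then forwards the first call. */
int
popcount32_choose(uint32_t word)
{
    popcount32_ptr = fast_path_available() ? popcount32_nibble
                                           : popcount32_bitloop;
    return popcount32_ptr(word);
}
```

After the first call through popcount32_ptr, later calls reach the selected routine directly, so the probe cost is paid once per process.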
v11-0002-Refactor-Seperated-slow-fast-and-choose-functionalit.patch (application/octet-stream)
From e5e4cac323b913e9fcdbd17d6b07316a21f7ff5c Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 19 Mar 2024 13:37:31 -0700
Subject: [PATCH 2/3] [Refactor] Seperated slow, fast, and choose functionality
into files.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
contrib/intarray/_intbig_gist.c | 2 +-
contrib/ltree/_ltree_gist.c | 2 +-
contrib/pageinspect/heapfuncs.c | 4 +-
contrib/pg_trgm/trgm_gist.c | 2 +-
contrib/pg_walinspect/pg_walinspect.c | 2 +-
src/backend/lib/bloomfilter.c | 2 +-
src/backend/postmaster/syslogger.c | 2 +-
src/backend/utils/adt/tsgistidx.c | 2 +-
src/backend/utils/adt/varbit.c | 2 +-
src/backend/utils/adt/varlena.c | 2 +-
src/port/Makefile | 2 +
src/port/meson.build | 2 +
src/port/pg_bitutils.c | 171 +-------------------------
src/port/pg_popcount_x86_64_accel.c | 101 +++++++++++++++
src/port/pg_popcount_x86_64_choose.c | 98 +++++++++++++++
15 files changed, 215 insertions(+), 181 deletions(-)
create mode 100644 src/port/pg_popcount_x86_64_accel.c
create mode 100644 src/port/pg_popcount_x86_64_choose.c
diff --git a/contrib/intarray/_intbig_gist.c b/contrib/intarray/_intbig_gist.c
index 9699fbf3b4..a12ea7ed9b 100644
--- a/contrib/intarray/_intbig_gist.c
+++ b/contrib/intarray/_intbig_gist.c
@@ -210,7 +210,7 @@ g_intbig_compress(PG_FUNCTION_ARGS)
static int32
sizebitvec(BITVECP sign, int siglen)
{
- return pg_popcount(sign, siglen);
+ return PG_POPCOUNT(sign, siglen);
}
static int
diff --git a/contrib/ltree/_ltree_gist.c b/contrib/ltree/_ltree_gist.c
index e89a39a5b5..bd66ec2e65 100644
--- a/contrib/ltree/_ltree_gist.c
+++ b/contrib/ltree/_ltree_gist.c
@@ -180,7 +180,7 @@ _ltree_union(PG_FUNCTION_ARGS)
static int32
sizebitvec(BITVECP sign, int siglen)
{
- return pg_popcount((const char *) sign, siglen);
+ return PG_POPCOUNT((const char *) sign, siglen);
}
static int
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 3faeabc711..089842962f 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -527,8 +527,8 @@ heap_tuple_infomask_flags(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- bitcnt = pg_popcount((const char *) &t_infomask, sizeof(uint16)) +
- pg_popcount((const char *) &t_infomask2, sizeof(uint16));
+ bitcnt = PG_POPCOUNT((const char *) &t_infomask, sizeof(uint16)) +
+ PG_POPCOUNT((const char *) &t_infomask2, sizeof(uint16));
/* If no flags, return a set of empty arrays */
if (bitcnt <= 0)
diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index 9ef2e38560..850316196f 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -648,7 +648,7 @@ gtrgm_same(PG_FUNCTION_ARGS)
static int32
sizebitvec(BITVECP sign, int siglen)
{
- return pg_popcount(sign, siglen);
+ return PG_POPCOUNT(sign, siglen);
}
static int
diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index ee2918726d..93a7b4842a 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -303,7 +303,7 @@ GetWALBlockInfo(FunctionCallInfo fcinfo, XLogReaderState *record,
block_fpi_len = blk->bimg_len;
/* Construct and save block_fpi_info */
- bitcnt = pg_popcount((const char *) &blk->bimg_info,
+ bitcnt = PG_POPCOUNT((const char *) &blk->bimg_info,
sizeof(uint8));
flags = (Datum *) palloc0(sizeof(Datum) * bitcnt);
if ((blk->bimg_info & BKPIMAGE_HAS_HOLE) != 0)
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
index 360d21ca45..c01b069c01 100644
--- a/src/backend/lib/bloomfilter.c
+++ b/src/backend/lib/bloomfilter.c
@@ -187,7 +187,7 @@ double
bloom_prop_bits_set(bloom_filter *filter)
{
int bitset_bytes = filter->m / BITS_PER_BYTE;
- uint64 bits_set = pg_popcount((char *) filter->bitset, bitset_bytes);
+ uint64 bits_set = PG_POPCOUNT((char *) filter->bitset, bitset_bytes);
return bits_set / (double) filter->m;
}
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 08efe74cc9..85c57b3154 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -898,7 +898,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
if (p.nuls[0] == '\0' && p.nuls[1] == '\0' &&
p.len > 0 && p.len <= PIPE_MAX_PAYLOAD &&
p.pid != 0 &&
- pg_popcount((char *) &dest_flags, 1) == 1)
+ PG_POPCOUNT((char *) &dest_flags, 1) == 1)
{
List *buffer_list;
ListCell *cell;
diff --git a/src/backend/utils/adt/tsgistidx.c b/src/backend/utils/adt/tsgistidx.c
index 5698ee5502..d7a76faf31 100644
--- a/src/backend/utils/adt/tsgistidx.c
+++ b/src/backend/utils/adt/tsgistidx.c
@@ -489,7 +489,7 @@ gtsvector_same(PG_FUNCTION_ARGS)
static int32
sizebitvec(BITVECP sign, int siglen)
{
- return pg_popcount(sign, siglen);
+ return PG_POPCOUNT(sign, siglen);
}
static int
diff --git a/src/backend/utils/adt/varbit.c b/src/backend/utils/adt/varbit.c
index 8fcf3fb731..3f287cd54d 100644
--- a/src/backend/utils/adt/varbit.c
+++ b/src/backend/utils/adt/varbit.c
@@ -1212,7 +1212,7 @@ bit_bit_count(PG_FUNCTION_ARGS)
{
VarBit *arg = PG_GETARG_VARBIT_P(0);
- PG_RETURN_INT64(pg_popcount((char *) VARBITS(arg), VARBITBYTES(arg)));
+ PG_RETURN_INT64(PG_POPCOUNT((char *) VARBITS(arg), VARBITBYTES(arg)));
}
/*
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 8d28dd42ce..809e6a59ab 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -3151,7 +3151,7 @@ bytea_bit_count(PG_FUNCTION_ARGS)
{
bytea *t1 = PG_GETARG_BYTEA_PP(0);
- PG_RETURN_INT64(pg_popcount(VARDATA_ANY(t1), VARSIZE_ANY_EXHDR(t1)));
+ PG_RETURN_INT64(PG_POPCOUNT(VARDATA_ANY(t1), VARSIZE_ANY_EXHDR(t1)));
}
/*
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..1499985dfc 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,8 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_x86_64_choose.o \
+ pg_popcount_x86_64_accel.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..cf6e9fa06c 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,8 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_x86_64_choose.c',
+ 'pg_popcount_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index e629969035..f08820b35b 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -11,14 +11,6 @@
*-------------------------------------------------------------------------
*/
#include "c.h"
-
-#ifdef HAVE__GET_CPUID
-#include <cpuid.h>
-#endif
-#ifdef HAVE__CPUID
-#include <intrin.h>
-#endif
-
#include "port/pg_bitutils.h"
@@ -103,167 +95,6 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-// static inline int pg_popcount32_slow(uint32 word);
-// static inline int pg_popcount64_slow(uint64 word);
-// static uint64 pg_popcount_slow(const char *buf, int bytes);
-
-#ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static uint64 pg_popcount_choose(const char *buf, int bytes);
-static inline int pg_popcount32_fast(uint32 word);
-static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
-
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
-
-/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
- */
-static bool
-pg_popcount_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
-
- return (exx[2] & (1 << 23)) != 0; /* POPCNT */
-}
-
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static inline void set_function_pointers()
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-}
-
-static int
-pg_popcount32_choose(uint32 word)
-{
- set_function_pointers();
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- set_function_pointers();
- return pg_popcount64(word);
-}
-
-static uint64
-pg_popcount_choose(const char *buf, int bytes)
-{
- set_function_pointers();
- return pg_popcount(buf, bytes);
-}
-
-/*
- * pg_popcount32_fast
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount32_fast(uint32 word)
-{
-#ifdef _MSC_VER
- return __popcnt(word);
-#else
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-/*
- * pg_popcount64_fast
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount64_fast(uint64 word)
-{
-#ifdef _MSC_VER
- return __popcnt64(word);
-#else
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
-}
-
-/*
- * pg_popcount_fast
- * Returns the number of 1-bits in buf
- */
-static inline uint64
-pg_popcount_fast(const char *buf, int bytes)
-{
- uint64 popcnt = 0;
-
-#if SIZEOF_VOID_P >= 8
- /* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
- {
- const uint64 *words = (const uint64 *) buf;
-
- while (bytes >= 8)
- {
- popcnt += PG_POPCOUNT64(*words++);
- bytes -= 8;
- }
-
- buf = (const char *) words;
- }
-#else
- /* Process in 32-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(4, buf))
- {
- const uint32 *words = (const uint32 *) buf;
-
- while (bytes >= 4)
- {
- popcnt += PG_POPCOUNT32(*words++);
- bytes -= 4;
- }
-
- buf = (const char *) words;
- }
-#endif
-
- /* Process any remaining bytes */
- while (bytes--)
- popcnt += pg_number_of_ones[(unsigned char) *buf++];
-
- return popcnt;
-}
-
-#endif /* TRY_POPCNT_FAST */
-
/*
* pg_popcount32_slow
@@ -319,7 +150,7 @@ pg_popcount64_slow(uint64 word)
* pg_popcount_slow
* Returns the number of 1-bits in buf
*/
-uint64
+inline uint64
pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcount_x86_64_accel.c b/src/port/pg_popcount_x86_64_accel.c
new file mode 100644
index 0000000000..d63e8aa30f
--- /dev/null
+++ b/src/port/pg_popcount_x86_64_accel.c
@@ -0,0 +1,101 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_x86_64_accel.c
+ * Miscellaneous functions for bit-wise operations.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_x86_64_accel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * pg_popcount32_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount32_fast(uint32 word)
+{
+#ifdef _MSC_VER
+ return __popcnt(word);
+#else
+ uint32 res;
+
+__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount64_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount64_fast(uint64 word)
+{
+#ifdef _MSC_VER
+ return __popcnt64(word);
+#else
+ uint64 res;
+
+__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount_fast
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += PG_POPCOUNT64(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += PG_POPCOUNT32(*words++);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_x86_64_choose.c b/src/port/pg_popcount_x86_64_choose.c
new file mode 100644
index 0000000000..1a0022a0b3
--- /dev/null
+++ b/src/port/pg_popcount_x86_64_choose.c
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_x86_64_choose.c
+ * Miscellaneous functions for bit-wise operations.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_x86_64_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+uint64 pg_popcount_fast(const char *buf, int bytes);
+
+static int pg_popcount32_choose(uint32 word);
+static int pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+
+int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+static bool
+pg_popcount_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (exx[2] & (1 << 23)) != 0; /* POPCNT */
+}
+
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static inline void set_function_pointers()
+{
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
+ }
+}
+
+static int
+pg_popcount32_choose(uint32 word)
+{
+ set_function_pointers();
+ return pg_popcount32(word);
+}
+
+static int
+pg_popcount64_choose(uint64 word)
+{
+ set_function_pointers();
+ return pg_popcount64(word);
+}
+
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ set_function_pointers();
+ return pg_popcount(buf, bytes);
+}
+#endif /* TRY_POPCNT_FAST */
--
2.34.1
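The pg_popcount_fast() routine moved by this patch has a two-phase shape: count whole machine words first, then finish the leftover bytes one at a time. A portable sketch of that shape is below. It is an illustration under stated assumptions, not the patch's implementation: the names popcount64_portable and popcount_buffer are hypothetical, the per-word count uses a plain bit loop instead of the POPCNT instruction, and words are fetched with memcpy so that unaligned buffers also take the word path (the patch instead uses the word path only when the buffer is aligned).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count set bits by clearing the lowest set bit until none remain. */
static int
popcount64_portable(uint64_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

/* Hypothetical helper: number of 1-bits in the first 'bytes' bytes of buf. */
uint64_t
popcount_buffer(const char *buf, size_t bytes)
{
    uint64_t total = 0;

    /* Phase 1: whole 8-byte words. */
    while (bytes >= sizeof(uint64_t))
    {
        uint64_t word;

        /* memcpy avoids an unaligned pointer dereference. */
        memcpy(&word, buf, sizeof(word));
        total += (uint64_t) popcount64_portable(word);
        buf += sizeof(word);
        bytes -= sizeof(word);
    }

    /* Phase 2: remaining tail bytes. */
    while (bytes-- > 0)
        total += (uint64_t) popcount64_portable((unsigned char) *buf++);

    return total;
}
```

The AVX-512 routine in the following patch adds a third, wider phase in front of these two: 64-byte blocks first, with whatever is left handed to the 64-bit routine.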
v11-0003-Feature-Add-POPCNT512-accelerated-functionality-for-.patch (application/octet-stream)
From b586e558d8eec0f5d5c45dfabec317ad5300e8d6 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Tue, 19 Mar 2024 13:40:04 -0700
Subject: [PATCH 3/3] [Feature] Add POPCNT512 accelerated functionality for
x86_64.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
config/c-compiler.m4 | 37 +++++
configure | 205 +++++++++++++++++++++++++++
configure.ac | 44 ++++++
meson.build | 72 ++++++++++
src/Makefile.global.in | 1 +
src/backend/commands/user.c | 2 +-
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 6 +-
src/makefiles/meson.build | 1 +
src/port/Makefile | 5 +
src/port/meson.build | 6 +-
src/port/pg_popcount_x86_64_accel.c | 33 +++++
src/port/pg_popcount_x86_64_choose.c | 68 ++++++++-
13 files changed, 479 insertions(+), 13 deletions(-)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..94e3e713aa 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,40 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT instructions using
+# intrinsics used in CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>],
+ [const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..0fbfc7c78f 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17474,175 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __immintrin" >&5
+$as_echo_n "checking for __immintrin... " >&6; }
+if ${pgac_cv__immintrin+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+/* Don't exclude code so added return. */
+ return 1701;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__immintrin="yes"
+else
+ pgac_cv__immintrin="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__immintrin" >&5
+$as_echo "$pgac_cv__immintrin" >&6; }
+if test x"$pgac_cv__immintrin" = x"yes"; then
+
+$as_echo "#define HAVE__IMMINTRIN 1" >>confdefs.h
+
+fi
+
+# Check for AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+
+$as_echo "#define HAVE__AVX512_POPCNT 1" >>confdefs.h
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..3c741d457d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2075,38 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+AC_CACHE_CHECK([for __immintrin], [pgac_cv__immintrin],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [[/* Don't exclude code so added return. */
+ return 1701;
+ ]])],
+ [pgac_cv__immintrin="yes"],
+ [pgac_cv__immintrin="no"])])
+if test x"$pgac_cv__immintrin" = x"yes"; then
+ AC_DEFINE(HAVE__IMMINTRIN, 1, [Define to 1 if you have immintrin.])
+fi
+
+# Check for AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ AC_DEFINE(HAVE__AVX512_POPCNT, 1, [Define to 1 if you have cpu
+ support for AVX512 POPCNT.])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index c8fdfeb0ec..d661405fee 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,37 @@ elif cc.links('''
endif
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.has_header('immintrin.h',
+ include_directories: postgres_inc, args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1,
+ description: 'Define to 1 if you have the immintrin.h header file.')
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -2157,6 +2188,47 @@ elif host_cpu == 'ppc' or host_cpu == 'ppc64'
endif
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ test_flags = ['-mavx512vpopcntdq']
+ if host_system == 'windows'
+ test_flags = ['/arch:AVX512']
+ endif
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ #include <stdlib.h>
+ #include <string.h>
+ int main(void)
+ {
+ const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ }'''
+ if cc.links(prog, name: '_mm512_* methods with -mavx512vpopcntdq flag.',
+ args: test_c_args + test_flags)
+ have_avx512_popcnt = true
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cdata.set('HAVE__AVX512_POPCNT', 1)
+ cflags_avx512_popcnt = test_flags
+ else
+ have_avx512_popcnt = false
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt = []
+ endif # compile/link test
+endif # host_cpu check
+
###############################################################
# Library / OS tests
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index c75cde2e8e..77d72daa87 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -2330,7 +2330,7 @@ plan_single_revoke(CatCList *memlist, RevokeRoleGrantAction *actions,
* wouldn't work properly if such syntax were added, so assert that our
* caller isn't trying to do that.
*/
- Assert(pg_popcount32(popt->specified) <= 1);
+ Assert(PG_POPCOUNT32(popt->specified) <= 1);
for (i = 0; i < memlist->n_members; ++i)
{
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..33a831e768 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -558,6 +558,18 @@
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
+
+/* Define to 1 if you have AVX512. */
+#undef HAVE__AVX512_POPCNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 477e00e0da..ead4f79def 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -300,9 +300,9 @@ pg_ceil_log2_64(uint64 num)
#ifdef TRY_POPCNT_FAST
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
-extern int pg_popcount32_slow(uint32 word);
-extern int pg_popcount64_slow(uint64 word);
-extern uint64 pg_popcount_slow(const char *buf, int bytes);
+extern inline int pg_popcount32_slow(uint32 word);
+extern inline int pg_popcount64_slow(uint64 word);
+extern inline uint64 pg_popcount_slow(const char *buf, int bytes);
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..ee3647282e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,6 +100,7 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index 1499985dfc..66ef151565 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -89,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer processors can use AVX-512 POPCNT Capabilities
+pg_popcount_x86_64_accel.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_x86_64_accel_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_x86_64_accel_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index cf6e9fa06c..0647e7a4f7 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,7 +8,6 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_popcount_x86_64_choose.c',
- 'pg_popcount_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -86,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_x86_64_accel', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -100,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_popcount_x86_64_accel.c b/src/port/pg_popcount_x86_64_accel.c
index d63e8aa30f..f65184ddce 100644
--- a/src/port/pg_popcount_x86_64_accel.c
+++ b/src/port/pg_popcount_x86_64_accel.c
@@ -13,10 +13,15 @@
#include "c.h"
#include "port/pg_bitutils.h"
+#if defined(HAVE__IMMINTRIN)
+#include <immintrin.h>
+#endif
+
#ifdef TRY_POPCNT_FAST
int pg_popcount32_fast(uint32 word);
int pg_popcount64_fast(uint64 word);
uint64 pg_popcount_fast(const char *buf, int bytes);
+uint64 pg_popcount512_fast(const char *buf, int bytes);
/*
* pg_popcount32_fast
@@ -98,4 +103,32 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * Use AVX-512 intrinsics on supported CPUs, or fall back to the non-512 fast
+ * implementation and use the best 64-bit fast methods. If no fast
+ * methods are used this will fall back to __builtin_* or pure software.
+ */
+uint64
+pg_popcount512_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+ #if defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1
+ __m512i accumulator = _mm512_setzero_si512();
+
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+#endif /* defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1 */
+
+ /* Process any remaining bytes */
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_x86_64_choose.c b/src/port/pg_popcount_x86_64_choose.c
index 1a0022a0b3..d25d251e9b 100644
--- a/src/port/pg_popcount_x86_64_choose.c
+++ b/src/port/pg_popcount_x86_64_choose.c
@@ -25,6 +25,7 @@
int pg_popcount32_fast(uint32 word);
int pg_popcount64_fast(uint64 word);
uint64 pg_popcount_fast(const char *buf, int bytes);
+uint64 pg_popcount512_fast(const char *buf, int bytes);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
@@ -53,6 +54,52 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is
+ * available. This is similar to the method above; see
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally, we make sure the xgetbv result is consistent with the CPUID
+ * results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* Check for AVX512VPOPCNTDQ and AVX512F */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0)
+ {
+ /*
+ * CPUID succeeded, does the current running OS support the
+ * ZMM registers which are required for AVX512? This check is
+ * required to make sure an old OS on a new CPU is correctly
+ * checked or a VM hypervisor is not excluding AVX512 ZMM
+ * support in the VM; see "5.1.9 Detection of AVX Instructions"
+ * https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
+ */
+ uint64 xcr = 0;
+#ifdef _MSC_VER
+ uint64 highlow = _xgetbv(xcr);
+
+ return (highlow & 0xE0) == 0xE0;
+#else
+ uint32 high;
+ uint32 low;
+
+ __asm__ __volatile__("xgetbv\t\n" : "=a"(low), "=d"(high) : "c"(xcr));
+ return (low & 0xE0) == 0xE0;
+#endif
+ } /* POPCNT 512 */
+ return false;
+}
+
/*
* These functions get called on the first call to pg_popcount32 etc.
* They detect whether we can use the asm implementations, and replace
@@ -61,17 +108,26 @@ pg_popcount_available(void)
*/
static inline void set_function_pointers()
{
- if (pg_popcount_available())
- {
+ if (pg_popcount512_available())
+ { /* If POPCNT512 is available, it is assumed that POPCNTQ is too. */
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
+ pg_popcount = pg_popcount512_fast;
}
else
{
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
+ }
}
}
--
2.34.1
On Wed, 20 Mar 2024 at 11:56, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
Changed in this patch set.
* Rebased.
* Direct *slow* calls via macros as shown in example patch.
* Changed the choose filename to be platform specific as suggested.
* Falls back to intermediate "Fast" methods if AVX512 is not available at runtime.
* inline used where it makes sense; remember that using "extern" negates "inline".
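The runtime fallback described in the list above (AVX512, then the intermediate "fast" methods, then the portable code) can be sketched roughly as follows. This is only an illustration under stated assumptions: every `demo_*` name is hypothetical, the two feature probes are stubs that always return false, and the "fast" and "512" variants simply reuse the portable loop so that the sketch compiles on any platform. The real patch uses CPUID/XGETBV checks and hardware instructions instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* All demo_* names are hypothetical; they are not part of the patch. */

/* Portable byte-at-a-time count, standing in for the "slow" path. */
static uint64_t
demo_popcount_portable(const char *buf, int bytes)
{
	uint64_t	count = 0;

	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		while (b != 0)
		{
			b &= (unsigned char) (b - 1);	/* clear the lowest set bit */
			count++;
		}
	}
	return count;
}

/* Stand-ins for the POPCNT and AVX-512 variants (assumption: same result). */
static uint64_t
demo_popcount_fast(const char *buf, int bytes)
{
	return demo_popcount_portable(buf, bytes);
}

static uint64_t
demo_popcount512(const char *buf, int bytes)
{
	return demo_popcount_portable(buf, bytes);
}

/* Stub feature probes; a real build would query CPUID and XGETBV here. */
static bool
demo_have_avx512_popcnt(void)
{
	return false;
}

static bool
demo_have_popcnt(void)
{
	return false;
}

static uint64_t demo_popcount_choose(const char *buf, int bytes);

/* The pointer starts at the chooser, which replaces it on the first call. */
uint64_t	(*demo_popcount) (const char *buf, int bytes) = demo_popcount_choose;

static uint64_t
demo_popcount_choose(const char *buf, int bytes)
{
	if (demo_have_avx512_popcnt())
		demo_popcount = demo_popcount512;
	else if (demo_have_popcnt())
		demo_popcount = demo_popcount_fast;
	else
		demo_popcount = demo_popcount_portable;

	return demo_popcount(buf, bytes);
}
```

Callers only ever go through `demo_popcount(buf, bytes)`; after the first call the pointer no longer refers to the chooser, so the probe cost is paid once.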
I'm not sure about this "extern negates inline" comment. It seems to
me the compiler is perfectly free to inline a static function into an
external function and it's free to inline the static function
elsewhere within the same .c file.
The final sentence of the following comment that the 0001 patch
removes explains this:
/*
* When the POPCNT instruction is not available, there's no point in using
* function pointers to vary the implementation between the fast and slow
* method. We instead just make these actual external functions when
* TRY_POPCNT_FAST is not defined. The compiler should be able to inline
* the slow versions here.
*/
Also, have a look at [1] (https://godbolt.org/z/6joExb79d). You'll see f_slow() wasn't even compiled
and the code was just inlined into f(). I just added the
__attribute__((noinline)) so that usage() wouldn't just perform
constant folding and just return 6.
I think, unless you have evidence that some common compiler isn't
inlining the static into the extern, we shouldn't add the macros.
They add quite a bit of churn to the patch and will break out-of-core
code, as you no longer have functions named pg_popcount32(),
pg_popcount64() and pg_popcount().
David
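The arrangement discussed in the message above, a static helper wrapped by an external function, might look like the sketch below. It is a guess at the general shape of the linked example, not a copy of it: the names `f_slow()` and `f()` are borrowed from the message, while the bit-counting body is an assumption chosen only to make the sketch self-contained. Whether a given compiler actually inlines `f_slow()` into `f()` depends on the compiler and optimization level.

```c
#include <stdint.h>

/*
 * File-local helper. Because it is static, the compiler may inline it into
 * every caller in this file and then omit a standalone copy entirely.
 */
static inline int
f_slow(uint32_t word)
{
	/* Classic parallel bit count (assumed body, for illustration only). */
	word = word - ((word >> 1) & 0x55555555u);
	word = (word & 0x33333333u) + ((word >> 2) & 0x33333333u);
	word = (word + (word >> 4)) & 0x0F0F0F0Fu;
	return (int) ((word * 0x01010101u) >> 24);
}

/*
 * External function with a real symbol. Other files call this by name;
 * the body of f_slow() can still be inlined here.
 */
int
f(uint32_t word)
{
	return f_slow(word);
}
```

Compiling this with optimization enabled and inspecting the assembly is one way to check whether a separate `f_slow` symbol is emitted on a particular toolchain.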
-----Original Message-----
From: David Rowley <dgrowleyml@gmail.com>
Sent: Wednesday, March 20, 2024 5:28 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Nathan Bossart <nathandbossart@gmail.com>; Andres Freund
I'm not sure about this "extern negates inline" comment. It seems to me the
compiler is perfectly free to inline a static function into an external function
and it's free to inline the static function elsewhere within the same .c file.
The final sentence of the following comment that the 0001 patch removes
explains this:
/*
* When the POPCNT instruction is not available, there's no point in using
* function pointers to vary the implementation between the fast and slow
* method. We instead just make these actual external functions when
* TRY_POPCNT_FAST is not defined. The compiler should be able to inline
* the slow versions here.
*/
Also, have a look at [1]. You'll see f_slow() wasn't even compiled and the code
was just inlined into f(). I just added the
__attribute__((noinline)) so that usage() wouldn't just perform constant
folding and just return 6.
I think, unless you have evidence that some common compiler isn't inlining the
static into the extern then we shouldn't add the macros.
It adds quite a bit of churn to the patch and will break out of core code as you
no longer have functions named pg_popcount32(),
pg_popcount64() and pg_popcount().
This may be a simple misunderstanding: extern != static. If I use the "extern" keyword, then a symbol *will* be generated and inline will be ignored. This is NOT true of "static inline", where the compiler will try to inline the method. :)
In this patch set:
* I removed the macro implementation.
* Marked everything that could possibly be inlined with the "static inline" keyword.
* Conditionally made the *_slow() functions "static inline" when TRY_POPCNT_FAST is not set.
* Found and fixed some whitespace errors in the AVX code implementation.
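The conditional "static inline" mentioned in the list above could be arranged along these lines. This is a minimal sketch, not the patch's code: `DEMO_TRY_POPCNT_FAST`, `DEMO_INLINE`, and the `demo_*` functions are hypothetical stand-ins for `TRY_POPCNT_FAST`, the `INLINE` macro, and the `*_slow()` functions, and the counting loop is an assumed body.

```c
#include <stdint.h>

/*
 * When the (hypothetical) fast path is compiled in, the slow function must
 * be a real external symbol, because another file stores its address in a
 * function pointer. Otherwise it can be file-local and inlinable.
 */
#ifdef DEMO_TRY_POPCNT_FAST
#define DEMO_INLINE
#else
#define DEMO_INLINE static inline
#endif

/* Forward declaration picks up whichever linkage was selected above. */
DEMO_INLINE int demo_popcount64_slow(uint64_t word);

DEMO_INLINE int
demo_popcount64_slow(uint64_t word)
{
	int			result = 0;

	while (word != 0)
	{
		word &= word - 1;		/* clear the lowest set bit */
		result++;
	}
	return result;
}

#ifndef DEMO_TRY_POPCNT_FAST
/* Without the fast path, export a plain function that wraps the slow one. */
int
demo_popcount64(uint64_t word)
{
	return demo_popcount64_slow(word);
}
#endif
```

Built without `-DDEMO_TRY_POPCNT_FAST`, only `demo_popcount64` needs to be visible to other files; built with it, `demo_popcount64_slow` becomes the external symbol instead.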
Thanks,
Paul
Attachments:
v12-0001-Refactor-Split-pg_popcount-functions-into-multiple-f.patch (application/octet-stream)
From ba867ed01a1a25b2603eeda13a44e94a0a61648e Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Thu, 21 Mar 2024 11:19:23 -0700
Subject: [PATCH 1/2] [Refactor] Split pg_popcount functions into multiple
files.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
src/include/port/pg_bitutils.h | 6 +-
src/port/Makefile | 2 +
src/port/meson.build | 2 +
src/port/pg_bitutils.c | 232 +++------------------------
src/port/pg_popcount_x86_64_accel.c | 134 ++++++++++++++++
src/port/pg_popcount_x86_64_choose.c | 158 ++++++++++++++++++
6 files changed, 324 insertions(+), 210 deletions(-)
create mode 100644 src/port/pg_popcount_x86_64_accel.c
create mode 100644 src/port/pg_popcount_x86_64_choose.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..26f6a48377 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -305,11 +305,13 @@ extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
#else
-/* Use a portable implementation -- no need for a function pointer. */
+/*
+ * Use a portable implementation -- no need for a function pointer. Use
+ * inlining for small speed increase.
+ */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount(const char *buf, int bytes);
-
#endif /* TRY_POPCNT_FAST */
/*
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..1499985dfc 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,8 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_x86_64_choose.o \
+ pg_popcount_x86_64_accel.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..cf6e9fa06c 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,8 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_x86_64_choose.c',
+ 'pg_popcount_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..21a4d0ca97 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -21,7 +21,6 @@
#include "port/pg_bitutils.h"
-
/*
* Array giving the position of the left-most set bit for each possible
* byte value. We count the right-most position as the 0th bit, and the
@@ -103,196 +102,46 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static inline int pg_popcount32_slow(uint32 word);
-static inline int pg_popcount64_slow(uint64 word);
-static uint64 pg_popcount_slow(const char *buf, int bytes);
-
-#ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static uint64 pg_popcount_choose(const char *buf, int bytes);
-static inline int pg_popcount32_fast(uint32 word);
-static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
-
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
-#endif /* TRY_POPCNT_FAST */
-
-#ifdef TRY_POPCNT_FAST
-
+#ifndef TRY_POPCNT_FAST
/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
+ * Optimize function signature if using the slow functions.
*/
-static bool
-pg_popcount_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
+#define INLINE static inline
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
#else
-#error cpuid instruction not available
-#endif
-
- return (exx[2] & (1 << 23)) != 0; /* POPCNT */
-}
-
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
- return pg_popcount64(word);
-}
-
-static uint64
-pg_popcount_choose(const char *buf, int bytes)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
+#define INLINE
+#endif /* !TRY_POPCNT_FAST */
- return pg_popcount(buf, bytes);
-}
+/* Forward References */
+INLINE int pg_popcount32_slow(uint32 word);
+INLINE int pg_popcount64_slow(uint64 word);
+INLINE uint64 pg_popcount_slow(const char *buf, int bytes);
-/*
- * pg_popcount32_fast
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount32_fast(uint32 word)
+#ifndef TRY_POPCNT_FAST
+/* Slow function definitions for exported functions. */
+int
+pg_popcount32(uint32 word)
{
-#ifdef _MSC_VER
- return __popcnt(word);
-#else
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
+ return pg_popcount32_slow(word);
}
-/*
- * pg_popcount64_fast
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount64_fast(uint64 word)
+int
+pg_popcount64(uint64 word)
{
-#ifdef _MSC_VER
- return __popcnt64(word);
-#else
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
+ return pg_popcount64_slow(word);
}
-
-/*
- * pg_popcount_fast
- * Returns the number of 1-bits in buf
- */
-static uint64
-pg_popcount_fast(const char *buf, int bytes)
+uint64
+pg_popcount(const char *buf, int bytes)
{
- uint64 popcnt = 0;
-
-#if SIZEOF_VOID_P >= 8
- /* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
- {
- const uint64 *words = (const uint64 *) buf;
-
- while (bytes >= 8)
- {
- popcnt += pg_popcount64_fast(*words++);
- bytes -= 8;
- }
-
- buf = (const char *) words;
- }
-#else
- /* Process in 32-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(4, buf))
- {
- const uint32 *words = (const uint32 *) buf;
-
- while (bytes >= 4)
- {
- popcnt += pg_popcount32_fast(*words++);
- bytes -= 4;
- }
-
- buf = (const char *) words;
- }
-#endif
-
- /* Process any remaining bytes */
- while (bytes--)
- popcnt += pg_number_of_ones[(unsigned char) *buf++];
-
- return popcnt;
+ return pg_popcount_slow(buf, bytes);
}
-
-#endif /* TRY_POPCNT_FAST */
-
+#endif /* !TRY_POPCNT_FAST */
/*
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static inline int
+INLINE int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -314,7 +163,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static inline int
+INLINE int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -342,7 +191,7 @@ pg_popcount64_slow(uint64 word)
* pg_popcount_slow
* Returns the number of 1-bits in buf
*/
-static uint64
+INLINE uint64
pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -383,36 +232,3 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
-
-#ifndef TRY_POPCNT_FAST
-
-/*
- * When the POPCNT instruction is not available, there's no point in using
- * function pointers to vary the implementation between the fast and slow
- * method. We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
- * the slow versions here.
- */
-int
-pg_popcount32(uint32 word)
-{
- return pg_popcount32_slow(word);
-}
-
-int
-pg_popcount64(uint64 word)
-{
- return pg_popcount64_slow(word);
-}
-
-/*
- * pg_popcount
- * Returns the number of 1-bits in buf
- */
-uint64
-pg_popcount(const char *buf, int bytes)
-{
- return pg_popcount_slow(buf, bytes);
-}
-
-#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_x86_64_accel.c b/src/port/pg_popcount_x86_64_accel.c
new file mode 100644
index 0000000000..d5500d56e7
--- /dev/null
+++ b/src/port/pg_popcount_x86_64_accel.c
@@ -0,0 +1,134 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_x86_64_accel.c
+ * Miscellaneous functions for bit-wise operations.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_x86_64_accel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#if defined(HAVE__IMMINTRIN)
+#include <immintrin.h>
+#endif
+
+#ifdef TRY_POPCNT_FAST
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+uint64 pg_popcount_fast(const char *buf, int bytes);
+uint64 pg_popcount512_fast(const char *buf, int bytes);
+
+/*
+ * pg_popcount32_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount32_fast(uint32 word)
+{
+#ifdef _MSC_VER
+ return __popcnt(word);
+#else
+ uint32 res;
+
+__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount64_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount64_fast(uint64 word)
+{
+#ifdef _MSC_VER
+ return __popcnt64(word);
+#else
+ uint64 res;
+
+__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount_fast
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
+/*
+ * Use AVX-512 Intrinsics for supported CPUs or fall back to the non-512 fast
+ * implementation and use the best 64 bit fast methods. If no fast
+ * methods are used this will fall back to __builtin_* or pure software.
+ */
+uint64
+pg_popcount512_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+ #if defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1
+ __m512i accumulator = _mm512_setzero_si512();
+
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+#endif /* defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1 */
+
+ /* Process any remaining bytes */
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_x86_64_choose.c b/src/port/pg_popcount_x86_64_choose.c
new file mode 100644
index 0000000000..e73d1999ad
--- /dev/null
+++ b/src/port/pg_popcount_x86_64_choose.c
@@ -0,0 +1,158 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_x86_64_choose.c
+ * Miscellaneous functions for bit-wise operations.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_x86_64_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+static bool pg_popcount_available(void);
+static int pg_popcount32_choose(uint32 word);
+static int pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+extern int pg_popcount32_fast(uint32 word);
+extern int pg_popcount64_fast(uint64 word);
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+extern uint64 pg_popcount512_fast(const char *buf, int bytes);
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
+
+int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+static bool
+pg_popcount_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (exx[2] & (1 << 23)) != 0; /* POPCNT */
+}
+
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is
+ * available. This is similar to the method above; see
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally, we make sure the xgetbv result is consistent with the CPUID
+ * results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* Check for AVX512VPOPCNTDQ and AVX512F */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0)
+ {
+ /*
+ * CPUID reports the instructions; now confirm that the running OS
+ * has enabled the ZMM register state that AVX512 requires. This
+ * catches an old OS on a new CPU, and a VM hypervisor that masks
+ * AVX512 ZMM support in the guest; see "5.1.9 Detection of AVX
+ * Instructions"
+ * https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
+ */
+ uint64 xcr = 0;
+#ifdef _MSC_VER
+ uint64 highlow = _xgetbv(xcr);
+
+ return (highlow & 0xE0) == 0xE0;
+#else
+ uint32 high;
+ uint32 low;
+
+ __asm__ __volatile__("xgetbv\t\n" : "=a"(low), "=d"(high) : "c"(xcr));
+ return (low & 0xE0) == 0xE0;
+#endif
+ } /* POPCNT 512 */
+ return false;
+}
+
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static inline void set_function_pointers(void)
+{
+ if (pg_popcount512_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount512_fast;
+ }
+ else
+ {
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
+ }
+ }
+}
+
+static inline int
+pg_popcount32_choose(uint32 word)
+{
+ set_function_pointers();
+ return pg_popcount32(word);
+}
+
+static inline int
+pg_popcount64_choose(uint64 word)
+{
+ set_function_pointers();
+ return pg_popcount64(word);
+}
+
+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ set_function_pointers();
+ return pg_popcount(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
--
2.34.1
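For readers skimming the refactor, the choose-on-first-call pattern that pg_popcount_x86_64_choose.c relies on can be sketched on its own. This is a minimal illustration only, not code from the patch: every name below (popcount_ptr, popcount_choose, popcount_wide, popcount_portable, cpu_has_wide_path) is hypothetical, and the "wide" path is a plain 8-byte loop standing in for an accelerated implementation. The CPU probe is a stub that always answers yes; a real probe would consult CPUID.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is a sketch, not the patch's code. */
typedef uint64_t (*popcount_fn) (const unsigned char *buf, size_t len);

/* Byte-at-a-time fallback: clear the lowest set bit until none remain. */
static uint64_t
popcount_portable(const unsigned char *buf, size_t len)
{
	uint64_t	n = 0;

	while (len--)
	{
		unsigned char b = *buf++;

		while (b)
		{
			b &= (unsigned char) (b - 1);
			n++;
		}
	}
	return n;
}

/* Stand-in for an accelerated path: 8-byte chunks, then the byte tail. */
static uint64_t
popcount_wide(const unsigned char *buf, size_t len)
{
	uint64_t	n = 0;

	while (len >= 8)
	{
		uint64_t	w;

		memcpy(&w, buf, 8);		/* no alignment requirement on buf */
		while (w)
		{
			w &= w - 1;
			n++;
		}
		buf += 8;
		len -= 8;
	}
	return n + popcount_portable(buf, len);
}

/* Assumption: a real build would query CPUID here; this stub says yes. */
static int
cpu_has_wide_path(void)
{
	return 1;
}

static uint64_t popcount_choose(const unsigned char *buf, size_t len);

/* Callers always go through the pointer; it starts at the chooser. */
static popcount_fn popcount_ptr = popcount_choose;

/* Runs once: pick an implementation, repoint, then forward this call. */
static uint64_t
popcount_choose(const unsigned char *buf, size_t len)
{
	popcount_ptr = cpu_has_wide_path() ? popcount_wide : popcount_portable;
	return popcount_ptr(buf, len);
}
```

After the first call through popcount_ptr, later calls reach the selected implementation directly, so the detection cost is paid once per process.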
v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch (application/octet-stream)
From 86415641d953ef7bc36fe91f75ce9c69878cba63 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Thu, 21 Mar 2024 11:39:37 -0700
Subject: [PATCH 2/2] [Feature] Added AVX-512 acceleration to the pg_popcount
function.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
config/c-compiler.m4 | 37 ++++++
configure | 205 +++++++++++++++++++++++++++++++++
configure.ac | 44 +++++++
meson.build | 72 ++++++++++++
src/Makefile.global.in | 1 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 4 +-
src/makefiles/meson.build | 1 +
src/port/Makefile | 5 +
src/port/meson.build | 6 +-
10 files changed, 382 insertions(+), 5 deletions(-)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..94e3e713aa 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,40 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT intrinsics that
+# correspond to the CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>],
+ [const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..7253cb8154 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17474,175 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __immintrin" >&5
+$as_echo_n "checking for __immintrin... " >&6; }
+if ${pgac_cv__immintrin+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+/* Don't exclude code so added return. */
+ return 1701;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__immintrin="yes"
+else
+ pgac_cv__immintrin="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__immintrin" >&5
+$as_echo "$pgac_cv__immintrin" >&6; }
+if test x"$pgac_cv__immintrin" = x"yes"; then
+
+$as_echo "#define HAVE__IMMINTRIN 1" >>confdefs.h
+
+fi
+
+# Check for AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+
+$as_echo "#define HAVE__AVX512_POPCNT 1" >>confdefs.h
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..06e8eaa6d0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2075,38 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+AC_CACHE_CHECK([for __immintrin], [pgac_cv__immintrin],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [[/* Don't exclude code so added return. */
+ return 1701;
+ ]])],
+ [pgac_cv__immintrin="yes"],
+ [pgac_cv__immintrin="no"])])
+if test x"$pgac_cv__immintrin" = x"yes"; then
+ AC_DEFINE(HAVE__IMMINTRIN, 1, [Define to 1 if you have immintrin.])
+fi
+
+# Check for AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ AC_DEFINE(HAVE__AVX512_POPCNT, 1, [Define to 1 if you have cpu
+ support for AVX512 POPCNT.])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index c8fdfeb0ec..cd508096e5 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,37 @@ elif cc.links('''
endif
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.has_header('immintrin.h',
+ include_directories: postgres_inc, args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1,
+ description: 'Define to 1 if you have the immintrin.h header file.')
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -2158,6 +2189,47 @@ endif
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ test_flags = ['-mavx512vpopcntdq']
+ if host_system == 'windows'
+ test_flags = ['/arch:AVX512']
+ endif
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ #include <stdlib.h>
+ #include <string.h>
+ int main(void)
+ {
+ const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ }'''
+ if cc.links(prog, name: '_mm512_* methods with -mavx512vpopcntdq flag.',
+ args: test_c_args + test_flags)
+ have_avx512_popcnt = true
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cdata.set('HAVE__AVX512_POPCNT', 1)
+ cflags_avx512_popcnt = test_flags
+ else
+ have_avx512_popcnt = false
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt = []
+ endif # compile/link test
+endif # host_cpu check
+
###############################################################
# Library / OS tests
###############################################################
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..33a831e768 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -558,6 +558,18 @@
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
+
+/* Define to 1 if you have AVX512. */
+#undef HAVE__AVX512_POPCNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 26f6a48377..8215316b0e 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -306,8 +306,8 @@ extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
#else
/*
- * Use a portable implementation -- no need for a function pointer. Use
- * inlining for small speed increase.
+ * Use a portable implementation -- no need for a function pointer. Use
+ * inlining for small speed increase.
*/
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..ee3647282e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,6 +100,7 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index 1499985dfc..92bd419953 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -89,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer processors can use AVX-512 POPCNT Capabilities
+pg_popcount_x86_64_accel.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_x86_64_accel_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_x86_64_accel_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index cf6e9fa06c..0647e7a4f7 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,7 +8,6 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_popcount_x86_64_choose.c',
- 'pg_popcount_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -86,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_x86_64_accel', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -100,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512': []}
foreach f : replace_funcs_neg
func = f.get(0)
--
2.34.1
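As a companion to the runtime check in pg_popcount512_available(), here is a hedged sketch of the decision logic written as a pure function over register values, so it can be reasoned about without executing CPUID or XGETBV. The function and macro names are hypothetical and not part of the patch. The bit positions are the architectural ones: CPUID.1:ECX bit 27 (OSXSAVE), CPUID.(EAX=7,ECX=0):EBX bit 16 (AVX512F), ECX bit 14 (AVX512VPOPCNTDQ), and XCR0 bits 1, 2, 5, 6, 7 (SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM). This variant is stricter than a test on the 0xE0 mask alone: it also requires OSXSAVE and the SSE/AVX state bits, which is an assumption about what a careful check would want rather than something the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the CPUID/XCR0 layout. */
#define CPUID1_ECX_OSXSAVE			(UINT32_C(1) << 27)
#define CPUID7_EBX_AVX512F			(UINT32_C(1) << 16)
#define CPUID7_ECX_AVX512VPOPCNTDQ	(UINT32_C(1) << 14)

/* XCR0 bits 1,2 (SSE, AVX) plus 5,6,7 (opmask, ZMM_Hi256, Hi16_ZMM). */
#define XCR0_AVX512_STATE			UINT64_C(0xE6)

/*
 * Decide whether the AVX-512 popcount path may be used, given register
 * values the caller has already collected.  Note that a caller must not
 * execute XGETBV unless OSXSAVE is set; when it is clear, pass xcr0 = 0.
 */
static bool
avx512_popcnt_usable(uint32_t leaf1_ecx, uint32_t leaf7_ebx,
					 uint32_t leaf7_ecx, uint64_t xcr0)
{
	/* Without OSXSAVE the OS has not enabled XGETBV/XSAVE at all. */
	if ((leaf1_ecx & CPUID1_ECX_OSXSAVE) == 0)
		return false;

	/* Both instruction-set features must be reported by the CPU. */
	if ((leaf7_ebx & CPUID7_EBX_AVX512F) == 0 ||
		(leaf7_ecx & CPUID7_ECX_AVX512VPOPCNTDQ) == 0)
		return false;

	/* Every required state component must be enabled, not just one. */
	return (xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE;
}
```

Comparing against the full mask matters: an XCR0 value such as 0x27 has only the opmask bit of the three AVX-512 state bits set, so a "any bit set" test would accept it while the comparison above rejects it.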
-----Original Message-----
From: Amonson, Paul D <paul.d.amonson@intel.com>
Sent: Thursday, March 21, 2024 12:18 PM
To: David Rowley <dgrowleyml@gmail.com>
Cc: Nathan Bossart <nathandbossart@gmail.com>; Andres Freund
I am re-posting the patches as CI for Mac failed (CI error not code/test error). The patches are the same as last time.
Thanks,
Paul
Attachments:
v12-0001-Refactor-Split-pg_popcount-functions-into-multiple-f.patch (application/octet-stream)
From ba867ed01a1a25b2603eeda13a44e94a0a61648e Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Thu, 21 Mar 2024 11:19:23 -0700
Subject: [PATCH 1/2] [Refactor] Split pg_popcount functions into multiple
files.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
src/include/port/pg_bitutils.h | 6 +-
src/port/Makefile | 2 +
src/port/meson.build | 2 +
src/port/pg_bitutils.c | 232 +++------------------------
src/port/pg_popcount_x86_64_accel.c | 134 ++++++++++++++++
src/port/pg_popcount_x86_64_choose.c | 158 ++++++++++++++++++
6 files changed, 324 insertions(+), 210 deletions(-)
create mode 100644 src/port/pg_popcount_x86_64_accel.c
create mode 100644 src/port/pg_popcount_x86_64_choose.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..26f6a48377 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -305,11 +305,13 @@ extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
#else
-/* Use a portable implementation -- no need for a function pointer. */
+/*
+ * Use a portable implementation -- no need for a function pointer. Use
+ * inlining for small speed increase.
+ */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount(const char *buf, int bytes);
-
#endif /* TRY_POPCNT_FAST */
/*
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..1499985dfc 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,8 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_x86_64_choose.o \
+ pg_popcount_x86_64_accel.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..cf6e9fa06c 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,8 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_x86_64_choose.c',
+ 'pg_popcount_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..21a4d0ca97 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -21,7 +21,6 @@
#include "port/pg_bitutils.h"
-
/*
* Array giving the position of the left-most set bit for each possible
* byte value. We count the right-most position as the 0th bit, and the
@@ -103,196 +102,46 @@ const uint8 pg_number_of_ones[256] = {
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
-static inline int pg_popcount32_slow(uint32 word);
-static inline int pg_popcount64_slow(uint64 word);
-static uint64 pg_popcount_slow(const char *buf, int bytes);
-
-#ifdef TRY_POPCNT_FAST
-static bool pg_popcount_available(void);
-static int pg_popcount32_choose(uint32 word);
-static int pg_popcount64_choose(uint64 word);
-static uint64 pg_popcount_choose(const char *buf, int bytes);
-static inline int pg_popcount32_fast(uint32 word);
-static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
-
-int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
-int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
-#endif /* TRY_POPCNT_FAST */
-
-#ifdef TRY_POPCNT_FAST
-
+#ifndef TRY_POPCNT_FAST
/*
- * Return true if CPUID indicates that the POPCNT instruction is available.
+ * Optimize function signature if using the slow functions.
*/
-static bool
-pg_popcount_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
+#define INLINE static inline
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
#else
-#error cpuid instruction not available
-#endif
-
- return (exx[2] & (1 << 23)) != 0; /* POPCNT */
-}
-
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
- return pg_popcount32(word);
-}
-
-static int
-pg_popcount64_choose(uint64 word)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
- return pg_popcount64(word);
-}
-
-static uint64
-pg_popcount_choose(const char *buf, int bytes)
-{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
+#define INLINE
+#endif /* !TRY_POPCNT_FAST */
- return pg_popcount(buf, bytes);
-}
+/* Forward References */
+INLINE int pg_popcount32_slow(uint32 word);
+INLINE int pg_popcount64_slow(uint64 word);
+INLINE uint64 pg_popcount_slow(const char *buf, int bytes);
-/*
- * pg_popcount32_fast
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount32_fast(uint32 word)
+#ifndef TRY_POPCNT_FAST
+/* Slow function definitions for exported functions. */
+int
+pg_popcount32(uint32 word)
{
-#ifdef _MSC_VER
- return __popcnt(word);
-#else
- uint32 res;
-
-__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
+ return pg_popcount32_slow(word);
}
-/*
- * pg_popcount64_fast
- * Return the number of 1 bits set in word
- */
-static inline int
-pg_popcount64_fast(uint64 word)
+int
+pg_popcount64(uint64 word)
{
-#ifdef _MSC_VER
- return __popcnt64(word);
-#else
- uint64 res;
-
-__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
- return (int) res;
-#endif
+ return pg_popcount64_slow(word);
}
-
-/*
- * pg_popcount_fast
- * Returns the number of 1-bits in buf
- */
-static uint64
-pg_popcount_fast(const char *buf, int bytes)
+uint64
+pg_popcount(const char *buf, int bytes)
{
- uint64 popcnt = 0;
-
-#if SIZEOF_VOID_P >= 8
- /* Process in 64-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(8, buf))
- {
- const uint64 *words = (const uint64 *) buf;
-
- while (bytes >= 8)
- {
- popcnt += pg_popcount64_fast(*words++);
- bytes -= 8;
- }
-
- buf = (const char *) words;
- }
-#else
- /* Process in 32-bit chunks if the buffer is aligned. */
- if (buf == (const char *) TYPEALIGN(4, buf))
- {
- const uint32 *words = (const uint32 *) buf;
-
- while (bytes >= 4)
- {
- popcnt += pg_popcount32_fast(*words++);
- bytes -= 4;
- }
-
- buf = (const char *) words;
- }
-#endif
-
- /* Process any remaining bytes */
- while (bytes--)
- popcnt += pg_number_of_ones[(unsigned char) *buf++];
-
- return popcnt;
+ return pg_popcount_slow(buf, bytes);
}
-
-#endif /* TRY_POPCNT_FAST */
-
+#endif /* !TRY_POPCNT_FAST */
/*
* pg_popcount32_slow
* Return the number of 1 bits set in word
*/
-static inline int
+INLINE int
pg_popcount32_slow(uint32 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -314,7 +163,7 @@ pg_popcount32_slow(uint32 word)
* pg_popcount64_slow
* Return the number of 1 bits set in word
*/
-static inline int
+INLINE int
pg_popcount64_slow(uint64 word)
{
#ifdef HAVE__BUILTIN_POPCOUNT
@@ -342,7 +191,7 @@ pg_popcount64_slow(uint64 word)
* pg_popcount_slow
* Returns the number of 1-bits in buf
*/
-static uint64
+INLINE uint64
pg_popcount_slow(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -383,36 +232,3 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
-
-#ifndef TRY_POPCNT_FAST
-
-/*
- * When the POPCNT instruction is not available, there's no point in using
- * function pointers to vary the implementation between the fast and slow
- * method. We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined. The compiler should be able to inline
- * the slow versions here.
- */
-int
-pg_popcount32(uint32 word)
-{
- return pg_popcount32_slow(word);
-}
-
-int
-pg_popcount64(uint64 word)
-{
- return pg_popcount64_slow(word);
-}
-
-/*
- * pg_popcount
- * Returns the number of 1-bits in buf
- */
-uint64
-pg_popcount(const char *buf, int bytes)
-{
- return pg_popcount_slow(buf, bytes);
-}
-
-#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_x86_64_accel.c b/src/port/pg_popcount_x86_64_accel.c
new file mode 100644
index 0000000000..d5500d56e7
--- /dev/null
+++ b/src/port/pg_popcount_x86_64_accel.c
@@ -0,0 +1,134 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_x86_64_accel.c
+ * Fast (POPCNT and AVX-512) popcount implementations for x86_64.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_x86_64_accel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#if defined(HAVE__IMMINTRIN)
+#include <immintrin.h>
+#endif
+
+#ifdef TRY_POPCNT_FAST
+int pg_popcount32_fast(uint32 word);
+int pg_popcount64_fast(uint64 word);
+uint64 pg_popcount_fast(const char *buf, int bytes);
+uint64 pg_popcount512_fast(const char *buf, int bytes);
+
+/*
+ * pg_popcount32_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount32_fast(uint32 word)
+{
+#ifdef _MSC_VER
+ return __popcnt(word);
+#else
+ uint32 res;
+
+__asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount64_fast
+ * Return the number of 1 bits set in word
+ */
+int
+pg_popcount64_fast(uint64 word)
+{
+#ifdef _MSC_VER
+ return __popcnt64(word);
+#else
+ uint64 res;
+
+__asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
+ return (int) res;
+#endif
+}
+
+/*
+ * pg_popcount_fast
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+ return popcnt;
+}
+
+/*
+ * Use AVX-512 intrinsics on supported CPUs, or fall back to the non-512 fast
+ * implementation and use the best 64-bit fast methods. If no fast
+ * methods are used this will fall back to __builtin_* or pure software.
+ */
+uint64
+pg_popcount512_fast(const char *buf, int bytes)
+{
+ uint64 popcnt = 0;
+ #if defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1
+ __m512i accumulator = _mm512_setzero_si512();
+
+ while (bytes >= 64)
+ {
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+
+ accumulator = _mm512_add_epi64(accumulator, p);
+ bytes -= 64;
+ buf += 64;
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+#endif /* defined(HAVE__IMMINTRIN) && HAVE__AVX512_POPCNT == 1 */
+
+ /* Process any remaining bytes */
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_x86_64_choose.c b/src/port/pg_popcount_x86_64_choose.c
new file mode 100644
index 0000000000..e73d1999ad
--- /dev/null
+++ b/src/port/pg_popcount_x86_64_choose.c
@@ -0,0 +1,158 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_x86_64_choose.c
+ * Runtime selection of the popcount implementation on x86_64.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_x86_64_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_FAST
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+static bool pg_popcount_available(void);
+static int pg_popcount32_choose(uint32 word);
+static int pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+extern int pg_popcount32_fast(uint32 word);
+extern int pg_popcount64_fast(uint64 word);
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+extern uint64 pg_popcount512_fast(const char *buf, int bytes);
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount64_slow(uint64 word);
+extern uint64 pg_popcount_slow(const char *buf, int bytes);
+
+int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
+int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+
+/*
+ * Return true if CPUID indicates that the POPCNT instruction is available.
+ */
+static bool
+pg_popcount_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (exx[2] & (1 << 23)) != 0; /* POPCNT */
+}
+
+/*
+ * Return true if CPUID indicates that the AVX512_POPCNT instruction is
+ * available. This is similar to the method above; see
+ * https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features
+ *
+ * Finally, we make sure the xgetbv result is consistent with the CPUID
+ * results.
+ */
+static bool
+pg_popcount512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* Check for AVX512VPOPCNTDQ and AVX512F */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+
+ if ((exx[2] & (0x00004000)) != 0 && (exx[1] & (0x00010000)) != 0)
+ {
+ /*
+ * CPUID succeeded, does the current running OS support the
+ * ZMM registers which are required for AVX512? This check is
+ * required to make sure an old OS on a new CPU is correctly
+ * checked or a VM hypervisor is not excluding AVX512 ZMM
+ * support in the VM; see "5.1.9 Detection of AVX Instructions"
+ * https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html
+ */
+ uint64 xcr = 0;
+#ifdef _MSC_VER
+ uint64 highlow = _xgetbv(xcr);
+
+ return (highlow & 0xE0) == 0xE0;
+#else
+ uint32 high;
+ uint32 low;
+
+ __asm__ __volatile__("xgetbv\t\n" : "=a"(low), "=d"(high) : "c"(xcr));
+ return (low & 0xE0) == 0xE0;
+#endif
+ } /* POPCNT 512 */
+ return false;
+}
+
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static inline void set_function_pointers(void)
+{
+if (pg_popcount512_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount512_fast;
+ }
+ else
+ {
+ if (pg_popcount_available())
+ {
+ pg_popcount32 = pg_popcount32_fast;
+ pg_popcount64 = pg_popcount64_fast;
+ pg_popcount = pg_popcount_fast;
+ }
+ else
+ {
+ pg_popcount32 = pg_popcount32_slow;
+ pg_popcount64 = pg_popcount64_slow;
+ pg_popcount = pg_popcount_slow;
+ }
+ }
+}
+
+static inline int
+pg_popcount32_choose(uint32 word)
+{
+ set_function_pointers();
+ return pg_popcount32(word);
+}
+
+static inline int
+pg_popcount64_choose(uint64 word)
+{
+ set_function_pointers();
+ return pg_popcount64(word);
+}
+
+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+ set_function_pointers();
+ return pg_popcount(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
--
2.34.1
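For readers skimming the diff above, the first-call dispatch used in pg_popcount_x86_64_choose.c can be sketched in isolation. This is only an illustration under stated assumptions: the names popcount_impl, popcount_choose, popcount_portable and choose_calls are hypothetical and do not exist in PostgreSQL, and the "feature probe" is replaced by an unconditional choice of the portable routine so the sketch runs anywhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the PostgreSQL symbols. */
static uint64_t popcount_portable(const char *buf, size_t bytes);
static uint64_t popcount_choose(const char *buf, size_t bytes);

/* The pointer starts at the chooser and is overwritten on the first call. */
static uint64_t (*popcount_impl) (const char *buf, size_t bytes) = popcount_choose;

/* Counts how often the chooser ran, only to make the behavior observable. */
static int	choose_calls = 0;

/* Portable byte-at-a-time popcount (clears the lowest set bit per step). */
static uint64_t
popcount_portable(const char *buf, size_t bytes)
{
	uint64_t	count = 0;

	while (bytes--)
	{
		unsigned char b = (unsigned char) *buf++;

		while (b)
		{
			b &= (unsigned char) (b - 1);
			count++;
		}
	}
	return count;
}

/*
 * Runs once: pick an implementation, store it in the pointer, then forward
 * the current call.  A real chooser would probe CPU features here; this
 * sketch always selects the portable routine (an assumption).
 */
static uint64_t
popcount_choose(const char *buf, size_t bytes)
{
	choose_calls++;
	popcount_impl = popcount_portable;
	return popcount_impl(buf, bytes);
}
```

Callers always go through popcount_impl(buf, bytes). After the first call the chooser is no longer reachable through the pointer, so the probe cost is paid once per process.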
Attachment: v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch (application/octet-stream)
From 86415641d953ef7bc36fe91f75ce9c69878cba63 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amonson@intel.com>
Date: Thu, 21 Mar 2024 11:39:37 -0700
Subject: [PATCH 2/2] [Feature] Added AVX-512 acceleration to the pg_popcount
function.
Signed-off-by: Paul Amonson <paul.d.amonson@intel.com>
---
config/c-compiler.m4 | 37 ++++++
configure | 205 +++++++++++++++++++++++++++++++++
configure.ac | 44 +++++++
meson.build | 72 ++++++++++++
src/Makefile.global.in | 1 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 4 +-
src/makefiles/meson.build | 1 +
src/port/Makefile | 5 +
src/port/meson.build | 6 +-
10 files changed, 382 insertions(+), 5 deletions(-)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..94e3e713aa 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,40 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# ---------------------------
+# Check if the compiler supports the x86_64 AVX512 POPCNT instructions using
+# intrinsics used in CPUID features AVX512F and AVX512VPOPCNTDQ.
+#
+# Optional compiler flags can be passed as argument (e.g. -mavx512vpopcntdq).
+# If the intrinsics are supported then pgac_avx512_popcnt_intrinsics and
+# CFLAGS_AVX512_POPCNT are set.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>],
+ [const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..7253cb8154 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17474,175 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __immintrin" >&5
+$as_echo_n "checking for __immintrin... " >&6; }
+if ${pgac_cv__immintrin+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+/* Don't exclude code so added return. */
+ return 1701;
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__immintrin="yes"
+else
+ pgac_cv__immintrin="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__immintrin" >&5
+$as_echo "$pgac_cv__immintrin" >&6; }
+if test x"$pgac_cv__immintrin" = x"yes"; then
+
+$as_echo "#define HAVE__IMMINTRIN 1" >>confdefs.h
+
+fi
+
+# Check for AVX512 intrinsics to do POPCNT calculations.
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+int
+main ()
+{
+const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+
+$as_echo "#define HAVE__AVX512_POPCNT 1" >>confdefs.h
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..06e8eaa6d0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2075,38 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+AC_CACHE_CHECK([for __immintrin], [pgac_cv__immintrin],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [[/* Don't exclude code so added return. */
+ return 1701;
+ ]])],
+ [pgac_cv__immintrin="yes"],
+ [pgac_cv__immintrin="no"])])
+if test x"$pgac_cv__immintrin" = x"yes"; then
+ AC_DEFINE(HAVE__IMMINTRIN, 1, [Define to 1 if you have immintrin.])
+fi
+
+# Check for AVX512 intrinsics to do POPCNT calculations.
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ AC_DEFINE(HAVE__AVX512_POPCNT, 1, [Define to 1 if you have cpu
+ support for AVX512 POPCNT.])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index c8fdfeb0ec..cd508096e5 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,37 @@ elif cc.links('''
endif
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.has_header('immintrin.h',
+ include_directories: postgres_inc, args: test_c_args)
+ cdata.set('HAVE__IMMINTRIN', 1,
+ description: 'Define to 1 if you have the immintrin.h header file.')
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -2158,6 +2189,47 @@ endif
+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+ test_flags = ['-mavx512vpopcntdq']
+ if host_system == 'windows'
+ test_flags = ['/arch:AVX512']
+ endif
+ prog = '''
+ #include <immintrin.h>
+ #include <stdint.h>
+ #include <stdlib.h>
+ #include <string.h>
+ int main(void)
+ {
+ const uint64_t *buf = malloc((size_t)64);
+ uint64_t popcnt = 0;
+ __m512i accumulator = _mm512_setzero_si512();
+ const __m512i v = _mm512_loadu_si512((const __m512i *)buf);
+ const __m512i p = _mm512_popcnt_epi64(v);
+ memset(buf, 0, 64);
+ accumulator = _mm512_add_epi64(accumulator, p);
+ popcnt = _mm512_reduce_add_epi64(accumulator);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ }'''
+ if cc.links(prog, name: '_mm512_* methods with -mavx512vpopcntdq flag.',
+ args: test_c_args + test_flags)
+ have_avx512_popcnt = true
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cdata.set('HAVE__AVX512_POPCNT', 1)
+ cflags_avx512_popcnt = test_flags
+ else
+ have_avx512_popcnt = false
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt = []
+ endif # compile/link test
+endif # host_cpu check
+
###############################################################
# Library / OS tests
###############################################################
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..089f49b7f3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..33a831e768 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -558,6 +558,18 @@
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
+/* Define to 1 if you have immintrin. */
+#undef HAVE__IMMINTRIN
+
+/* Define to 1 if you have AVX512. */
+#undef HAVE__AVX512_POPCNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 26f6a48377..8215316b0e 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -306,8 +306,8 @@ extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
#else
/*
- * Use a portable implementation -- no need for a function pointer. Use
- * inlining for small speed increase.
+ * Use a portable implementation -- no need for a function pointer. Use
+ * inlining for small speed increase.
*/
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..ee3647282e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,6 +100,7 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index 1499985dfc..92bd419953 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -89,6 +89,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer processors can use AVX-512 POPCNT Capabilities
+pg_popcount_x86_64_accel.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_x86_64_accel_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_x86_64_accel_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index cf6e9fa06c..0647e7a4f7 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,7 +8,6 @@ pgport_sources = [
'path.c',
'pg_bitutils.c',
'pg_popcount_x86_64_choose.c',
- 'pg_popcount_x86_64_accel.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -86,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_x86_64_accel', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -100,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512': []}
foreach f : replace_funcs_neg
func = f.get(0)
--
2.34.1
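As a companion to the runtime check in pg_popcount512_available(), here is a minimal stand-alone sketch of the detection sequence for GCC/clang on x86. It is not the patch's code: avx512_popcnt_usable and SKETCH_HAVE_CPUID are hypothetical names, and the sketch makes two assumptions beyond what the patch does. It tests the OSXSAVE bit (CPUID leaf 1, ECX bit 27) before executing XGETBV, and it requires the SSE and AVX state bits in XCR0 in addition to the three AVX-512 state bits (mask 0xE6). On other compilers or architectures it simply reports false.

```c
#include <stdbool.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
	(defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

/* Hypothetical helper for illustration; not a PostgreSQL function. */
static bool
avx512_popcnt_usable(void)
{
#ifdef SKETCH_HAVE_CPUID
	unsigned int eax,
				ebx,
				ecx,
				edx;
	uint32_t	xcr0_lo;
	uint32_t	xcr0_hi;

	/* Leaf 1: OSXSAVE (ECX bit 27) must be set before XGETBV is legal. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) ||
		(ecx & (1u << 27)) == 0)
		return false;

	/*
	 * XCR0 bits 1-2 cover SSE/AVX state, bits 5-7 cover the opmask and ZMM
	 * state.  All of them must be enabled by the OS (mask 0xE6).
	 */
	__asm__ __volatile__("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
	(void) xcr0_hi;
	if ((xcr0_lo & 0xE6) != 0xE6)
		return false;

	/* Leaf 7, subleaf 0: AVX512F is EBX bit 16, AVX512VPOPCNTDQ is ECX bit 14. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;

	return (ebx & (1u << 16)) != 0 && (ecx & (1u << 14)) != 0;
#else
	return false;				/* no CPUID access in this build */
#endif
}
```

The ordering is the point of the sketch: the OS-support check comes first so that XGETBV is never executed on a system that has not enabled XSAVE, and the feature bits are only trusted once the ZMM state is known to be saved and restored by the OS.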
"Amonson, Paul D" <paul.d.amonson@intel.com> writes:
I am re-posting the patches as CI for Mac failed (CI error not code/test error). The patches are the same as last time.
Just for a note --- the cfbot will re-test existing patches every
so often without needing a bump. The current cycle period seems to
be about two days.
regards, tom lane
-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Monday, March 25, 2024 8:12 AM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: David Rowley <dgrowleyml@gmail.com>; Nathan Bossart
Subject: Re: Popcount optimization using AVX512
...
Just for a note --- the cfbot will re-test existing patches every so often without
needing a bump. The current cycle period seems to be about two days.
regards, tom lane
Good to know! Maybe this is why I thought it originally passed CI and suddenly this morning there is a failure. I noticed at least 2 other patch runs also failed in the same way.
Thanks,
Paul
On 3/25/24 11:12, Tom Lane wrote:
"Amonson, Paul D" <paul.d.amonson@intel.com> writes:
I am re-posting the patches as CI for Mac failed (CI error not code/test error). The patches are the same as last time.
Just for a note --- the cfbot will re-test existing patches every
so often without needing a bump. The current cycle period seems to
be about two days.
Just an FYI -- there seems to be an issue with all three of the macos
cfbot runners (mine included). I spent time over the weekend working
with Thomas Munro (added to CC list) trying different fixes to no avail.
Help from macos CI wizards would be gratefully accepted...
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Amonson, Paul D <paul.d.amonson@intel.com>
Sent: Monday, March 25, 2024 8:20 AM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: David Rowley <dgrowleyml@gmail.com>; Nathan Bossart
<nathandbossart@gmail.com>; Andres Freund <andres@anarazel.de>; Alvaro
Herrera <alvherre@alvh.no-ip.org>; Shankaran, Akash
<akash.shankaran@intel.com>; Noah Misch <noah@leadboat.com>; Matthias
van de Meent <boekewurm+postgres@gmail.com>; pgsql-
hackers@lists.postgresql.org
Subject: RE: Popcount optimization using AVX512
Ok, CI turned green after my re-post of the patches. Can this please get merged?
Thanks,
Paul
On Mon, Mar 25, 2024 at 06:42:36PM +0000, Amonson, Paul D wrote:
Ok, CI turned green after my re-post of the patches. Can this please get
merged?
Thanks for the new patches. I intend to take another look soon.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 25, 2024 at 03:05:51PM -0500, Nathan Bossart wrote:
On Mon, Mar 25, 2024 at 06:42:36PM +0000, Amonson, Paul D wrote:
Ok, CI turned green after my re-post of the patches. Can this please get
merged?
Thanks for the new patches. I intend to take another look soon.
Thanks for your patience. I spent most of my afternoon looking into the
latest patch set, but I needed to do a CHECKPOINT and take a break. I am
in the middle of doing some rather heavy editorialization, but the core of
your changes will remain the same (and so I still intend to give you
authorship credit). I've attached what I have so far, which is still
missing the configuration checks and the changes to make sure the extra
compiler flags make it to the right places.
Unless something pops up while I work on the remainder of this patch, I
think we'll end up going with a simpler approach. I originally set out to
make this look like the CRC32C stuff (e.g., a file per implementation), but
that seemed primarily useful if we can choose which files need to be
compiled at configure-time. However, the TRY_POPCNT_FAST macro is defined
at compile-time (AFAICT for good reason [0]), so we end up having to
compile all the files in many cases anyway, and we continue to need to
surround lots of code with "#ifdef TRY_POPCNT_FAST" or similar. So, my
current thinking is that we should only move the AVX512 stuff to its own
file for the purposes of compiling it with special flags when possible. (I
realize that I'm essentially recanting much of my previous feedback, which
I apologize for.)
[0]: /messages/by-id/CAApHDvrONNcYxGV6C0O3ZmaL0BvXBWY+rBOCBuYcQVUOURwhkA@mail.gmail.com
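For readers less familiar with the pattern being discussed, the following is a minimal sketch of a self-replacing function pointer of the kind pg_bitutils.c uses. It is an illustration under stated assumptions, not the patch itself: every demo_* name is hypothetical, demo_wide_available() is a stub standing in for a real CPUID check, and demo_popcount_wide() merely stands in for an implementation that would live in its own file so it can be built with special compiler flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a CPUID-based runtime check; always "no" here. */
bool
demo_wide_available(void)
{
	return false;
}

/* Portable fallback: count bits one byte at a time. */
uint64_t
demo_popcount_slow(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	assert(bytes >= 0);
	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		for (; b != 0; b >>= 1)
			popcnt += b & 1;
	}
	return popcnt;
}

/*
 * Stand-in for an implementation kept in a separate file (assumption: in a
 * real build this is the only code compiled with the extra flags).
 */
uint64_t
demo_popcount_wide(const char *buf, int bytes)
{
	return demo_popcount_slow(buf, bytes);
}

/* Counts how often the chooser runs; only for demonstration. */
int			demo_choose_calls = 0;

static uint64_t demo_popcount_choose(const char *buf, int bytes);

/* Callers always go through this pointer; it starts at the chooser. */
uint64_t	(*demo_popcount) (const char *buf, int bytes) = demo_popcount_choose;

/*
 * Runs on the first call only: picks an implementation, overwrites the
 * pointer, then forwards the call to the chosen function.
 */
static uint64_t
demo_popcount_choose(const char *buf, int bytes)
{
	demo_choose_calls++;
	demo_popcount = demo_wide_available() ? demo_popcount_wide : demo_popcount_slow;
	return demo_popcount(buf, bytes);
}
```

After the first call through demo_popcount, later calls go straight to the selected implementation, so the detection cost is paid once per process.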
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v13-0001-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From 031eb4a365665edd304f0281ad7e412341504749 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v13 1/1] AVX512 popcount support
---
src/include/port/pg_bitutils.h | 16 +++++++
src/port/Makefile | 1 +
src/port/meson.build | 1 +
src/port/pg_bitutils.c | 53 ++++++++------------
src/port/pg_popcount_avx512.c | 88 ++++++++++++++++++++++++++++++++++
5 files changed, 125 insertions(+), 34 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..4b1e4d92b4 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,22 @@ pg_ceil_log2_64(uint64 num)
#endif
#endif
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files.
+ */
+#if defined(TRY_POPCNT_FAST) && \
+ defined(HAVE__IMMINTRIN) && \
+ defined(HAVE__AVX512_POPCNT)
+#if defined(HAVE__GET_CPUID_COUNT) || defined(HAVE__CPUIDEX)
+#define TRY_POPCNT_AVX512 1
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+#endif
+
#ifdef TRY_POPCNT_FAST
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..eb1e56fe41 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..c77bbd3168 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..2f9a6690e0 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -142,20 +142,18 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
{
if (pg_popcount_available())
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+#ifdef TRY_POPCNT_AVX512
+ if (pg_popcount_avx512_available())
+ pg_popcount = pg_popcount_avx512;
+#endif
}
else
{
@@ -163,45 +161,32 @@ pg_popcount32_choose(uint32 word)
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
}
+}
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static int
+pg_popcount32_choose(uint32 word)
+{
+ choose_popcount_functions();
return pg_popcount32(word);
}
static int
pg_popcount64_choose(uint64 word)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount64(word);
}
static uint64
pg_popcount_choose(const char *buf, int bytes)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount(buf, bytes);
}
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..7c595a4b33
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include "port/pg_bitutils.h"
+
+/*
+ * XXX: Someday we should figure out how to determine whether this file needs
+ * to be compiled at configure-time instead of relying on macros that are
+ * determined at compile-time.
+ */
+#ifdef TRY_POPCNT_AVX512
+
+/*
+ * Return true if CPUID indicates that the AVX512 POPCNT instruction is
+ * available.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+
+ if ((exx[1] & (1 << 16)) != 0 &&
+ (exx[2] & (1 << 14)) != 0)
+ {
+ /*
+ * We also need to check that the OS has enabled support for the ZMM
+ * registers.
+ */
+#ifdef _MSC_VER
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ uint64 xcr = 0;
+ uint32 high;
+ uint32 low;
+
+__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
+ return (low & 0xe0) != 0;
+#endif
+ }
+
+ return false;
+}
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_AVX512 */
--
2.25.1
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Wednesday, March 27, 2024 3:00 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>
... (I realize that I'm essentially
recanting much of my previous feedback, which I apologize for.)
It happens. LOL As long as the algorithm for AVX-512 is not altered I am confident that your new refactor will be fine. :)
Thanks,
Paul
Here is a v14 of the patch that I think is beginning to approach something
committable. Besides general review and testing, there are two things that
I'd like to bring up:
* The latest patch set from Paul Amonson appeared to support MSVC in the
meson build, but not the autoconf one. I don't have much expertise here,
so the v14 patch doesn't have any autoconf/meson support for MSVC, which
I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds
can always compile the x86_64 popcount code, but I don't know whether
that's safe for AVX512.
* I think we need to verify there isn't a huge performance regression for
smaller arrays. IIUC those will still require an AVX512 instruction or
two as well as a function call, which might add some noticeable overhead.
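One way to probe the small-array concern is a harness that cross-checks a block-at-a-time candidate against a simple reference at every small size and then times both. The sketch below is only that, a sketch: all names are hypothetical, and block_popcount() uses portable C (64-byte blocks plus a scalar tail, mirroring the shape of pg_popcount_avx512) rather than AVX512 intrinsics, so its timings say nothing about the real instruction. The candidate would need to be swapped for the actual implementation on suitable hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Reference: one byte at a time, clearing the lowest set bit each step. */
uint64_t
ref_popcount(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	assert(bytes >= 0);
	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		for (; b != 0; b &= (unsigned char) (b - 1))
			popcnt++;
	}
	return popcnt;
}

/* Portable bit-parallel popcount of one 64-bit word. */
static uint64_t
word_popcount(uint64_t w)
{
	w = w - ((w >> 1) & 0x5555555555555555ULL);
	w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
	w = (w + (w >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
	return (w * 0x0101010101010101ULL) >> 56;
}

/* Candidate: whole 64-byte blocks first, then the tail via the reference. */
uint64_t
block_popcount(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	assert(bytes >= 0);
	for (; bytes >= 64; bytes -= 64)
	{
		uint64_t	words[8];

		memcpy(words, buf, sizeof(words));	/* avoids unaligned access */
		for (int i = 0; i < 8; i++)
			popcnt += word_popcount(words[i]);
		buf += 64;
	}
	return popcnt + ref_popcount(buf, bytes);
}

/* Average CPU seconds per call of fn over the given number of calls. */
double
seconds_per_call(uint64_t (*fn) (const char *, int),
				 const char *buf, int bytes, int calls)
{
	volatile uint64_t sink = 0; /* keeps the calls from being optimized out */
	clock_t		start = clock();

	for (int i = 0; i < calls; i++)
		sink += fn(buf, bytes);
	(void) sink;
	return (double) (clock() - start) / CLOCKS_PER_SEC / calls;
}
```

Comparing seconds_per_call() for both functions at sizes such as 1, 8, 32, and 64 bytes would show where the per-call overhead starts to matter, assuming enough calls to rise above the resolution of clock().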
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v14-0001-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From 9b5725e36aa8cff7caeb8683e11cd09bd5bda745 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v14 1/1] AVX512 popcount support
---
config/c-compiler.m4 | 34 +++++++
configure | 165 +++++++++++++++++++++++++++++++++
configure.ac | 34 +++++++
meson.build | 59 ++++++++++++
src/Makefile.global.in | 1 +
src/include/pg_config.h.in | 9 ++
src/include/port/pg_bitutils.h | 20 ++++
src/makefiles/meson.build | 1 +
src/port/Makefile | 6 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 56 ++++-------
src/port/pg_popcount_avx512.c | 98 ++++++++++++++++++++
12 files changed, 451 insertions(+), 38 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..f881e7ec28 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..189264b86e 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17474,135 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for AVX512 popcount intrinsics
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..ced39c9055 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2075,28 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for AVX512 popcount intrinsics
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+fi
+if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..2399b90d6e 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,41 @@ int main(void)
endif
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_avx512_popcnt = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..a6c0c4a692 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,6 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..133d8ba071 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -555,9 +555,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +686,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..c69a85e08e 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,12 +298,32 @@ pg_ceil_log2_64(uint64 num)
#endif
#endif
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files.
+ *
+ * NB: We assume that there's no hope of AVX512 popcount support if the "fast"
+ * implementations aren't available.
+ */
+#if defined(TRY_POPCNT_FAST) && defined(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK)
+#if defined(HAVE__GET_CPUID_COUNT) || defined(HAVE__CPUIDEX)
+#define TRY_POPCNT_AVX512 1
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+#endif
+
#ifdef TRY_POPCNT_FAST
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..c2345cc95f 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -99,6 +99,7 @@ pgxs_kv = {
'PERMIT_DECLARATION_AFTER_STATEMENT':
' '.join(cflags_no_decl_after_statement),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_CRC': ' '.join(cflags_crc),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..fd2c59aec6 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
@@ -92,6 +93,11 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512.o need CFLAGS_AVX512_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..4e69fe8e91 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -84,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512_popcnt'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512_popcnt': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512_popcnt': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..61cd049553 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,6 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -142,20 +141,18 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
{
if (pg_popcount_available())
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+#ifdef TRY_POPCNT_AVX512
+ if (pg_popcount_avx512_available())
+ pg_popcount = pg_popcount_avx512;
+#endif
}
else
{
@@ -163,45 +160,32 @@ pg_popcount32_choose(uint32 word)
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
}
+}
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static int
+pg_popcount32_choose(uint32 word)
+{
+ choose_popcount_functions();
return pg_popcount32(word);
}
static int
pg_popcount64_choose(uint64 word)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount64(word);
}
static uint64
pg_popcount_choose(const char *buf, int bytes)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount(buf, bytes);
}
@@ -243,7 +227,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..66ca92c029
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef HAVE__GET_CPUID_COUNT
+#include <cpuid.h>
+#endif
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+#include <immintrin.h>
+#endif
+
+#ifdef HAVE__CPUIDEX
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * XXX: Someday we should figure out how to determine whether this file needs
+ * to be compiled at configure-time instead of relying on macros that are
+ * determined at compile-time.
+ */
+#ifdef TRY_POPCNT_AVX512
+
+/*
+ * Return true if CPUID indicates that the AVX512 POPCNT instruction is
+ * available.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+
+ if ((exx[1] & (1 << 16)) != 0 && /* avx512f */
+ (exx[2] & (1 << 14)) != 0) /* avx512vpopcntdq */
+ {
+ /*
+ * We also need to check that the OS has enabled support for the ZMM
+ * registers.
+ */
+#ifdef _MSC_VER
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ uint64 xcr = 0;
+ uint32 high;
+ uint32 low;
+
+__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
+ return (low & 0xe0) != 0;
+#endif
+ }
+
+ return false;
+}
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_AVX512 */
--
2.25.1
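For readers without AVX512 hardware, the shape of the loop in pg_popcount_avx512() can be sketched in portable C. This is only an illustration under stated assumptions: popcount_blocked() and popcount_tail() are hypothetical names, the eight uint64_t lanes stand in for one __m512i accumulator, and __builtin_popcount/__builtin_popcountll assume GCC or clang. It is not the patch's code and says nothing about its performance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64					/* width of one 512-bit register */
#define LANES (BLOCK_BYTES / sizeof(uint64_t))

/* Scalar tail handler; plays the role pg_popcount_fast() has in the patch. */
static uint64_t
popcount_tail(const unsigned char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

	while (bytes--)
		popcnt += (uint64_t) __builtin_popcount(*buf++);
	return popcnt;
}

/* Hypothetical name: count bits in 64-byte blocks, then hand off the tail. */
static uint64_t
popcount_blocked(const char *data, size_t bytes)
{
	const unsigned char *buf = (const unsigned char *) data;
	uint64_t	accum[LANES] = {0};	/* stands in for the __m512i accumulator */
	uint64_t	popcnt = 0;

	assert(data != NULL || bytes == 0);

	for (; bytes >= BLOCK_BYTES; bytes -= BLOCK_BYTES)
	{
		for (size_t i = 0; i < LANES; i++)
		{
			uint64_t	word;

			/* memcpy keeps the load valid for unaligned input */
			memcpy(&word, buf + i * sizeof(word), sizeof(word));
			accum[i] += (uint64_t) __builtin_popcountll(word);
		}
		buf += BLOCK_BYTES;
	}

	/* a single horizontal reduction after the loop, not one per block */
	for (size_t i = 0; i < LANES; i++)
		popcnt += accum[i];

	return popcnt + popcount_tail(buf, bytes);
}
```

Keeping per-lane sums and reducing once at the end mirrors why the patch calls _mm512_reduce_add_epi64() outside the loop rather than inside it.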
On Thu, Mar 28, 2024 at 04:38:54PM -0500, Nathan Bossart wrote:
Here is a v14 of the patch that I think is beginning to approach something
committable. Besides general review and testing, there are two things that
I'd like to bring up:
* The latest patch set from Paul Amonson appeared to support MSVC in the
meson build, but not the autoconf one. I don't have much expertise here,
so the v14 patch doesn't have any autoconf/meson support for MSVC, which
I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds
can always compile the x86_64 popcount code, but I don't know whether
that's safe for AVX512.
* I think we need to verify there isn't a huge performance regression for
smaller arrays. IIUC those will still require an AVX512 instruction or
two as well as a function call, which might add some noticeable overhead.
I forgot to mention that I also want to understand whether we can actually
assume availability of XGETBV when CPUID says we support AVX512:
+ /*
+ * We also need to check that the OS has enabled support for the ZMM
+ * registers.
+ */
+#ifdef _MSC_VER
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ uint64 xcr = 0;
+ uint32 high;
+ uint32 low;
+
+__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
+ return (low & 0xe0) != 0;
+#endif
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Thursday, March 28, 2024 2:39 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>
* The latest patch set from Paul Amonson appeared to support MSVC in the
meson build, but not the autoconf one. I don't have much expertise here,
so the v14 patch doesn't have any autoconf/meson support for MSVC, which
I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds
can always compile the x86_64 popcount code, but I don't know whether
that's safe for AVX512.
I also do not know how to integrate MSVC+Autoconf, the CI uses MSVC+Meson+Ninja so I stuck with that.
* I think we need to verify there isn't a huge performance regression for
smaller arrays. IIUC those will still require an AVX512 instruction or
two as well as a function call, which might add some noticeable overhead.
Not considering your changes, I had already tested small buffers. At less than 512 bytes there was no measurable regression (there was one extra condition check) and for 512+ bytes it moved from no regression to some gains between 512 and 4096 bytes. Assuming you introduced no extra function calls, it should be the same.
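A small-buffer check of this kind could be sketched as below. Everything here is an assumption made for illustration: popcount_under_test() is a stand-in rather than PostgreSQL's implementation, ns_per_call() is a hypothetical helper, clock() is a coarse timer, and the builtins and the empty asm barrier assume GCC or clang. It is not the benchmark that produced the numbers quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Stand-in for the implementation being measured. */
static uint64_t
popcount_under_test(const char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

	assert(buf != NULL || bytes == 0);

	for (; bytes >= sizeof(uint64_t); bytes -= sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));
		popcnt += (uint64_t) __builtin_popcountll(word);
		buf += sizeof(word);
	}
	while (bytes--)
		popcnt += (uint64_t) __builtin_popcount((unsigned char) *buf++);
	return popcnt;
}

/* Hypothetical helper: average nanoseconds per call for one buffer size. */
static double
ns_per_call(const char *buf, size_t bytes, long iterations)
{
	volatile uint64_t sink = 0;	/* keeps the results live */
	clock_t		start;
	clock_t		end;

	start = clock();
	for (long i = 0; i < iterations; i++)
	{
		/* compiler barrier so the call is not hoisted out of the loop */
		__asm__ __volatile__("" : : "r"(buf) : "memory");
		sink += popcount_under_test(buf, bytes);
	}
	end = clock();
	(void) sink;

	return (double) (end - start) / CLOCKS_PER_SEC * 1e9 / (double) iterations;
}
```

Calling ns_per_call() for sizes such as 8, 64, 512 and 4096 bytes, once per candidate implementation, would show where the per-call overhead starts to matter.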
I forgot to mention that I also want to understand whether we can actually assume availability of XGETBV when CPUID says we support AVX512:
You cannot assume as there are edge cases where AVX-512 was found on system one during compile but it's not actually available in a kernel on a second system at runtime despite the CPU actually having the hardware feature.
I will review the new patch to see if there is anything that jumps out at me.
Thanks,
Paul
On 2024-Mar-28, Amonson, Paul D wrote:
-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Thursday, March 28, 2024 2:39 PM
To: Amonson, Paul D <paul.d.amonson@intel.com>
* The latest patch set from Paul Amonson appeared to support MSVC in the
meson build, but not the autoconf one. I don't have much expertise here,
so the v14 patch doesn't have any autoconf/meson support for MSVC, which
I thought might be okay for now. IIUC we assume that 64-bit/MSVC builds
can always compile the x86_64 popcount code, but I don't know whether
that's safe for AVX512.
I also do not know how to integrate MSVC+Autoconf, the CI uses
MSVC+Meson+Ninja so I stuck with that.
We don't do MSVC via autoconf/Make. We used to have a special build
framework for MSVC which parsed Makefiles to produce "solution" files,
but it was removed as soon as Meson was mature enough to build. See
commit 1301c80b2167. If it builds with Meson, you're good.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere." (Lamar Owen)
-----Original Message-----
From: Amonson, Paul D <paul.d.amonson@intel.com>
Sent: Thursday, March 28, 2024 3:03 PM
To: Nathan Bossart <nathandbossart@gmail.com>
...
I will review the new patch to see if there is anything that jumps out at me.
I see in the meson.build you added the new file twice?
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -84,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512_popcnt'],
I was putting the file with special flags ONLY in the second section and all seemed to work. :)
Everything else seems good to me.
Thanks,
Paul
On Thu, Mar 28, 2024 at 10:03:04PM +0000, Amonson, Paul D wrote:
* I think we need to verify there isn't a huge performance regression for
smaller arrays. IIUC those will still require an AVX512 instruction or
two as well as a function call, which might add some noticeable overhead.
Not considering your changes, I had already tested small buffers. At less
than 512 bytes there was no measurable regression (there was one extra
condition check) and for 512+ bytes it moved from no regression to some
gains between 512 and 4096 bytes. Assuming you introduced no extra
function calls, it should be the same.
Cool. I think we should run the benchmarks again to be safe, though.
I forgot to mention that I also want to understand whether we can
actually assume availability of XGETBV when CPUID says we support
AVX512:
You cannot assume as there are edge cases where AVX-512 was found on
system one during compile but it's not actually available in a kernel on
a second system at runtime despite the CPU actually having the hardware
feature.
Yeah, I understand that much, but I want to know how portable the XGETBV
instruction is. Unless I can assume that all x86_64 systems and compilers
support that instruction, we might need an additional configure check
and/or CPUID check. It looks like MSVC has had support for the _xgetbv
intrinsic for quite a while, but I'm still researching the other cases.
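One possible shape for the extra CPUID check, as a sketch only: the OSXSAVE bit (CPUID leaf 1, ECX bit 27) reports whether the OS has enabled XGETBV, so it can be tested before reading XCR0. zmm_state_enabled() is a hypothetical name, the code assumes GCC or clang with <cpuid.h>, and it simply returns false elsewhere (MSVC would need _xgetbv instead). Note that it requires all of the SSE, AVX, opmask and ZMM state bits (mask 0xe6), which is stricter than the "!= 0" test in the quoted hunk.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_GNU_X86_CPUID 1
#endif

/* Hypothetical helper: true if the OS has enabled ZMM register state. */
static bool
zmm_state_enabled(void)
{
#ifdef HAVE_GNU_X86_CPUID
	unsigned int eax;
	unsigned int ebx;
	unsigned int ecx;
	unsigned int edx;
	uint32_t	low;
	uint32_t	high;

	/* CPUID leaf 1: ECX bit 27 (OSXSAVE) means XGETBV may be executed. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	if ((ecx & (1u << 27)) == 0)
		return false;

	/* Read XCR0 (selected by ECX = 0). */
	__asm__ __volatile__("xgetbv" : "=a"(low), "=d"(high) : "c"(0));
	(void) high;

	/* bits 1-2: SSE and AVX state; bits 5-7: opmask, ZMM_Hi256, Hi16_ZMM */
	return (low & 0xe6) == 0xe6;
#else
	/* Not x86 or not a GNU-compatible compiler: report "not available". */
	return false;
#endif
}
```

With a guard like this, the AVX512 feature bits from CPUID leaf 7 would only be trusted once the XCR0 test has passed.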
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote:
We don't do MSVC via autoconf/Make. We used to have a special build
framework for MSVC which parsed Makefiles to produce "solution" files,
but it was removed as soon as Meson was mature enough to build. See
commit 1301c80b2167. If it builds with Meson, you're good.
The latest cfbot build for this seems to indicate that at least newer MSVC
knows AVX512 intrinsics without any special compiler flags [0], so maybe
what I had in v14 is good enough. A previous version of the patch set [1]
had the following lines:
+ if host_system == 'windows'
+ test_flags = ['/arch:AVX512']
+ endif
I'm not sure if this is needed for older MSVC or something else. IIRC I
couldn't find any other examples of this sort of thing in the meson
scripts, either. Paul, do you recall why you added this?
[0]: https://cirrus-ci.com/task/5787206636273664?logs=configure#L159
[1]: /messages/by-id/attachment/158206/v12-0002-Feature-Added-AVX-512-acceleration-to-the-pg_popcoun.patch
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Mar 28, 2024 at 10:29:47PM +0000, Amonson, Paul D wrote:
I see in the meson.build you added the new file twice?
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -84,6 +85,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512_popcnt'],
I was putting the file with special flags ONLY in the second section and all seemed to work. :)
Ah, yes, I think that's a mistake, and without looking closely, might
explain the MSVC warnings [0]:
[22:05:47.444] pg_popcount_avx512.c.obj : warning LNK4006: pg_popcount_avx512_available already defined in pg_popcount_a...
It might be nice if we conditionally built pg_popcount_avx512.o in autoconf
builds, too, but AFAICT we still need to wrap most of that code with
macros, so I'm not sure it's worth the trouble. I'll take another look at
this...
[0]: http://commitfest.cputube.org/highlights/all.html#4883
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v15-0001-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From c924b57f8479e51aa30c8e3cfe194a2ab85497ff Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v15 1/1] AVX512 popcount support
---
config/c-compiler.m4 | 34 +++++++
configure | 165 +++++++++++++++++++++++++++++++++
configure.ac | 34 +++++++
meson.build | 59 ++++++++++++
src/Makefile.global.in | 1 +
src/include/pg_config.h.in | 9 ++
src/include/port/pg_bitutils.h | 20 ++++
src/makefiles/meson.build | 1 +
src/port/Makefile | 6 ++
src/port/meson.build | 5 +-
src/port/pg_bitutils.c | 56 ++++-------
src/port/pg_popcount_avx512.c | 98 ++++++++++++++++++++
12 files changed, 450 insertions(+), 38 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..f881e7ec28 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..189264b86e 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17474,135 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for AVX512 popcount intrinsics
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_AVX512_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+fi
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..ced39c9055 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,18 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+# Check for x86 cpuid_count instruction
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2075,28 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for AVX512 popcount intrinsics
+#
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+fi
+if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_AVX512_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..2399b90d6e 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,41 @@ int main(void)
endif
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_avx512_popcnt = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_avx512_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..a6c0c4a692 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,6 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_AVX512_POPCNT = @CFLAGS_AVX512_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..133d8ba071 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -555,9 +555,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +686,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..c69a85e08e 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,12 +298,32 @@ pg_ceil_log2_64(uint64 num)
#endif
#endif
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files.
+ *
+ * NB: We assume that there's no hope of AVX512 popcount support if the "fast"
+ * implementations aren't available.
+ */
+#if defined(TRY_POPCNT_FAST) && defined(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK)
+#if defined(HAVE__GET_CPUID_COUNT) || defined(HAVE__CPUIDEX)
+#define TRY_POPCNT_AVX512 1
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+#endif
+
#ifdef TRY_POPCNT_FAST
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..c2345cc95f 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -99,6 +99,7 @@ pgxs_kv = {
'PERMIT_DECLARATION_AFTER_STATEMENT':
' '.join(cflags_no_decl_after_statement),
+ 'CFLAGS_AVX512_POPCNT': ' '.join(cflags_avx512_popcnt),
'CFLAGS_CRC': ' '.join(cflags_crc),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..fd2c59aec6 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
@@ -92,6 +93,11 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512.o need CFLAGS_AVX512_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..6f34c837c2 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,7 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512_popcnt'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +99,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'avx512_popcnt': cflags_avx512_popcnt}
+pgport_sources_cflags = {'crc': [], 'avx512_popcnt': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..61cd049553 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,6 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -142,20 +141,18 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
{
if (pg_popcount_available())
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+#ifdef TRY_POPCNT_AVX512
+ if (pg_popcount_avx512_available())
+ pg_popcount = pg_popcount_avx512;
+#endif
}
else
{
@@ -163,45 +160,32 @@ pg_popcount32_choose(uint32 word)
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
}
+}
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static int
+pg_popcount32_choose(uint32 word)
+{
+ choose_popcount_functions();
return pg_popcount32(word);
}
static int
pg_popcount64_choose(uint64 word)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount64(word);
}
static uint64
pg_popcount_choose(const char *buf, int bytes)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount(buf, bytes);
}
@@ -243,7 +227,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..66ca92c029
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef HAVE__GET_CPUID_COUNT
+#include <cpuid.h>
+#endif
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+#include <immintrin.h>
+#endif
+
+#ifdef HAVE__CPUIDEX
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * XXX: Someday we should figure out how to determine whether this file needs
+ * to be compiled at configure-time instead of relying on macros that are
+ * determined at compile-time.
+ */
+#ifdef TRY_POPCNT_AVX512
+
+/*
+ * Return true if CPUID indicates that the AVX512 POPCNT instruction is
+ * available.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+
+ if ((exx[1] & (1 << 16)) != 0 && /* avx512f */
+ (exx[2] & (1 << 14)) != 0) /* avx512vpopcntdq */
+ {
+ /*
+ * We also need to check that the OS has enabled support for the ZMM
+ * registers.
+ */
+#ifdef _MSC_VER
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ uint64 xcr = 0;
+ uint32 high;
+ uint32 low;
+
+__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
+ return (low & 0xe0) != 0;
+#endif
+ }
+
+ return false;
+}
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_AVX512 */
--
2.25.1
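For readers skimming the pg_bitutils.c hunk above, the "choose" pattern it refactors can be sketched in isolation. This is only an illustration, not the patch itself: every demo_* name is hypothetical, and demo_fast_path_available() is a stand-in for the real CPUID probe.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
int demo_popcount32_slow(uint32_t word)
{
    int n = 0;

    while (word != 0)
    {
        word &= word - 1;
        n++;
    }
    return n;
}

/* Hypothetical "fast" variant; uses a compiler builtin where one exists. */
int demo_popcount32_fast(uint32_t word)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(word);
#else
    return demo_popcount32_slow(word);
#endif
}

/* Stand-in for a runtime probe (the real code would consult CPUID). */
int demo_fast_path_available(void)
{
    return 1;
}

int demo_popcount32_choose(uint32_t word);

/* The pointer starts at the chooser, which rebinds it on first use. */
int (*demo_popcount32)(uint32_t word) = demo_popcount32_choose;

/* One shared helper decides, so each chooser stays a two-liner. */
void demo_choose_popcount_functions(void)
{
    if (demo_fast_path_available())
        demo_popcount32 = demo_popcount32_fast;
    else
        demo_popcount32 = demo_popcount32_slow;
}

int demo_popcount32_choose(uint32_t word)
{
    demo_choose_popcount_functions();
    return demo_popcount32(word);
}
```

After the first call through demo_popcount32, later calls go straight to the selected implementation, which is the effect the comment in the hunk describes.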
-----Original Message-----
Cool. I think we should run the benchmarks again to be safe, though.
Ok, sure go ahead. :)
I forgot to mention that I also want to understand whether we can
actually assume availability of XGETBV when CPUID says we support
AVX512:
You cannot assume as there are edge cases where AVX-512 was found on
system one during compile but it's not actually available in a kernel
on a second system at runtime despite the CPU actually having the
hardware feature.
Yeah, I understand that much, but I want to know how portable the XGETBV
instruction is. Unless I can assume that all x86_64 systems and compilers
support that instruction, we might need an additional configure check and/or
CPUID check. It looks like MSVC has had support for the _xgetbv intrinsic for
quite a while, but I'm still researching the other cases.
I see google web references to the xgetbv instruction as far back as 2009 for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for _xgetbv() MSVC built-in. How far back do you need to go?
Thanks,
Paul
On Fri, Mar 29, 2024 at 04:06:17PM +0000, Amonson, Paul D wrote:
Yeah, I understand that much, but I want to know how portable the XGETBV
instruction is. Unless I can assume that all x86_64 systems and compilers
support that instruction, we might need an additional configure check and/or
CPUID check. It looks like MSVC has had support for the _xgetbv intrinsic for
quite a while, but I'm still researching the other cases.
I see google web references to the xgetbv instruction as far back as 2009
for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for
_xgetbv() MSVC built-in. How far back do you need to go?
Hm. It seems unlikely that a compiler would understand AVX512 intrinsics
and not XGETBV then. I guess the other question is whether CPUID
indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
If that's not safe, we might need to add another CPUID test.
It would probably be easy enough to add a couple of tests for this, but if
we don't have reason to believe there's any practical case to do so, I
don't know why we would. I'm curious what others think about this.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 29, 2024 at 10:59:40AM -0500, Nathan Bossart wrote:
It might be nice if we conditionally built pg_popcount_avx512.o in autoconf
builds, too, but AFAICT we still need to wrap most of that code with
macros, so I'm not sure it's worth the trouble. I'll take another look at
this...
If we assumed that TRY_POPCNT_FAST would be set and either
HAVE__GET_CPUID_COUNT or HAVE__CPUIDEX would be set whenever
USE_AVX512_POPCNT_WITH_RUNTIME_CHECK is set, we could probably remove the
surrounding macros and just compile pg_popcount_avx512.c conditionally
based on USE_AVX512_POPCNT_WITH_RUNTIME_CHECK. However, the surrounding
code seems to be pretty cautious about these assumptions (e.g., the CPUID
macros are checked before setting TRY_POPCNT_FAST), so this would stray
from the nearby precedent a bit.
A counterexample is the CRC32C code. AFAICT we assume the presence of
CPUID in that code (and #error otherwise). I imagine it's probably safe to
assume the compiler understands CPUID if it understands AVX512 intrinsics,
but that is still mostly a guess.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
I see google web references to the xgetbv instruction as far back as 2009
for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for
_xgetbv() MSVC built-in. How far back do you need to go?
Hm. It seems unlikely that a compiler would understand AVX512 intrinsics
and not XGETBV then. I guess the other question is whether CPUID
indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
If that's not safe, we might need to add another CPUID test.
Some quick googling says that (1) XGETBV predates AVX and (2) if you
are worried about old CPUs, you should check CPUID to verify whether
XGETBV exists before trying to use it. I did not look for the
bit-level details on how to do that.
regards, tom lane
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Friday, March 29, 2024 9:17 AM
To: Amonson, Paul D <paul.d.amonson@intel.com>
On Fri, Mar 29, 2024 at 04:06:17PM +0000, Amonson, Paul D wrote:
Yeah, I understand that much, but I want to know how portable the
XGETBV instruction is. Unless I can assume that all x86_64 systems
and compilers support that instruction, we might need an additional
configure check and/or CPUID check. It looks like MSVC has had
support for the _xgetbv intrinsic for quite a while, but I'm still researching the other cases.
I see google web references to the xgetbv instruction as far back as
2009 for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could
test for
_xgetbv() MSVC built-in. How far back do you need to go?
Hm. It seems unlikely that a compiler would understand AVX512 intrinsics and not XGETBV then. I guess the other question is whether CPUID indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
If that's not safe, we might need to add another CPUID test.
It would probably be easy enough to add a couple of tests for this, but if we don't have reason to believe there's any practical case to do so, I don't know why we would. I'm curious what others think about this.
This seems unlikely. Machines supporting AVX512 would also support XGETBV. The XGETBV instruction seems to be part of the XSAVE feature set as per the Intel developer manual [2]. XGETBV/XSAVE came first, and seems to be available in all x86 systems since 2011, starting with Intel's Sandy Bridge architecture and AMD's Opteron Gen4 [0].
AVX512 first came into a product in 2016 [1].
[0]: https://kb.vmware.com/s/article/1005764
[1]: https://en.wikipedia.org/wiki/AVX-512
[2]: https://cdrdv2-public.intel.com/774475/252046-sdm-change-document.pdf
- Akash Shankaran
On Fri, Mar 29, 2024 at 12:30:14PM -0400, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
I see google web references to the xgetbv instruction as far back as 2009
for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for
_xgetbv() MSVC built-in. How far back do you need to go?
Hm. It seems unlikely that a compiler would understand AVX512 intrinsics
and not XGETBV then. I guess the other question is whether CPUID
indicating AVX512 is enabled implies the availability of XGETBV on the CPU.
If that's not safe, we might need to add another CPUID test.
Some quick googling says that (1) XGETBV predates AVX and (2) if you
are worried about old CPUs, you should check CPUID to verify whether
XGETBV exists before trying to use it. I did not look for the
bit-level details on how to do that.
That extra CPUID check should translate to exactly one additional line of
code, so I think I'm inclined to just add it.
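For reference, a minimal sketch of what that extra check could look like, assuming GCC or clang on x86 with <cpuid.h>. The bit position (CPUID leaf 1, ECX bit 27, OSXSAVE) is taken from the Intel SDM; demo_xgetbv_usable() is a hypothetical name and not something in the patch.

```c
#include <stdbool.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define DEMO_HAVE_CPUID 1
#endif

/*
 * Returns true if CPUID.1:ECX.OSXSAVE (bit 27) is set, i.e. the OS has
 * enabled XSAVE and XGETBV can be executed without faulting.  On targets
 * where this sketch has no CPUID access it conservatively returns false.
 */
bool demo_xgetbv_usable(void)
{
#ifdef DEMO_HAVE_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid() returns 0 if the requested leaf is unsupported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;

    return (ecx & (1u << 27)) != 0;
#else
    return false;
#endif
}
```

In pg_popcount_avx512_available() this would presumably collapse into a single early "return false" ahead of the XGETBV call.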
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote:
We don't do MSVC via autoconf/Make. We used to have a special build
framework for MSVC which parsed Makefiles to produce "solution" files,
but it was removed as soon as Meson was mature enough to build. See
commit 1301c80b2167. If it builds with Meson, you're good.
The latest cfbot build for this seems to indicate that at least newer MSVC
knows AVX512 intrinsics without any special compiler flags [0], so maybe
what I had in v14 is good enough. A previous version of the patch set [1] had
the following lines:
+  if host_system == 'windows'
+    test_flags = ['/arch:AVX512']
+  endif
I'm not sure if this is needed for older MSVC or something else. IIRC I couldn't
find any other examples of this sort of thing in the meson scripts, either. Paul,
do you recall why you added this?
I asked internal folks here in-the-know and they suggested I add it. I personally am not a Windows guy. If it works without it and you are comfortable not including the lines, I am fine with it.
Thanks,
Paul
A counterexample is the CRC32C code. AFAICT we assume the presence of
CPUID in that code (and #error otherwise). I imagine it's probably safe to
assume the compiler understands CPUID if it understands AVX512 intrinsics,
but that is still mostly a guess.
If AVX-512 intrinsics are available, then yes you will have CPUID. CPUID is much older in the hardware/software timeline than AVX-512.
Thanks,
Paul
Okay, here is a slightly different approach that I've dubbed the "maximum
assumption" approach. In short, I wanted to see how much we could simplify
the patch by making all possibly-reasonable assumptions about the compiler
and CPU. These include:
* If the compiler understands AVX512 intrinsics, we assume that it also
knows about the required CPUID and XGETBV intrinsics, and we assume that
the conditions for TRY_POPCNT_FAST are true.
* If this is x86_64, CPUID will be supported by the CPU.
* If CPUID indicates AVX512 POPCNT support, the CPU also supports XGETBV.
Do any of these assumptions seem unreasonable or unlikely to be true for
all practical purposes? I don't mind adding back some or all of the
configure/runtime checks if they seem necessary. I guess the real test
will be the buildfarm...
Another big change in this version is that I've moved
pg_popcount_avx512_available() to its own file so that we only compile
pg_popcount_avx512() with the special compiler flags. Not doing so was just
an oversight in previous versions.
Finally, I've modified the build scripts so that the AVX512 popcount stuff
is conditionally built based on the configure checks for both
autoconf/meson.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v16-0001-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From d7864391c455ea77b8e555e40a358c59de1bd702 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v16 1/1] AVX512 popcount support
---
config/c-compiler.m4 | 34 +++++++++
configure | 100 +++++++++++++++++++++++++++
configure.ac | 14 ++++
meson.build | 35 ++++++++++
src/Makefile.global.in | 4 ++
src/include/pg_config.h.in | 3 +
src/include/port/pg_bitutils.h | 17 +++++
src/makefiles/meson.build | 3 +-
src/port/Makefile | 6 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 56 ++++++---------
src/port/pg_popcount_avx512.c | 40 +++++++++++
src/port/pg_popcount_avx512_choose.c | 61 ++++++++++++++++
13 files changed, 340 insertions(+), 39 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..7d13368b23 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..86c471f4ec 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,8 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+CFLAGS_POPCNT
+PG_POPCNT_OBJS
LIBOBJS
OPENSSL
ZSTD
@@ -17438,6 +17440,104 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+# Check for AVX512 popcount intrinsics
+#
+PG_POPCNT_OBJS=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..b1aebb8583 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2063,6 +2063,20 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+# Check for AVX512 popcount intrinsics
+#
+PG_POPCNT_OBJS=""
+PGAC_AVX512_POPCNT_INTRINSICS([])
+if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+fi
+if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+fi
+AC_SUBST(PG_POPCNT_OBJS)
+AC_SUBST(CFLAGS_POPCNT)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..fbd2aa3dbd 100644
--- a/meson.build
+++ b/meson.build
@@ -1996,6 +1996,41 @@ int main(void)
endif
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..dec467b7dd 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,6 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +759,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..c271c06b74 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -680,6 +680,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..fc8d34ad25 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,12 +298,29 @@ pg_ceil_log2_64(uint64 num)
#endif
#endif
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files.
+ *
+ * NB: We assume that the availability of AVX512 intrinsics implies
+ * TRY_POPCNT_FAST.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#ifdef TRY_POPCNT_FAST
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5a592ddaee 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -99,6 +99,7 @@ pgxs_kv = {
'PERMIT_DECLARATION_AFTER_STATEMENT':
' '.join(cflags_no_decl_after_statement),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_CRC': ' '.join(cflags_crc),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
@@ -177,7 +178,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..7e154ac379 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,11 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..7b93233428 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt}
+pgport_sources_cflags = {'crc': [], 'popcnt': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..ada3e777f7 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,6 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -142,20 +141,18 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
{
if (pg_popcount_available())
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount = pg_popcount_avx512;
+#endif
}
else
{
@@ -163,45 +160,32 @@ pg_popcount32_choose(uint32 word)
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
}
+}
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static int
+pg_popcount32_choose(uint32 word)
+{
+ choose_popcount_functions();
return pg_popcount32(word);
}
static int
pg_popcount64_choose(uint64 word)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount64(word);
}
static uint64
pg_popcount_choose(const char *buf, int bytes)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount(buf, bytes);
}
@@ -243,7 +227,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..c39db13f85
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..62ebc515ce
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2019-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef HAVE__GET_CPUID
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE__CPUID
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * Return true if CPUID indicates that the AVX512 POPCNT instruction is
+ * available.
+ *
+ * NB: We assume the availability of AVX512 intrinsics implies availability of
+ * the required CPUID and XGETBV intrinsics.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+
+ if ((exx[2] & (1 << 14)) != 0) /* avx512vpopcntdq */
+ {
+ /* Check that the OS has enabled support for the ZMM registers. */
+#ifdef _MSC_VER
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ uint64 xcr = 0;
+ uint32 high;
+ uint32 low;
+
+__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
+ return (low & 0xe0) != 0;
+#endif
+ }
+
+ return false;
+}
--
2.25.1
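The shape of pg_popcount_avx512() in the patch above (consume whole 64-byte blocks, then hand the leftover bytes to a scalar routine) can be tried without AVX512 hardware. The sketch below is a portable stand-in under that assumption: the inner lane loop replaces _mm512_popcnt_epi64/_mm512_add_epi64, demo_popcount_tail() plays the role of pg_popcount_fast(), and all demo_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLOCK 64           /* bytes per block, same as sizeof(__m512i) */

/* Scalar popcount of one 64-bit lane. */
static uint64_t demo_popcount64(uint64_t word)
{
    uint64_t n = 0;

    while (word != 0)
    {
        word &= word - 1;       /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Byte-at-a-time fallback for whatever the block loop leaves behind. */
uint64_t demo_popcount_tail(const char *buf, size_t bytes)
{
    uint64_t n = 0;

    while (bytes-- > 0)
        n += demo_popcount64((unsigned char) *buf++);
    return n;
}

/* Whole blocks first, then the tail, mirroring the loop in the patch. */
uint64_t demo_popcount_blocked(const char *buf, size_t bytes)
{
    uint64_t accum = 0;

    for (; bytes >= DEMO_BLOCK; bytes -= DEMO_BLOCK)
    {
        uint64_t lanes[DEMO_BLOCK / sizeof(uint64_t)];
        size_t   i;

        /* memcpy stands in for the unaligned 512-bit load */
        memcpy(lanes, buf, DEMO_BLOCK);
        for (i = 0; i < DEMO_BLOCK / sizeof(uint64_t); i++)
            accum += demo_popcount64(lanes[i]);

        buf += DEMO_BLOCK;
    }

    return accum + demo_popcount_tail(buf, bytes);
}
```

One deliberate difference from the patch: the sketch takes a size_t length, so the comparison against the block size never mixes signed and unsigned operands.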
On Fri, Mar 29, 2024 at 02:13:12PM -0500, Nathan Bossart wrote:
* If the compiler understands AVX512 intrinsics, we assume that it also
knows about the required CPUID and XGETBV intrinsics, and we assume that
the conditions for TRY_POPCNT_FAST are true.
Bleh, cfbot's 32-bit build is unhappy with this [0]. It looks like it's
trying to build the AVX512 stuff, but TRY_POPCNT_FAST isn't set.
[19:39:11.306] ../src/port/pg_popcount_avx512.c:39:18: warning: implicit declaration of function ‘pg_popcount_fast’; did you mean ‘pg_popcount’? [-Wimplicit-function-declaration]
[19:39:11.306] 39 | return popcnt + pg_popcount_fast(buf, bytes);
[19:39:11.306] | ^~~~~~~~~~~~~~~~
[19:39:11.306] | pg_popcount
There's also a complaint about the inline assembly:
[19:39:11.443] ../src/port/pg_popcount_avx512_choose.c:55:1: error: inconsistent operand constraints in an ‘asm’
[19:39:11.443] 55 | __asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
[19:39:11.443] | ^~~~~~~
I'm looking into this...
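A guess at the cause of the asm complaint, offered as a sketch rather than a diagnosis: xcr is declared uint64 but bound to the "c" (ECX) constraint, and a 64-bit operand cannot live in a single 32-bit register on i386. Declaring the selector as a 32-bit value sidesteps that. The demo_* names are hypothetical, the OSXSAVE guard is added here only so the sketch never executes XGETBV where it would fault, and the "== 0xe0" test is a stricter variant of the "!= 0" test in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define DEMO_X86_GNUC 1
#endif

/*
 * Read XCR0 via XGETBV.  Returns 0 if XGETBV cannot be used, including
 * on targets where this sketch has no inline-asm path.
 */
uint64_t demo_read_xcr0(void)
{
#ifdef DEMO_X86_GNUC
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    uint32_t xcr = 0;           /* 32-bit, so it fits ECX on i386 as well */
    uint32_t low = 0;
    uint32_t high = 0;

    /* XGETBV is only safe once the OS has set CPUID.1:ECX.OSXSAVE. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) ||
        (ecx & (1u << 27)) == 0)
        return 0;

    __asm__ __volatile__("xgetbv" : "=a"(low), "=d"(high) : "c"(xcr));
    return ((uint64_t) high << 32) | low;
#else
    return 0;
#endif
}

/* XCR0 bits 5..7 cover the opmask and both halves of the ZMM state. */
bool demo_zmm_enabled(void)
{
    return (demo_read_xcr0() & 0xe0) == 0xe0;
}
```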
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuidex(exx, 7, 0);
Is there any reason we can't use __get_cpuid() and __cpuid() here, given
the sub-leaf is 0?
[0]: https://cirrus-ci.com/task/5475113447981056
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 29, 2024 at 03:08:28PM -0500, Nathan Bossart wrote:
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuidex(exx, 7, 0);
Is there any reason we can't use __get_cpuid() and __cpuid() here, given
the sub-leaf is 0?
The answer to this seems to be "no." After additional research,
__get_cpuid_count/__cpuidex seem new enough that we probably want configure
checks for them, so I'll add those back in the next version of the patch.
Apologies for the stream of consciousness today...
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Here's a v17 of the patch. This one has configure checks for everything
(i.e., CPUID, XGETBV, and the AVX512 intrinsics) as well as the relevant
runtime checks (i.e., we call CPUID to check for XGETBV and AVX512 POPCNT
availability, and we call XGETBV to ensure the ZMM registers are enabled).
I restricted the AVX512 configure checks to x86_64 since we know we won't
have TRY_POPCNT_FAST on 32-bit, and we rely on pg_popcount_fast() as our
fallback implementation in the AVX512 version. Finally, I removed the
inline assembly in favor of using the _xgetbv() intrinsic on all systems.
It looks like that's available on gcc, clang, and msvc, although it
sometimes requires -mxsave, so that's applied to
pg_popcount_avx512_choose.o as needed. I doubt this will lead to SIGILLs,
but it's admittedly a little shaky.
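To make the order of the runtime checks concrete, here is a minimal sketch that takes the raw register values as arguments instead of executing CPUID or XGETBV itself. The function and macro names are hypothetical. The bit positions mirror the ones used in pg_popcount_avx512_choose.c in this patch, and the three XCR0 bits in the mask are the opmask, ZMM_Hi256, and Hi16_ZMM state components.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as used by the patch; macro names are illustrative only. */
#define CPUID7_ECX_AVX512VPOPCNTDQ	(1u << 14)	/* leaf 7, sub-leaf 0 */
#define CPUID1_ECX_XSAVE			(1u << 26)	/* leaf 1 */
#define XCR0_ZMM_BITS				0xe0u		/* opmask, ZMM_Hi256, Hi16_ZMM */

/*
 * Hypothetical pure-logic version of the runtime check: the caller supplies
 * ECX from CPUID leaf 7 and leaf 1, and the value of XCR0.
 */
static bool
avx512_popcnt_usable(uint32_t leaf7_ecx, uint32_t leaf1_ecx, uint64_t xcr0)
{
	/* does CPUID say there's support for AVX512 POPCNT? */
	if ((leaf7_ecx & CPUID7_ECX_AVX512VPOPCNTDQ) == 0)
		return false;

	/* does CPUID say there's support for XGETBV? */
	if ((leaf1_ecx & CPUID1_ECX_XSAVE) == 0)
		return false;

	/* does XCR0 say the ZMM registers are enabled? (same test as the patch) */
	return (xcr0 & XCR0_ZMM_BITS) != 0;
}

/* A few cases showing how each check can veto the AVX512 path. */
static void
avx512_popcnt_usable_examples(void)
{
	assert(avx512_popcnt_usable(1u << 14, 1u << 26, 0xe7));
	assert(!avx512_popcnt_usable(0, 1u << 26, 0xe7));		/* no instruction */
	assert(!avx512_popcnt_usable(1u << 14, 0, 0xe7));		/* no XSAVE */
	assert(!avx512_popcnt_usable(1u << 14, 1u << 26, 0x07));	/* ZMM state off */
}
```

Keeping the decision separate from the instructions that fetch the values is only a convenience for illustration. The patch does both in one function.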
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v17-0001-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From a26b209927cc6b266b33f74fd734772eff87bff9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v17 1/1] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 15 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 56 +++---
src/port/pg_popcount_avx512.c | 49 ++++++
src/port/pg_popcount_avx512_choose.c | 71 ++++++++
13 files changed, 638 insertions(+), 39 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..2bbd81dfb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..96be29c22b 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1a92c56bcd 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,21 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files. Note that we only build this when TRY_POPCNT_FAST is set so that we
+ * can fall back to pg_popcount_fast() as needed.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..c08d3c056f 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,6 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -142,20 +141,18 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
{
if (pg_popcount_available())
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount = pg_popcount_avx512;
+#endif
}
else
{
@@ -163,45 +160,32 @@ pg_popcount32_choose(uint32 word)
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
}
+}
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static int
+pg_popcount32_choose(uint32 word)
+{
+ choose_popcount_functions();
return pg_popcount32(word);
}
static int
pg_popcount64_choose(uint64 word)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount64(word);
}
static uint64
pg_popcount_choose(const char *buf, int bytes)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount(buf, bytes);
}
@@ -243,7 +227,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..f86558d1ee
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..9e81cd33ad
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
--
2.25.1
I used John Naylor's test_popcount module [0] to put together the attached
graphs (note that the "small arrays" one is semi-logarithmic). For both
graphs, the X-axis is the number of 64-bit words in the array, and Y-axis
is the amount of time in milliseconds to run pg_popcount() on it 100,000
times (along with a bit of overhead). This test didn't show any
regressions with a relatively small number of bytes, and it showed the
expected improvements with many bytes.
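For anyone who wants to reproduce the shape of this measurement outside the server, a rough sketch of such a timing loop follows. It is not the test_popcount module: popcount_buf() is a hypothetical stand-in for pg_popcount() built on the GCC/clang __builtin_popcountll() builtin, and time_popcount_ms() is likewise an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for pg_popcount(): 64-bit words, then a byte tail. */
static uint64_t
popcount_buf(const char *buf, size_t bytes)
{
	uint64_t	total = 0;

	while (bytes >= sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* safe for unaligned input */
		total += (uint64_t) __builtin_popcountll(word);
		buf += sizeof(word);
		bytes -= sizeof(word);
	}
	while (bytes-- > 0)
		total += (uint64_t) __builtin_popcount((unsigned char) *buf++);

	return total;
}

/*
 * Run popcount_buf() 'iterations' times and return the elapsed CPU time in
 * milliseconds.  The summed result is stored in *sink so the calls have an
 * observable effect; an aggressive optimizer may still simplify the loop,
 * which a real harness would need to guard against.
 */
static double
time_popcount_ms(const char *buf, size_t bytes, int iterations, uint64_t *sink)
{
	clock_t		start;
	uint64_t	acc = 0;

	assert(iterations > 0);
	start = clock();
	for (int i = 0; i < iterations; i++)
		acc += popcount_buf(buf, bytes);
	*sink = acc;

	return (double) (clock() - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

Calling time_popcount_ms() with buffers of 8 * N bytes for increasing N, and 100,000 iterations each, gives one data point per array size in the same units as the graphs.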
There isn't a ton of use of pg_popcount() in Postgres, but I do see a few
places that call it with enough bytes for the AVX512 optimization to take
effect. There may be more callers in the future, though, and it seems
generally useful to have some of the foundational work for using AVX512
instructions in place. My current plan is to add some new tests for
pg_popcount() with many bytes, and then I'll give it a few more days for
any additional feedback before committing.
[0]: /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sat, Mar 30, 2024 at 03:03:29PM -0500, Nathan Bossart wrote:
My current plan is to add some new tests for
pg_popcount() with many bytes, and then I'll give it a few more days for
any additional feedback before committing.
Here is a v18 with a couple of new tests. Otherwise, it is the same as
v17.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v18-0001-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From 86a571721ed3ed4ca7e04134b9541fc3ac43b9f1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v18 1/1] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 15 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 56 +++---
src/port/pg_popcount_avx512.c | 49 ++++++
src/port/pg_popcount_avx512_choose.c | 71 ++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 666 insertions(+), 39 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..2bbd81dfb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..96be29c22b 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1a92c56bcd 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,21 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files. Note that we only build this when TRY_POPCNT_FAST is set so that we
+ * can fall back to pg_popcount_fast() as needed.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..c08d3c056f 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,6 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -142,20 +141,18 @@ pg_popcount_available(void)
return (exx[2] & (1 << 23)) != 0; /* POPCNT */
}
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
{
if (pg_popcount_available())
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount = pg_popcount_avx512;
+#endif
}
else
{
@@ -163,45 +160,32 @@ pg_popcount32_choose(uint32 word)
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
}
+}
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect whether we can use the asm implementations, and replace
+ * the function pointers so that subsequent calls are routed directly to
+ * the chosen implementation.
+ */
+static int
+pg_popcount32_choose(uint32 word)
+{
+ choose_popcount_functions();
return pg_popcount32(word);
}
static int
pg_popcount64_choose(uint64 word)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount64(word);
}
static uint64
pg_popcount_choose(const char *buf, int bytes)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount(buf, bytes);
}
@@ -243,7 +227,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..f86558d1ee
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..9e81cd33ad
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
On 2024-Mar-31, Nathan Bossart wrote:
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
Hmm, doesn't this arrangement cause an extra function call to
pg_popcount_fast to be used here? Given the level of micro-optimization
being used by this code, I would have thought that you'd have tried to
avoid that. (At least, maybe avoid the call if bytes is 0, no?)
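A minimal sketch of the guard suggested above, written without intrinsics so that it compiles anywhere. CHUNK_BYTES stands in for sizeof(__m512i), and popcount_tail() and popcount_chunked() are hypothetical names, not functions from the patch. Whether the extra branch is cheaper than the call it avoids would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_BYTES 64			/* stand-in for sizeof(__m512i) */

/* Hypothetical byte-at-a-time fallback, standing in for pg_popcount_fast(). */
static uint64_t
popcount_tail(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		for (; b != 0; b &= b - 1)	/* clear the lowest set bit */
			popcnt++;
	}
	return popcnt;
}

/* Hypothetical wide loop: eight 64-bit words per iteration, no intrinsics. */
static uint64_t
popcount_chunked(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	assert(bytes >= 0);

	for (; bytes >= CHUNK_BYTES; bytes -= CHUNK_BYTES)
	{
		uint64_t	words[CHUNK_BYTES / sizeof(uint64_t)];

		memcpy(words, buf, CHUNK_BYTES);	/* safe for unaligned input */
		for (size_t i = 0; i < CHUNK_BYTES / sizeof(uint64_t); i++)
		{
			uint64_t	w = words[i];

			for (; w != 0; w &= w - 1)
				popcnt++;
		}
		buf += CHUNK_BYTES;
	}

	/* Only pay for the fallback call when a tail actually remains. */
	if (bytes > 0)
		popcnt += popcount_tail(buf, bytes);

	return popcnt;
}
```

With this shape, an input whose length is an exact multiple of 64 bytes never reaches the fallback, while shorter or ragged inputs take one extra branch before it.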
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Mechanization was outlawed on pain of tickling to death"
(Ijon Tichy in Voyages, Stanislaw Lem)
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
On 2024-Mar-31, Nathan Bossart wrote:
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
Hmm, doesn't this arrangement cause an extra function call to
pg_popcount_fast to be used here? Given the level of micro-optimization
being used by this code, I would have thought that you'd have tried to
avoid that. (At least, maybe avoid the call if bytes is 0, no?)
Yes, it does. I did another benchmark on very small arrays and can see the
overhead. This is the time in milliseconds to run pg_popcount() on an
array 1 billion times:
size (bytes)   HEAD       AVX512-POPCNT
1              1707.685   3480.424
2              1926.694   4606.182
4              3210.412   5284.506
8              1920.703   3640.968
16             2936.91    4045.586
32             3627.956   5538.418
64             5347.213   3748.212
I suspect that anything below 64 bytes will see this regression, as that is
the earliest point where there are enough bytes for ZMM registers.
We could avoid the call if there are no remaining bytes, but the numbers
for the smallest arrays probably wouldn't improve much, and that might
actually add some overhead due to branching. The other option to avoid
this overhead is to put most of pg_bitutils.c into its header file so that
we can inline the call.
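
To make the first option concrete, here is a minimal portable sketch of guarding the tail call. It is not the patch itself: popcount_blocks() and popcount_tail() are hypothetical names, 8-byte words stand in for 64-byte ZMM loads, and __builtin_popcount()/__builtin_popcountll() assume GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical byte-at-a-time routine, standing in for pg_popcount_fast(). */
static uint64_t
popcount_tail(const char *buf, int bytes)
{
    uint64_t popcnt = 0;

    while (bytes-- > 0)
        popcnt += (uint64_t) __builtin_popcount((unsigned char) *buf++);
    return popcnt;
}

/* Hypothetical wide path: 8-byte words stand in for 64-byte ZMM loads. */
static uint64_t
popcount_blocks(const char *buf, int bytes)
{
    uint64_t popcnt = 0;

    assert(bytes >= 0);
    for (; bytes >= (int) sizeof(uint64_t); bytes -= (int) sizeof(uint64_t))
    {
        uint64_t word;

        memcpy(&word, buf, sizeof(word));   /* unaligned-safe load */
        popcnt += (uint64_t) __builtin_popcountll(word);
        buf += sizeof(word);
    }

    /* Only pay for the extra call when some bytes are actually left over. */
    if (bytes > 0)
        popcnt += popcount_tail(buf, bytes);
    return popcnt;
}
```

The branch only helps when the length is an exact multiple of the block size; for inputs shorter than one block the tail call still happens, which is why the smallest sizes in the table above would probably not improve.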
Reviewing the current callers of pg_popcount(), IIUC the only ones that are
passing very small arrays are the bit_count() implementations and a call in
the syslogger for a single byte. I don't know how much to worry about the
overhead for bit_count() since there's presumably a bunch of other
overhead, and the syslogger one could probably be fixed via an inline
function that pulled the value from pg_number_of_ones (which would probably
be an improvement over the status quo, anyway). But this is all to save a
couple of nanoseconds...
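
For the single-byte case, a table lookup along these lines is what I have in mind. This is only a sketch: number_of_ones, init_number_of_ones() and popcount_byte() are hypothetical stand-ins, and the table is filled at run time here purely to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for pg_number_of_ones: bit counts per byte value. */
static uint8_t number_of_ones[256];

static void
init_number_of_ones(void)
{
    /* Entry 0 is already zero; each later entry reuses the entry for i / 2. */
    for (int i = 1; i < 256; i++)
        number_of_ones[i] = (uint8_t) (number_of_ones[i >> 1] + (i & 1));
}

/* Single-byte popcount: one table load, no call through a function pointer. */
static inline int
popcount_byte(uint8_t b)
{
    assert(number_of_ones[1] == 1);     /* table must be initialized first */
    return number_of_ones[b];
}
```

A check such as "exactly one flag bit is set" then becomes popcount_byte(flags) == 1.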
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
On 2024-Mar-31, Nathan Bossart wrote:
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
Hmm, doesn't this arrangement cause an extra function call to
pg_popcount_fast to be used here? Given the level of micro-optimization
being used by this code, I would have thought that you'd have tried to
avoid that. (At least, maybe avoid the call if bytes is 0, no?)
Yes, it does. I did another benchmark on very small arrays and can see the
overhead. This is the time in milliseconds to run pg_popcount() on an
array 1 billion times:
size (bytes)      HEAD  AVX512-POPCNT
           1  1707.685       3480.424
           2  1926.694       4606.182
           4  3210.412       5284.506
           8  1920.703       3640.968
          16  2936.91        4045.586
          32  3627.956       5538.418
          64  5347.213       3748.212
I suspect that anything below 64 bytes will see this regression, as that is
the earliest point where there are enough bytes for ZMM registers.
What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked out portions of a load instruction
will not generate an exception. To allow byte level granularity
masking, -mavx512bw is needed. Based on wikipedia this will only
disable this fast path on Knights Mill (Xeon Phi), in all other cases
VPOPCNTQ implies availability of BW.
Attached is an example of what I mean. I did not have a machine to
test it with, but the code generated looks sane. I added the clang
pragma because it insisted on unrolling otherwise and based on how the
instruction dependencies look that is probably not too helpful even
for large cases (needs to be tested). The configure check and compile
flags of course need to be amended for BW.
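
The mask arithmetic can be checked without AVX-512 hardware. Below is a small sketch of it; tail_mask() is a hypothetical helper name, not something in the attached patch. One detail worth noting is that a shift count of 64 is undefined in C, so the zero-length case is handled separately here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask with the low `bytes` bits set, 0 <= bytes <= 64. */
static uint64_t
tail_mask(int bytes)
{
    assert(bytes >= 0 && bytes <= 64);

    /* Shifting a 64-bit value by 64 is undefined, so special-case zero. */
    if (bytes == 0)
        return 0;
    return ~UINT64_C(0) >> (64 - bytes);
}
```

With AVX512BW available, the result would be used as the __mmask64 argument of _mm512_maskz_loadu_epi8(), so that only the first `bytes` bytes of the final 64-byte chunk are loaded and the rest of the register is zeroed.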
Regards,
Ants Aasma
Attachments:
avx512-popcnt-masked-tail.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..7fb2ada16c9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,27 @@
uint64
pg_popcount_avx512(const char *buf, int bytes)
{
- uint64 popcnt;
+ __m512i val, cnt;
+ __mmask64 remaining_mask;
__m512i accum = _mm512_setzero_si512();
- for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ #pragma clang loop unroll(disable)
+ for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i))
{
- const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
+ val = _mm512_loadu_si512((const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
accum = _mm512_add_epi64(accum, cnt);
buf += sizeof(__m512i);
}
- popcnt = _mm512_reduce_add_epi64(accum);
- return popcnt + pg_popcount_fast(buf, bytes);
+ remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes);
+ val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+
+ return _mm512_reduce_add_epi64(accum);
}
#endif /* TRY_POPCNT_FAST */
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked out portions of a load instruction
will not generate an exception. To allow byte level granularity
masking, -mavx512bw is needed. Based on wikipedia this will only
disable this fast path on Knights Mill (Xeon Phi), in all other cases
VPOPCNTQ implies availability of BW.
Sounds promising. IMHO we should really be sure that these kinds of loads
won't generate segfaults and the like due to the masked-out portions. I
searched around a little bit but haven't found anything that seemed
definitive.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked out portions of a load instruction
will not generate an exception. To allow byte level granularity
masking, -mavx512bw is needed. Based on wikipedia this will only
disable this fast path on Knights Mill (Xeon Phi), in all other cases
VPOPCNTQ implies availability of BW.
Sounds promising. IMHO we should really be sure that these kinds of loads
won't generate segfaults and the like due to the masked-out portions. I
searched around a little bit but haven't found anything that seemed
definitive.
Interestingly the Intel software developer manual is not exactly
crystal clear on how memory faults with masks work, but volume 2A
chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb
that supports memory fault suppression on page fault.
Regards,
Ants Aasma
[1]: https://cdrdv2-public.intel.com/819712/253666-sdm-vol-2a.pdf
Here is a v19 of the patch set. I moved out the refactoring of the
function pointer selection code to 0001. I think this is a good change
independent of $SUBJECT, and I plan to commit this soon. In 0002, I
changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones
instead. This is standard practice elsewhere where the popcount functions
are unlikely to win. I'll probably commit this one soon, too, as it's even
more trivial than 0001.
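
For readers unfamiliar with the pattern 0001 touches, here is a stripped-down sketch of the idea. All names are hypothetical and the CPU probing is elided; the real code selects between several implementations.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*popcount_fn) (const char *buf, int bytes);

static uint64_t popcount_choose(const char *buf, int bytes);

/* Starts out pointing at the chooser; replaced on the first call. */
static popcount_fn popcount_impl = popcount_choose;

/* Hypothetical portable implementation (clears the lowest set bit each step). */
static uint64_t
popcount_portable(const char *buf, int bytes)
{
    uint64_t popcnt = 0;

    while (bytes-- > 0)
    {
        unsigned char b = (unsigned char) *buf++;

        for (; b != 0; b &= (unsigned char) (b - 1))
            popcnt++;
    }
    return popcnt;
}

/* The one place where the selection happens. */
static void
choose_popcount_function(void)
{
    /* A real implementation would probe CPU features here. */
    popcount_impl = popcount_portable;
}

/* First-call stub: pick an implementation, then forward the call to it. */
static uint64_t
popcount_choose(const char *buf, int bytes)
{
    choose_popcount_function();
    assert(popcount_impl != popcount_choose);
    return popcount_impl(buf, bytes);
}
```

After the first call through popcount_impl, later calls go straight to the selected implementation, and adding another candidate only means touching choose_popcount_function().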
0003 is the AVX512 POPCNT patch. Besides refactoring out 0001, there are
no changes from v18. 0004 is an early proof-of-concept for using AVX512
for the visibility map code. The code is missing comments, and I haven't
performed any benchmarking yet, but I figured I'd post it because it
demonstrates how it's possible to build upon 0003 in other areas.
AFAICT the main open question is the function call overhead in 0003 that
Alvaro brought up earlier. After 0002 is committed, I believe the only
in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm
not sure it's worth expending too much energy to make sure there are
absolutely no regressions there. However, I'm happy to do so if folks feel
that it is necessary, and I'd be grateful for thoughts on how to proceed on
this one.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v19-0001-refactor-popcount-function-choosing.patch (text/x-diff; charset=us-ascii)
From cedad23b7b35e77fde164b1d577c37fb07a578c6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 1 Apr 2024 16:37:53 -0500
Subject: [PATCH v19 1/4] refactor popcount function choosing
---
src/port/pg_bitutils.c | 37 +++++++++----------------------------
1 file changed, 9 insertions(+), 28 deletions(-)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..28312f3dd9 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -148,8 +148,8 @@ pg_popcount_available(void)
* the function pointers so that subsequent calls are routed directly to
* the chosen implementation.
*/
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
{
if (pg_popcount_available())
{
@@ -163,45 +163,26 @@ pg_popcount32_choose(uint32 word)
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
}
+}
+static int
+pg_popcount32_choose(uint32 word)
+{
+ choose_popcount_functions();
return pg_popcount32(word);
}
static int
pg_popcount64_choose(uint64 word)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount64(word);
}
static uint64
pg_popcount_choose(const char *buf, int bytes)
{
- if (pg_popcount_available())
- {
- pg_popcount32 = pg_popcount32_fast;
- pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
- }
- else
- {
- pg_popcount32 = pg_popcount32_slow;
- pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
- }
-
+ choose_popcount_functions();
return pg_popcount(buf, bytes);
}
--
2.25.1
v19-0002-use-pg_number_of_ones-instead-of-pg_popcount-for.patch (text/x-diff; charset=us-ascii)
From 038b74045b006c5d8a5470364f2041370ec0b083 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 1 Apr 2024 16:47:22 -0500
Subject: [PATCH v19 2/4] use pg_number_of_ones instead of pg_popcount for
single byte
---
src/backend/postmaster/syslogger.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 08efe74cc9..437947dbb9 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -898,7 +898,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer)
if (p.nuls[0] == '\0' && p.nuls[1] == '\0' &&
p.len > 0 && p.len <= PIPE_MAX_PAYLOAD &&
p.pid != 0 &&
- pg_popcount((char *) &dest_flags, 1) == 1)
+ pg_number_of_ones[dest_flags] == 1)
{
List *buffer_list;
ListCell *cell;
--
2.25.1
v19-0003-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From 73ee8d6018b047856e63ad075641a0dcfe889417 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v19 3/4] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 15 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 7 +-
src/port/pg_popcount_avx512.c | 49 ++++++
src/port/pg_popcount_avx512_choose.c | 71 ++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 651 insertions(+), 5 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..2bbd81dfb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..96be29c22b 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1a92c56bcd 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,21 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files. Note that we only build this when TRY_POPCNT_FAST is set so that we
+ * can fall back to pg_popcount_fast() as needed.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..177509518f 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,6 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -156,6 +155,10 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount = pg_popcount_avx512;
+#endif
}
else
{
@@ -224,7 +227,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..f86558d1ee
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..9e81cd33ad
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) == 0xe0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
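The runtime check in pg_popcount_avx512_choose.c can be tried outside the tree with a sketch like the one below. This is not the patch's code: the sketch_* names are hypothetical, it assumes GCC or clang on x86 (for <cpuid.h> and inline asm), and it tests OSXSAVE (CPUID.1:ECX bit 27) and requires XCR0 bits 5 through 7 to all be set, which is my reading of the Intel SDM rather than something taken from the thread. The bit tests are split into pure helpers so they can be checked without AVX512 hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

/* CPUID leaf 7, subleaf 0: ECX bit 14 reports AVX512VPOPCNTDQ. */
static bool
sketch_has_vpopcntdq(uint32_t leaf7_ecx)
{
	return (leaf7_ecx & (UINT32_C(1) << 14)) != 0;
}

/* CPUID leaf 1: ECX bit 27 (OSXSAVE) means the OS has enabled XGETBV. */
static bool
sketch_has_osxsave(uint32_t leaf1_ecx)
{
	return (leaf1_ecx & (UINT32_C(1) << 27)) != 0;
}

/* XCR0 bits 5-7 (opmask, ZMM_Hi256, Hi16_ZMM) must all be set. */
static bool
sketch_zmm_enabled(uint64_t xcr0)
{
	return (xcr0 & 0xe0) == 0xe0;
}

/* Hypothetical stand-in for pg_popcount_avx512_available(). */
static bool
sketch_avx512_popcnt_available(void)
{
#ifdef SKETCH_HAVE_CPUID
	unsigned int eax, ebx, ecx, edx;
	uint32_t	lo, hi;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
		!sketch_has_vpopcntdq(ecx))
		return false;
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) ||
		!sketch_has_osxsave(ecx))
		return false;

	/* XGETBV with ECX = 0 reads XCR0; only executed after the OSXSAVE check. */
	__asm__ __volatile__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	return sketch_zmm_enabled(((uint64_t) hi << 32) | lo);
#else
	return false;				/* no CPUID support: never pick AVX512 */
#endif
}
```

On a machine without AVX512VPOPCNTDQ the function returns false before XGETBV is ever executed, so it is safe to call unconditionally at startup.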
v19-0004-optimize-visibilitymap_count-with-AVX512.patch
From 8ea529dd315723ca3e8ad4243853148da23f1202 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sun, 31 Mar 2024 22:22:15 -0500
Subject: [PATCH v19 4/4] optimize visibilitymap_count() with AVX512
---
src/backend/access/heap/visibilitymap.c | 25 ++----
src/include/port/pg_bitutils.h | 6 +-
src/port/pg_bitutils.c | 113 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 21 +++++
4 files changed, 144 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 1a92c56bcd..16145c746e 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,9 +303,11 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked) (const char *buf, int bytes, bits8 mask);
-/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+/* Exported for use in the AVX512 implementation. */
extern uint64 pg_popcount_fast(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX512 popcount instruction on some systems.
@@ -317,6 +319,7 @@ extern uint64 pg_popcount_fast(const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -324,6 +327,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_masked(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 177509518f..902ecdebbf 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,18 +106,21 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -155,9 +158,13 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+ pg_popcount_masked = pg_popcount_masked_fast;
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount = pg_popcount_avx512;
+ pg_popcount_masked = pg_popcount_masked_avx512;
+ }
#endif
}
else
@@ -165,6 +172,7 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
+ pg_popcount_masked = pg_popcount_masked_slow;
}
}
@@ -189,6 +197,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -269,6 +284,52 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & maskv];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -368,6 +429,52 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & maskv];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -399,4 +506,10 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee..8965a8d530 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -46,4 +46,25 @@ pg_popcount_avx512(const char *buf, int bytes)
return popcnt + pg_popcount_fast(buf, bytes);
}
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i vmasked = _mm512_and_si512(val, maskv);
+ const __m512i cnt = _mm512_popcnt_epi64(vmasked);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_masked_fast(buf, bytes, mask);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
On Tue, Apr 02, 2024 at 01:09:57AM +0300, Ants Aasma wrote:
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked out portions of a load instruction
will not generate an exception. To allow byte level granularity
masking, -mavx512bw is needed. Based on wikipedia this will only
disable this fast path on Knights Mill (Xeon Phi), in all other cases
VPOPCNTQ implies availability of BW.
Sounds promising. IMHO we should really be sure that these kinds of loads
won't generate segfaults and the like due to the masked-out portions. I
searched around a little bit but haven't found anything that seemed
definitive.
Interestingly the Intel software developer manual is not exactly
crystal clear on how memory faults with masks work, but volume 2A
chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb
that supports memory fault suppression on page fault.
Perhaps Paul or Akash could chime in here...
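To make the tail-masking idea concrete, here is a scalar sketch of how the per-byte mask for a partial final vector could be computed. It is only an illustration: the sketch_* names are hypothetical, __builtin_popcountll assumes GCC or clang, and the masked load is simulated in plain C. With AVX512BW the same mask would presumably be passed to _mm512_maskz_loadu_epi8(), and whether the hardware really suppresses faults for the masked-out bytes is exactly the open question above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build a 64-bit lane mask with the low "bytes" bits set (0 <= bytes <= 64).
 * Bit i set means "byte i of the 64-byte vector is loaded".
 */
static uint64_t
sketch_tail_mask(int bytes)
{
	assert(bytes >= 0 && bytes <= 64);
	if (bytes == 0)
		return 0;				/* avoid an undefined shift by 64 */
	return ~UINT64_C(0) >> (64 - bytes);
}

/*
 * Scalar stand-in for a zero-masking 64-byte load: bytes whose mask bit is
 * clear are never read from src and come back as zero.
 */
static void
sketch_masked_load(unsigned char dst[64], const unsigned char *src,
				   uint64_t mask)
{
	for (int i = 0; i < 64; i++)
		dst[i] = ((mask >> i) & 1) ? src[i] : 0;
}

/* Popcount of a tail of up to 64 bytes, handled as one masked vector. */
static uint64_t
sketch_popcount_tail(const unsigned char *buf, int bytes)
{
	unsigned char lanes[64];
	uint64_t	popcnt = 0;

	sketch_masked_load(lanes, buf, sketch_tail_mask(bytes));
	for (int i = 0; i < 64; i += 8)
	{
		uint64_t	word;

		memcpy(&word, lanes + i, sizeof(word));
		popcnt += (uint64_t) __builtin_popcountll(word);
	}
	return popcnt;
}
```

Because the masked-out lanes are zero, they contribute nothing to the count, so the tail can go through the same popcount-and-accumulate path as the full vectors.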
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Apr 01, 2024 at 05:11:17PM -0500, Nathan Bossart wrote:
Here is a v19 of the patch set. I moved out the refactoring of the
function pointer selection code to 0001. I think this is a good change
independent of $SUBJECT, and I plan to commit this soon. In 0002, I
changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones
instead. This is standard practice elsewhere where the popcount functions
are unlikely to win. I'll probably commit this one soon, too, as it's even
more trivial than 0001.
0003 is the AVX512 POPCNT patch. Besides refactoring out 0001, there are
no changes from v18. 0004 is an early proof-of-concept for using AVX512
for the visibility map code. The code is missing comments, and I haven't
performed any benchmarking yet, but I figured I'd post it because it
demonstrates how it's possible to build upon 0003 in other areas.
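For readers who want to see what the masked popcount in 0004 computes, here is a small standalone sketch. It is not the patch's code: the sketch_* names are hypothetical, __builtin_popcountll and __builtin_popcount assume GCC or clang, and memcpy is used for the word loads so the buffer does not need to be aligned. It shows how an 8-bit mask such as 0x55 (the lower bit of each bit pair) is replicated across a 64-bit word and applied before counting.

```c
#include <stdint.h>
#include <string.h>

/* Replicate an 8-bit mask into every byte of a 64-bit word. */
static uint64_t
sketch_broadcast_mask(uint8_t mask)
{
	/* ~0 / 0xFF is 0x0101010101010101, so the product repeats mask 8 times */
	return ~UINT64_C(0) / 0xFF * mask;
}

/* Count the 1-bits of buf that survive ANDing every byte with mask. */
static uint64_t
sketch_popcount_masked(const unsigned char *buf, int bytes, uint8_t mask)
{
	const uint64_t maskv = sketch_broadcast_mask(mask);
	uint64_t	popcnt = 0;

	/* Whole 64-bit words first */
	for (; bytes >= 8; bytes -= 8, buf += 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));
		popcnt += (uint64_t) __builtin_popcountll(word & maskv);
	}

	/* Then any remaining bytes, one at a time */
	while (bytes-- > 0)
		popcnt += (uint64_t) __builtin_popcount((unsigned) (*buf++ & mask));

	return popcnt;
}
```

Calling it once with 0x55 and once with 0xaa gives the two counts that visibilitymap_count() needs, at the cost of reading the map twice when both are requested.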
I've committed the first two patches, and I've attached a rebased version
of the latter two.
AFAICT the main open question is the function call overhead in 0003 that
Alvaro brought up earlier. After 0002 is committed, I believe the only
in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm
not sure it's worth expending too much energy to make sure there are
absolutely no regressions there. However, I'm happy to do so if folks feel
that it is necessary, and I'd be grateful for thoughts on how to proceed on
this one.
Another idea I had is to turn pg_popcount() into a macro that just uses the
pg_number_of_ones array when called for few bytes:
static inline uint64
pg_popcount_inline(const char *buf, int bytes)
{
uint64 popcnt = 0;
while (bytes--)
popcnt += pg_number_of_ones[(unsigned char) *buf++];
return popcnt;
}
#define pg_popcount(buf, bytes) \
(((bytes) < 64) ? \
pg_popcount_inline((buf), (bytes)) : \
pg_popcount_optimized((buf), (bytes)))
But again, I'm not sure this is really worth it for the current use-cases.
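For anyone who wants to experiment with the threshold idea outside the tree, a self-contained variant might look like the sketch below. The sketch_* names are hypothetical, the 64-byte cutoff is simply the one from the macro above, __builtin_popcount stands in for the pg_number_of_ones table, and sketch_popcount_optimized is only a placeholder for whatever the function pointer would select at runtime. It uses a static inline function rather than a macro so that the arguments are evaluated exactly once.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_INLINE_THRESHOLD 64	/* cutoff taken from the macro above */

/* Placeholder for the out-of-line, runtime-selected implementation. */
static uint64_t
sketch_popcount_optimized(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	for (; bytes >= 8; bytes -= 8, buf += 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));
		popcnt += (uint64_t) __builtin_popcountll(word);
	}
	while (bytes-- > 0)
		popcnt += (uint64_t) __builtin_popcount((unsigned char) *buf++);
	return popcnt;
}

/* Byte-at-a-time path that the compiler can inline at the call site. */
static inline uint64_t
sketch_popcount_inline(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	while (bytes-- > 0)
		popcnt += (uint64_t) __builtin_popcount((unsigned char) *buf++);
	return popcnt;
}

/* Short inputs stay inline; long inputs pay for the function call. */
static inline uint64_t
sketch_popcount(const char *buf, int bytes)
{
	if (bytes < SKETCH_INLINE_THRESHOLD)
		return sketch_popcount_inline(buf, bytes);
	return sketch_popcount_optimized(buf, bytes);
}
```

When the length is a compile-time constant, the compiler can usually drop the untaken branch, so small fixed-size callers should not pay for the comparison either; that is an expectation about typical optimizers, not something measured here.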
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v20-0001-AVX512-popcount-support.patch
From 3c5c3fdaffd623b513bcc476ee7c15f6379af1e7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v20 1/2] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 15 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 7 +-
src/port/pg_popcount_avx512.c | 49 ++++++
src/port/pg_popcount_avx512_choose.c | 71 ++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 651 insertions(+), 5 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..2bbd81dfb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..96be29c22b 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1a92c56bcd 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,21 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files. Note that we only build this when TRY_POPCNT_FAST is set so that we
+ * can fall back to pg_popcount_fast() as needed.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..177509518f 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,6 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -156,6 +155,10 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount = pg_popcount_avx512;
+#endif
}
else
{
@@ -224,7 +227,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..f86558d1ee
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..9e81cd33ad
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
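For readers following the runtime check in pg_popcount_avx512_available(), here is a minimal sketch of the same decision written as a pure function of the register values, so it can be exercised without executing CPUID or XGETBV. The function name and its parameters are hypothetical and not part of the patch; the bit positions are the ones the patch tests.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): decide from raw register
 * values whether the AVX-512 popcount path may be used.
 *
 *   leaf7_ecx: ECX from CPUID leaf 7, subleaf 0
 *   leaf1_ecx: ECX from CPUID leaf 1
 *   xcr0:      value of XCR0, as returned by _xgetbv(0)
 */
bool
avx512_popcnt_usable(uint32_t leaf7_ecx, uint32_t leaf1_ecx, uint64_t xcr0)
{
	/* CPUID.(EAX=7,ECX=0):ECX bit 14 reports AVX512_VPOPCNTDQ */
	if ((leaf7_ecx & (UINT32_C(1) << 14)) == 0)
		return false;

	/* CPUID.1:ECX bit 26 reports XSAVE; the patch checks it before XGETBV */
	if ((leaf1_ecx & (UINT32_C(1) << 26)) == 0)
		return false;

	/* XCR0 bits 5-7 (0xe0) cover opmask and ZMM state; same test as the patch */
	return (xcr0 & 0xe0) != 0;
}
```

Keeping the decision separate from the instructions that fetch the registers makes each early return easy to check in isolation.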
v20-0002-optimize-visibilitymap_count-with-AVX512.patch (text/x-diff; charset=us-ascii)
From 6c08e5bc0000fcb7a3cbd485e012b049a671c00b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sun, 31 Mar 2024 22:22:15 -0500
Subject: [PATCH v20 2/2] optimize visibilitymap_count() with AVX512
---
src/backend/access/heap/visibilitymap.c | 25 ++----
src/include/port/pg_bitutils.h | 6 +-
src/port/pg_bitutils.c | 113 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 21 +++++
4 files changed, 144 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 1a92c56bcd..bafb20a427 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,9 +303,11 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked) (const char *buf, int bytes, bits8 mask);
-/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+/* Exported for use in the AVX512 implementation. */
extern uint64 pg_popcount_fast(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX512 popcount instruction on some systems.
@@ -317,6 +319,7 @@ extern uint64 pg_popcount_fast(const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -324,6 +327,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_masked(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 177509518f..902ecdebbf 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,18 +106,21 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -155,9 +158,13 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount = pg_popcount_fast;
+ pg_popcount_masked = pg_popcount_masked_fast;
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount = pg_popcount_avx512;
+ pg_popcount_masked = pg_popcount_masked_avx512;
+ }
#endif
}
else
@@ -165,6 +172,7 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount = pg_popcount_slow;
+ pg_popcount_masked = pg_popcount_masked_slow;
}
}
@@ -189,6 +197,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -269,6 +284,52 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & maskv];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -368,6 +429,52 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & maskv];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -399,4 +506,10 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee..8965a8d530 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -46,4 +46,25 @@ pg_popcount_avx512(const char *buf, int bytes)
return popcnt + pg_popcount_fast(buf, bytes);
}
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i vmasked = _mm512_and_si512(val, maskv);
+ const __m512i cnt = _mm512_popcnt_epi64(vmasked);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_masked_fast(buf, bytes, mask);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
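As an aside on the maskv computation in the masked variants: dividing an all-ones word by 0xFF gives a word with 0x01 in every byte, and multiplying that by the 8-bit mask copies the mask into each byte. A minimal sketch, using hypothetical helper names that do not appear in the patch:

```c
#include <stdint.h>

/* Hypothetical helper: replicate an 8-bit mask into every byte of a uint64. */
uint64_t
broadcast_mask64(uint8_t mask)
{
	/* ~0 / 0xFF is 0x0101010101010101; the multiply fills each byte with mask */
	return ~UINT64_C(0) / 0xFF * mask;
}

/* Hypothetical helper: the same trick for a 32-bit word. */
uint32_t
broadcast_mask32(uint8_t mask)
{
	/* the all-ones operand is unsigned here, giving 0x01010101 before the multiply */
	return ~UINT32_C(0) / 0xFF * mask;
}
```

The all-ones operand has to be unsigned. A plain ~0 has type int and value -1, so ~0 / 0xFF evaluates to 0 and the resulting mask would be empty.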
On 2024-Apr-02, Nathan Bossart wrote:
Another idea I had is to turn pg_popcount() into a macro that just uses the
pg_number_of_ones array when called for few bytes:

static inline uint64
pg_popcount_inline(const char *buf, int bytes)
{
    uint64 popcnt = 0;

    while (bytes--)
        popcnt += pg_number_of_ones[(unsigned char) *buf++];

    return popcnt;
}

#define pg_popcount(buf, bytes) \
    ((bytes < 64) ? \
    pg_popcount_inline(buf, bytes) : \
    pg_popcount_optimized(buf, bytes))

But again, I'm not sure this is really worth it for the current use-cases.

Eh, that seems simple enough, and then you can forget about that case.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
On 2024-Apr-02, Nathan Bossart wrote:
Another idea I had is to turn pg_popcount() into a macro that just uses the
pg_number_of_ones array when called for few bytes:

static inline uint64
pg_popcount_inline(const char *buf, int bytes)
{
    uint64 popcnt = 0;

    while (bytes--)
        popcnt += pg_number_of_ones[(unsigned char) *buf++];

    return popcnt;
}

#define pg_popcount(buf, bytes) \
    ((bytes < 64) ? \
    pg_popcount_inline(buf, bytes) : \
    pg_popcount_optimized(buf, bytes))

But again, I'm not sure this is really worth it for the current use-cases.

Eh, that seems simple enough, and then you can forget about that case.
I don't like the double evaluation of the macro argument. Seems like
you could get the same results more safely with
static inline uint64
pg_popcount(const char *buf, int bytes)
{
    if (bytes < 64)
    {
        uint64 popcnt = 0;
        while (bytes--)
            popcnt += pg_number_of_ones[(unsigned char) *buf++];
        return popcnt;
    }
    return pg_popcount_optimized(buf, bytes);
}
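
To make the double-evaluation concern concrete, here is a small sketch. All names in it are hypothetical stand-ins; count_bits() plays the role of both the inline and the optimized paths, so that only the number of argument evaluations differs between the two forms.

```c
#include <stdint.h>

static int	calls = 0;

/* Argument expression with a side effect: counts how often it is evaluated. */
int
next_len(void)
{
	calls++;
	return 8;
}

/* Byte-at-a-time popcount, standing in for both code paths in this sketch. */
uint64_t
count_bits(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	while (bytes--)
	{
		unsigned char c = (unsigned char) *buf++;

		for (; c != 0; c &= c - 1)	/* clear the lowest set bit */
			popcnt++;
	}
	return popcnt;
}

/* Macro form: "bytes" appears in the test and again in the selected branch. */
#define POPCOUNT_MACRO(buf, bytes) \
	(((bytes) < 64) ? count_bits((buf), (bytes)) : count_bits((buf), (bytes)))

/* Inline-function form: the argument is evaluated once, at the call site. */
static inline uint64_t
popcount_inline(const char *buf, int bytes)
{
	if (bytes < 64)
		return count_bits(buf, bytes);
	return count_bits(buf, bytes);
}

/* Number of times next_len() runs for a single macro invocation. */
int
macro_evaluations(const char *buf)
{
	calls = 0;
	(void) POPCOUNT_MACRO(buf, next_len());
	return calls;
}

/* Number of times next_len() runs for a single inline-function call. */
int
inline_evaluations(const char *buf)
{
	calls = 0;
	(void) popcount_inline(buf, next_len());
	return calls;
}
```

With an 8-byte buffer, the macro form runs next_len() twice and the inline form runs it once. An argument such as len++ would therefore behave differently under the macro.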
regards, tom lane
On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
On 2024-Apr-02, Nathan Bossart wrote:
Another idea I had is to turn pg_popcount() into a macro that just uses the
pg_number_of_ones array when called for few bytes:

static inline uint64
pg_popcount_inline(const char *buf, int bytes)
{
    uint64 popcnt = 0;

    while (bytes--)
        popcnt += pg_number_of_ones[(unsigned char) *buf++];

    return popcnt;
}

#define pg_popcount(buf, bytes) \
    ((bytes < 64) ? \
    pg_popcount_inline(buf, bytes) : \
    pg_popcount_optimized(buf, bytes))

But again, I'm not sure this is really worth it for the current use-cases.

Eh, that seems simple enough, and then you can forget about that case.

I don't like the double evaluation of the macro argument. Seems like
you could get the same results more safely with

static inline uint64
pg_popcount(const char *buf, int bytes)
{
    if (bytes < 64)
    {
        uint64 popcnt = 0;

        while (bytes--)
            popcnt += pg_number_of_ones[(unsigned char) *buf++];

        return popcnt;
    }
    return pg_popcount_optimized(buf, bytes);
}

Yeah, I like that better. I'll do some testing to see what the threshold
really should be before posting an actual patch.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked out portions of a load instruction
will not generate an exception. To allow byte level granularity
masking, -mavx512bw is needed. Based on wikipedia this will only
disable this fast path on Knights Mill (Xeon Phi), in all other cases
VPOPCNTQ implies availability of BW.

Sounds promising. IMHO we should really be sure that these kinds of loads
won't generate segfaults and the like due to the masked-out portions. I
searched around a little bit but haven't found anything that seemed
definitive.
After sleeping on the problem, I think we can avoid this question
altogether while making the code faster by using aligned accesses.
Loads that straddle cache line boundaries run internally as 2 load
operations. Gut feel says that there are enough out-of-order resources
available to make it not matter in most cases. But even so, not doing
the extra work is surely better. Attached is another approach that
does aligned accesses, and thereby avoids going outside bounds.
Would be interesting to see how well that fares in the small use case.
Anything that fits into one aligned cache line should be constant
speed, and there is only one branch, but the mask setup and folding
the separate popcounts together should add up to about 20-ish cycles
of overhead.
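
The mask arithmetic described above can be sketched in portable C, without AVX-512, by emulating the zero-masked load one byte at a time. This is only an illustration under stated assumptions: the function names are hypothetical, base stands in for a 64-byte-aligned address, and the range is passed as an offset and length instead of a raw pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64				/* bytes per 512-bit vector */

/*
 * Scalar stand-in for a zero-masked 64-byte load followed by a popcount:
 * only bytes whose mask bit is set are read and counted.
 */
static uint64_t
masked_block_popcount(const unsigned char *block, uint64_t mask)
{
	uint64_t	popcnt = 0;

	for (int i = 0; i < BLOCK; i++)
	{
		if (mask & (UINT64_C(1) << i))
		{
			unsigned char c = block[i];

			for (; c != 0; c &= c - 1)	/* clear the lowest set bit */
				popcnt++;
		}
	}
	return popcnt;
}

/*
 * Count the set bits in base[start .. start + len - 1], where "base" plays
 * the role of a 64-byte-aligned address. Requires len > 0.
 */
uint64_t
popcount_aligned_masked(const unsigned char *base, size_t start, size_t len)
{
	uint64_t	mask = ~UINT64_C(0) << (start % BLOCK);	/* skip leading bytes */
	size_t		tail_idx = (start + len - 1) % BLOCK + 1;	/* 1..64 valid bytes */
	size_t		final = (start + len - 1) / BLOCK * BLOCK;	/* last block offset */
	size_t		pos = start / BLOCK * BLOCK;	/* first block offset */
	uint64_t	popcnt = 0;

	assert(len > 0);

	for (; pos < final; pos += BLOCK)
	{
		popcnt += masked_block_popcount(base + pos, mask);
		mask = ~UINT64_C(0);	/* only the first block has leading bytes */
	}

	/* final block: additionally drop bytes past the end of the range */
	mask &= ~UINT64_C(0) >> (BLOCK - tail_idx);
	return popcnt + masked_block_popcount(base + pos, mask);
}
```

When the whole range fits in one block, the loop body never runs and the head and tail masks are combined in the final step, which is the single-branch case mentioned above.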
Regards,
Ants Aasma
Attachments:
avx512-popcnt-aligned-and-masked.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..e1fbd98fa14 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,44 @@
uint64
pg_popcount_avx512(const char *buf, int bytes)
{
- uint64 popcnt;
+ __m512i val, cnt;
__m512i accum = _mm512_setzero_si512();
+ const char *final;
+ int tail_idx;
+ __mmask64 mask = -1;
- for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
- {
- const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
+ /*
+ * Align buffer down to avoid double load overhead from unaligned access.
+ * Calculate a mask to ignore preceding bytes. Find start offset of final
+ * iteration and number of valid bytes making sure that final iteration
+ * is not empty.
+ */
+ mask <<= ((uintptr_t) buf) % sizeof(__m512i);
+ tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+ final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+ /*
+ * Iterate through all but the final iteration. Starting from second
+ * iteration, the start index mask is ignored.
+ */
+ for (; buf < final; buf += sizeof(__m512i))
+ {
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
accum = _mm512_add_epi64(accum, cnt);
- buf += sizeof(__m512i);
+
+ mask = -1;
}
- popcnt = _mm512_reduce_add_epi64(accum);
- return popcnt + pg_popcount_fast(buf, bytes);
+ /* Final iteration needs to ignore bytes that are not within the length */
+ mask &= ((~0ULL) >> (64 - tail_idx));
+
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ return _mm512_reduce_add_epi64(accum);
}
#endif /* TRY_POPCNT_FAST */
On Tue, Apr 02, 2024 at 01:40:21PM -0500, Nathan Bossart wrote:
On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote:
I don't like the double evaluation of the macro argument. Seems like
you could get the same results more safely with

static inline uint64
pg_popcount(const char *buf, int bytes)
{
    if (bytes < 64)
    {
        uint64 popcnt = 0;

        while (bytes--)
            popcnt += pg_number_of_ones[(unsigned char) *buf++];

        return popcnt;
    }
    return pg_popcount_optimized(buf, bytes);
}

Yeah, I like that better. I'll do some testing to see what the threshold
really should be before posting an actual patch.
My testing shows that inlining wins with fewer than 8 bytes for the current
"fast" implementation. The "fast" implementation wins with fewer than 64
bytes compared to the AVX-512 implementation. These results are pretty
intuitive because those are the points at which the optimizations kick in.
In v21, 0001 is just the above inlining idea, which seems worth doing
independent of $SUBJECT. 0002 and 0003 are the AVX-512 patches, which I've
modified similarly to 0001, i.e., I've inlined the "fast" version in the
function pointer to avoid the function call overhead when there are fewer
than 64 bytes. All of this overhead juggling should result in choosing the
optimal popcount implementation depending on how many bytes there are to
process, roughly speaking.
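
As a rough illustration of the tiering described above, here is a standalone sketch. It is not code from the patches: popcount_bytewise(), popcount_words(), popcount_blocks() and popcount_tiered() are hypothetical names, the block path is only a scalar stand-in for the AVX-512 loop, and __builtin_popcount()/__builtin_popcountll() assume GCC or Clang.

```c
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time loop; stands in for the inlined pg_number_of_ones loop. */
static inline uint64_t
popcount_bytewise(const char *buf, int bytes)
{
	uint64_t	n = 0;

	while (bytes--)
		n += (uint64_t) __builtin_popcount((unsigned char) *buf++);
	return n;
}

/* Eight bytes per step; stands in for the "fast" POPCNT implementation. */
static uint64_t
popcount_words(const char *buf, int bytes)
{
	uint64_t	n = 0;

	while (bytes >= 8)
	{
		uint64_t	w;

		memcpy(&w, buf, sizeof(w));		/* avoids unaligned access */
		n += (uint64_t) __builtin_popcountll(w);
		buf += 8;
		bytes -= 8;
	}
	return n + popcount_bytewise(buf, bytes);
}

/* 64 bytes per step; a scalar placeholder for the AVX-512 loop. */
static uint64_t
popcount_blocks(const char *buf, int bytes)
{
	uint64_t	n = 0;

	for (; bytes >= 64; bytes -= 64, buf += 64)
		n += popcount_words(buf, 64);
	return n + popcount_words(buf, bytes);
}

/* Pick a tier from the length, using the 8- and 64-byte thresholds. */
static inline uint64_t
popcount_tiered(const char *buf, int bytes)
{
	if (bytes < 8)
		return popcount_bytewise(buf, bytes);
	if (bytes < 64)
		return popcount_words(buf, bytes);
	return popcount_blocks(buf, bytes);
}
```

In the patches the first check lives in the inline pg_popcount() and the second in the function reached through the pointer, so only inputs of 8 bytes or more pay for an indirect call.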
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v21-0001-inline-pg_popcount-for-small-numbers-of-bytes.patch (text/x-diff; charset=us-ascii)
From ce1180d557cbdf8cff33842ea2f1a22ba6676725 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 2 Apr 2024 15:54:49 -0500
Subject: [PATCH v21 1/3] inline pg_popcount for small numbers of bytes
---
src/include/port/pg_bitutils.h | 34 ++++++++++++++++++++++++++++++++--
src/port/pg_bitutils.c | 10 +++++-----
2 files changed, 37 insertions(+), 7 deletions(-)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1f487a4bc3 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num)
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
#endif /* TRY_POPCNT_FAST */
+/*
+ * Returns the number of 1-bits in buf.
+ *
+ * If there aren't many bytes to process, the function call overhead of the
+ * optimized versions isn't worth taking, so we inline a loop that consults
+ * pg_number_of_ones in that case. If there are many bytes to process, we
+ * accept the function call overhead because the optimized versions are likely
+ * to be faster.
+ */
+static inline uint64
+pg_popcount(const char *buf, int bytes)
+{
+ /*
+ * We use 8 bytes as the threshold because that's where we'll first use
+ * special instructions on 64-bit systems. A threshold of 4 bytes might
+ * make more sense on 32-bit systems, but it seems unlikely to make a
+ * tremendous difference.
+ */
+ if (bytes < 8)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+ return popcnt;
+ }
+
+ return pg_popcount_optimized(buf, bytes);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..4720f8e419 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -155,13 +155,13 @@ choose_popcount_functions(void)
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
+ pg_popcount_optimized = pg_popcount_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
+ pg_popcount_optimized = pg_popcount_slow;
}
}
@@ -183,7 +183,7 @@ static uint64
pg_popcount_choose(const char *buf, int bytes)
{
choose_popcount_functions();
- return pg_popcount(buf, bytes);
+ return pg_popcount_optimized(buf, bytes);
}
/*
@@ -391,7 +391,7 @@ pg_popcount64(uint64 word)
* Returns the number of 1-bits in buf
*/
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_optimized(const char *buf, int bytes)
{
return pg_popcount_slow(buf, bytes);
}
--
2.25.1
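
For readers unfamiliar with the function-pointer arrangement that 0001 renames, the sketch below shows the general pattern of a pointer that starts at a "choose" routine and is overwritten on first use. This is a simplified illustration with hypothetical names (popcount_impl, popcount_choose, popcount_portable, choose_calls); the real chooser probes the CPU, which is omitted here.

```c
#include <stdint.h>

static uint64_t popcount_choose(const char *buf, int bytes);
static uint64_t popcount_portable(const char *buf, int bytes);

/* The pointer initially targets the chooser, not a real implementation. */
static uint64_t (*popcount_impl) (const char *buf, int bytes) = popcount_choose;

/* Counts chooser invocations, only to show that it runs once. */
static int	choose_calls = 0;

/* Portable fallback: clear the lowest set bit until the byte is zero. */
static uint64_t
popcount_portable(const char *buf, int bytes)
{
	uint64_t	n = 0;

	while (bytes--)
	{
		unsigned char b = (unsigned char) *buf++;

		for (; b != 0; b &= (unsigned char) (b - 1))
			n++;
	}
	return n;
}

/*
 * First call lands here.  Real code would test CPU features before picking
 * an implementation; this sketch always installs the portable one.
 */
static uint64_t
popcount_choose(const char *buf, int bytes)
{
	choose_calls++;
	popcount_impl = popcount_portable;
	return popcount_impl(buf, bytes);
}
```

Callers always go through popcount_impl, so the selection cost is paid once and later calls jump straight to the chosen implementation.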
v21-0002-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From 17d539a243acba83a2d9f295e072f26fbd154046 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v21 2/3] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 15 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 29 ++-
src/port/pg_popcount_avx512.c | 49 ++++++
src/port/pg_popcount_avx512_choose.c | 71 ++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 673 insertions(+), 5 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..2bbd81dfb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..96be29c22b 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 1f487a4bc3..e4e96952b7 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,21 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files. Note that we only build this when TRY_POPCNT_FAST is set so that we
+ * can fall back to pg_popcount_fast() as needed.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 4720f8e419..b02721b7e5 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,10 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -156,6 +159,10 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount_optimized = pg_popcount_fast_or_avx512;
+#endif
}
else
{
@@ -224,7 +231,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -266,6 +273,24 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * This is a wrapper function for pg_popcount_avx512() that uses
+ * pg_popcount_fast() when there aren't enough bytes to fit in an AVX-512
+ * register. The compiler should be able to inline pg_popcount_fast() so that
+ * we only take on additional function call overhead when it's likely to be a
+ * better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_fast_or_avx512(const char *buf, int bytes)
+{
+ if (bytes < 64)
+ return pg_popcount_fast(buf, bytes);
+ else
+ return pg_popcount_avx512(buf, bytes);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..f86558d1ee
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..9e81cd33ad
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) == 0xe0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
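For readers following the runtime check in pg_popcount_avx512_available(), here is a minimal sketch of the same three-step decision written as a pure function over register values, so the bit positions can be read and tested without executing CPUID or XGETBV. The function and macro names are hypothetical and not part of the patch. The sketch assumes that all three AVX-512 state bits in XCR0 (opmask, ZMM_Hi256, Hi16_ZMM) must be set, which is stricter than accepting any one of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):ECX bit 14 reports AVX512_VPOPCNTDQ. */
#define SKETCH_CPUID7_ECX_VPOPCNTDQ	(UINT32_C(1) << 14)
/* CPUID.(EAX=1):ECX bit 26 reports XSAVE, which XGETBV depends on. */
#define SKETCH_CPUID1_ECX_XSAVE		(UINT32_C(1) << 26)
/* XCR0 bits 5..7: opmask, ZMM_Hi256 and Hi16_ZMM state components. */
#define SKETCH_XCR0_AVX512_STATE	UINT64_C(0xe0)

/*
 * Hypothetical helper: decide from raw register values whether the AVX-512
 * popcount path may be used.  A real caller would fill the arguments from
 * CPUID leaf 7 (ECX), CPUID leaf 1 (ECX) and XGETBV(0).
 */
static bool
sketch_avx512_popcnt_usable(uint32_t cpuid7_ecx, uint32_t cpuid1_ecx,
							uint64_t xcr0)
{
	/* Does the CPU implement the instruction at all? */
	if ((cpuid7_ecx & SKETCH_CPUID7_ECX_VPOPCNTDQ) == 0)
		return false;

	/* Can XCR0 be read? */
	if ((cpuid1_ecx & SKETCH_CPUID1_ECX_XSAVE) == 0)
		return false;

	/* Has the OS enabled every piece of ZMM state (assumption: all three)? */
	return (xcr0 & SKETCH_XCR0_AVX512_STATE) == SKETCH_XCR0_AVX512_STATE;
}

/* Usage example: an XCR0 of 0x20 has only the opmask bit set. */
void
sketch_avx512_examples(void)
{
	assert(sketch_avx512_popcnt_usable(1u << 14, 1u << 26, 0xe7));
	assert(!sketch_avx512_popcnt_usable(1u << 14, 1u << 26, 0x20));
}
```

Keeping the decision separate from the instructions that fetch the values makes the bit tests easy to exercise on any machine, including ones without AVX-512.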
v21-0003-optimize-visibilitymap_count-with-AVX512.patch (text/x-diff; charset=us-ascii)
From 49673b12e180ca47d8598b8ba821ece7f1338843 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sun, 31 Mar 2024 22:22:15 -0500
Subject: [PATCH v21 3/3] optimize visibilitymap_count() with AVX512
---
src/backend/access/heap/visibilitymap.c | 25 +----
src/include/port/pg_bitutils.h | 27 ++++-
src/port/pg_bitutils.c | 143 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 25 +++++
4 files changed, 199 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index e4e96952b7..5d79e629c5 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,9 +303,11 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
-/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+/* Exported for use in the AVX512 implementation. */
extern uint64 pg_popcount_fast(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX512 popcount instruction on some systems.
@@ -317,6 +319,7 @@ extern uint64 pg_popcount_fast(const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -324,6 +327,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
@@ -357,6 +361,27 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+/*
+ * Returns the number of 1-bits in buf after applying the mask to each byte.
+ *
+ * Similar to pg_popcount(), we only take on the function pointer overhead when
+ * it's likely to be faster.
+ */
+static inline uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ if (bytes < 8)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+ return popcnt;
+ }
+
+ return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index b02721b7e5..0e4ed11a50 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,22 +106,26 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+static uint64 pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask);
#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -161,7 +165,10 @@ choose_popcount_functions(void)
pg_popcount_optimized = pg_popcount_fast;
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount_optimized = pg_popcount_fast_or_avx512;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast_or_avx512;
+ }
#endif
}
else
@@ -169,6 +176,7 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
+ pg_popcount_masked_optimized = pg_popcount_masked_slow;
}
}
@@ -193,6 +201,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -291,6 +306,74 @@ pg_popcount_fast_or_avx512(const char *buf, int bytes)
}
#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+/*
+ * pg_popcount_masked_fast
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
+/*
+ * This is a wrapper function for pg_popcount_masked_avx512() that uses
+ * pg_popcount_masked_fast() when there aren't enough bytes to fit in an
+ * AVX-512 register. The compiler should be able to inline
+ * pg_popcount_masked_fast() so that we only take on additional function call
+ * overhead when it's likely to be a better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask)
+{
+ if (bytes < 64)
+ return pg_popcount_masked_fast(buf, bytes, mask);
+ else
+ return pg_popcount_masked_avx512(buf, bytes, mask);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
@@ -390,6 +473,56 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_slow
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -421,4 +554,14 @@ pg_popcount_optimized(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+/*
+ * pg_popcount_masked_optimized
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee..fb9ab3313b 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -46,4 +46,29 @@ pg_popcount_avx512(const char *buf, int bytes)
return popcnt + pg_popcount_fast(buf, bytes);
}
+/*
+ * pg_popcount_masked_avx512
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i vmasked = _mm512_and_si512(val, maskv);
+ const __m512i cnt = _mm512_popcnt_epi64(vmasked);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_masked_fast(buf, bytes, mask);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
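The masked variants above rely on replicating the 8-bit mask into every byte of a wider word before ANDing. Below is a minimal portable sketch of that trick, checked against a byte-at-a-time loop. All names are hypothetical, the bit-clearing loop stands in for pg_number_of_ones[] and the POPCNT instruction, and memcpy() is used instead of the alignment test in the patch. It is an illustration under those assumptions, not the patch's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate an 8-bit mask into every byte of a 64-bit word. */
static uint64_t
sketch_broadcast_mask64(uint8_t mask)
{
	return ~UINT64_C(0) / 0xFF * mask;	/* 0x0101010101010101 * mask */
}

/*
 * 32-bit variant.  The dividend has to be unsigned: with a plain int, ~0 is
 * -1, and -1 / 0xFF is 0, which would zero the whole mask.
 */
static uint32_t
sketch_broadcast_mask32(uint8_t mask)
{
	return ~UINT32_C(0) / 0xFF * mask;
}

/* Count set bits by clearing the lowest set bit until none remain. */
static unsigned
sketch_popcount64(uint64_t word)
{
	unsigned	n = 0;

	while (word)
	{
		word &= word - 1;
		n++;
	}
	return n;
}

/* Reference: apply the mask to one byte at a time. */
static uint64_t
sketch_popcount_masked_bytewise(const unsigned char *buf, size_t bytes,
								uint8_t mask)
{
	uint64_t	popcnt = 0;

	while (bytes--)
		popcnt += sketch_popcount64((uint64_t) (*buf++ & mask));
	return popcnt;
}

/* Apply the broadcast mask to 8 bytes at a time, then finish bytewise. */
static uint64_t
sketch_popcount_masked_wordwise(const unsigned char *buf, size_t bytes,
								uint8_t mask)
{
	const uint64_t maskv = sketch_broadcast_mask64(mask);
	uint64_t	popcnt = 0;

	assert(buf != NULL || bytes == 0);

	while (bytes >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* no alignment assumption */
		popcnt += sketch_popcount64(word & maskv);
		buf += 8;
		bytes -= 8;
	}

	/* The tail uses the original 8-bit mask, not the broadcast one. */
	return popcnt + sketch_popcount_masked_bytewise(buf, bytes, mask);
}
```

With a mask of 0x55 the broadcast value is 0x5555555555555555, which matches the VISIBLE_MASK64 constant that the visibilitymap.c hunk removes, and 0xaa gives the former FROZEN_MASK64.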
On Tue, Apr 02, 2024 at 05:01:32PM -0500, Nathan Bossart wrote:
> In v21, 0001 is just the above inlining idea, which seems worth doing
> independent of $SUBJECT. 0002 and 0003 are the AVX-512 patches, which I've
> modified similarly to 0001, i.e., I've inlined the "fast" version in the
> function pointer to avoid the function call overhead when there are fewer
> than 64 bytes. All of this overhead juggling should result in choosing the
> optimal popcount implementation depending on how many bytes there are to
> process, roughly speaking.
Sorry for the noise. I noticed a couple of silly mistakes immediately
after sending v21.
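
For anyone skimming the thread, the size-based dispatch described in the quoted paragraph can be sketched roughly as follows. This is a portable illustration with hypothetical names: sketch_popcount_blocks() is a scalar stand-in for the AVX-512 routine, and sketch_have_wide_path() is a stub where real code would consult CPUID. Only the 8-byte and 64-byte thresholds are taken from the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t (*sketch_popcount_fn) (const unsigned char *buf, size_t bytes);

static uint64_t sketch_popcount_choose(const unsigned char *buf, size_t bytes);

/* Starts out pointing at the chooser, which replaces it on first use. */
static sketch_popcount_fn sketch_popcount_optimized = sketch_popcount_choose;

static unsigned
sketch_popcount_word(uint64_t w)
{
	unsigned	n = 0;

	for (; w; w &= w - 1)
		n++;
	return n;
}

/* Stand-in for the "fast" path: 8 bytes at a time, then the tail. */
static uint64_t
sketch_popcount_words(const unsigned char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

	for (; bytes >= 8; bytes -= 8, buf += 8)
	{
		uint64_t	w;

		memcpy(&w, buf, sizeof(w));
		popcnt += sketch_popcount_word(w);
	}
	while (bytes--)
		popcnt += sketch_popcount_word(*buf++);
	return popcnt;
}

/* Scalar stand-in for the AVX-512 path: 64-byte blocks, eight 64-bit lanes. */
static uint64_t
sketch_popcount_blocks(const unsigned char *buf, size_t bytes)
{
	uint64_t	lanes[8] = {0};
	uint64_t	popcnt = 0;

	for (; bytes >= 64; bytes -= 64, buf += 64)
		for (int i = 0; i < 8; i++)
		{
			uint64_t	w;

			memcpy(&w, buf + 8 * i, sizeof(w));
			lanes[i] += sketch_popcount_word(w);
		}
	for (int i = 0; i < 8; i++)
		popcnt += lanes[i];
	return popcnt + sketch_popcount_words(buf, bytes);
}

/* Below 64 bytes the wide path has nothing to do, so skip the extra call. */
static uint64_t
sketch_popcount_words_or_blocks(const unsigned char *buf, size_t bytes)
{
	if (bytes < 64)
		return sketch_popcount_words(buf, bytes);
	return sketch_popcount_blocks(buf, bytes);
}

/* Hypothetical stub; real code would ask CPUID and XGETBV. */
static int
sketch_have_wide_path(void)
{
	return 1;
}

static uint64_t
sketch_popcount_choose(const unsigned char *buf, size_t bytes)
{
	sketch_popcount_optimized = sketch_have_wide_path() ?
		sketch_popcount_words_or_blocks : sketch_popcount_words;
	return sketch_popcount_optimized(buf, bytes);
}

/* Entry point: tiny inputs never pay for the indirect call. */
static inline uint64_t
sketch_popcount(const unsigned char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

	assert(buf != NULL || bytes == 0);
	if (bytes >= 8)
		return sketch_popcount_optimized(buf, bytes);
	while (bytes--)
		popcnt += sketch_popcount_word(*buf++);
	return popcnt;
}
```

A 7-byte call stays entirely inline and leaves the function pointer untouched. The first call with 8 or more bytes goes through the chooser once, and later calls go straight to the selected routine.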
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v22-0001-inline-pg_popcount-for-small-numbers-of-bytes.patch (text/x-diff; charset=us-ascii)
From cfc5e9fe77f96225ec67a044377b10113c98ce0d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 2 Apr 2024 15:54:49 -0500
Subject: [PATCH v22 1/3] inline pg_popcount for small numbers of bytes
---
src/include/port/pg_bitutils.h | 34 ++++++++++++++++++++++++++++++++--
src/port/pg_bitutils.c | 12 ++++++------
2 files changed, 38 insertions(+), 8 deletions(-)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1f487a4bc3 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num)
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
#endif /* TRY_POPCNT_FAST */
+/*
+ * Returns the number of 1-bits in buf.
+ *
+ * If there aren't many bytes to process, the function call overhead of the
+ * optimized versions isn't worth taking, so we inline a loop that consults
+ * pg_number_of_ones in that case. If there are many bytes to process, we
+ * accept the function call overhead because the optimized versions are likely
+ * to be faster.
+ */
+static inline uint64
+pg_popcount(const char *buf, int bytes)
+{
+ /*
+ * We use 8 bytes as the threshold because that's where we'll first use
+ * special instructions on 64-bit systems. A threshold of 4 bytes might
+ * make more sense on 32-bit systems, but it seems unlikely to make a
+ * tremendous difference.
+ */
+ if (bytes < 8)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+ return popcnt;
+ }
+
+ return pg_popcount_optimized(buf, bytes);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..6271acea60 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -155,13 +155,13 @@ choose_popcount_functions(void)
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
+ pg_popcount_optimized = pg_popcount_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
+ pg_popcount_optimized = pg_popcount_slow;
}
}
@@ -183,7 +183,7 @@ static uint64
pg_popcount_choose(const char *buf, int bytes)
{
choose_popcount_functions();
- return pg_popcount(buf, bytes);
+ return pg_popcount_optimized(buf, bytes);
}
/*
@@ -387,11 +387,11 @@ pg_popcount64(uint64 word)
}
/*
- * pg_popcount
+ * pg_popcount_optimized
* Returns the number of 1-bits in buf
*/
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_optimized(const char *buf, int bytes)
{
return pg_popcount_slow(buf, bytes);
}
--
2.25.1
v22-0002-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From a8024ebcc54b4ac0d3d145ade5d7cd85eb192afc Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v22 2/3] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 15 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 29 ++-
src/port/pg_popcount_avx512.c | 49 ++++++
src/port/pg_popcount_avx512_choose.c | 71 ++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 673 insertions(+), 5 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..2bbd81dfb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..96be29c22b 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 1f487a4bc3..e4e96952b7 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,21 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files. Note that we only build this when TRY_POPCNT_FAST is set so that we
+ * can fall back to pg_popcount_fast() as needed.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 6271acea60..2fa16b54b8 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,10 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -156,6 +159,10 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount_optimized = pg_popcount_fast_or_avx512;
+#endif
}
else
{
@@ -224,7 +231,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -266,6 +273,24 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * This is a wrapper function for pg_popcount_avx512() that uses
+ * pg_popcount_fast() when there aren't enough bytes to fit in an AVX-512
+ * register. The compiler should be able to inline pg_popcount_fast() so that
+ * we only take on additional function call overhead when it's likely to be a
+ * better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_fast_or_avx512(const char *buf, int bytes)
+{
+ if (bytes < 64)
+ return pg_popcount_fast(buf, bytes);
+ else
+ return pg_popcount_avx512(buf, bytes);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..f86558d1ee
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..9e81cd33ad
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) == 0xe0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
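
The availability check in pg_popcount_avx512_choose.c boils down to three bit tests: CPUID leaf 7 ECX bit 14 (avx512vpopcntdq), CPUID leaf 1 ECX bit 26 (xsave), and the ZMM-related bits 0xe0 of XCR0. A minimal sketch of that decision as a pure function follows, so it can be exercised without the hardware. The function and macro names are hypothetical and not part of the patch, and requiring all three XCR0 bits to be set is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as used in the patch; the macro names are hypothetical. */
#define CPUID7_ECX_AVX512_VPOPCNTDQ (UINT32_C(1) << 14)
#define CPUID1_ECX_XSAVE            (UINT32_C(1) << 26)
#define XCR0_ZMM_STATE              UINT64_C(0xe0)	/* opmask, ZMM_Hi256, Hi16_ZMM */

/*
 * Decide whether AVX512 popcount can be used, given register values that the
 * caller has already read via CPUID and XGETBV.
 */
static bool
avx512_popcnt_usable(uint32_t leaf7_ecx, uint32_t leaf1_ecx, uint64_t xcr0)
{
	/* does the CPU implement the instruction at all? */
	if ((leaf7_ecx & CPUID7_ECX_AVX512_VPOPCNTDQ) == 0)
		return false;

	/* is XGETBV available to query the OS-enabled state? */
	if ((leaf1_ecx & CPUID1_ECX_XSAVE) == 0)
		return false;

	/* assumption of this sketch: all three ZMM state bits must be enabled */
	return (xcr0 & XCR0_ZMM_STATE) == XCR0_ZMM_STATE;
}
```

Keeping the decision separate from the CPUID/XGETBV reads is only a convenience for testing; the patch performs the reads and the tests in one function.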
v22-0003-optimize-visibilitymap_count-with-AVX512.patch (text/x-diff; charset=us-ascii)
From 543e9f9b8b83de2429f7019bbdbc18b2509882c9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sun, 31 Mar 2024 22:22:15 -0500
Subject: [PATCH v22 3/3] optimize visibilitymap_count() with AVX512
---
src/backend/access/heap/visibilitymap.c | 25 +----
src/include/port/pg_bitutils.h | 27 ++++-
src/port/pg_bitutils.c | 143 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 25 +++++
4 files changed, 199 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index e4e96952b7..5d79e629c5 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,9 +303,11 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
-/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+/* Exported for use in the AVX512 implementation. */
extern uint64 pg_popcount_fast(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX512 popcount instruction on some systems.
@@ -317,6 +319,7 @@ extern uint64 pg_popcount_fast(const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -324,6 +327,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
@@ -357,6 +361,27 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+/*
+ * Returns the number of 1-bits in buf after applying the mask to each byte.
+ *
+ * Similar to pg_popcount(), we only take on the function pointer overhead when
+ * it's likely to be faster.
+ */
+static inline uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ if (bytes < 8)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+ return popcnt;
+ }
+
+ return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 2fa16b54b8..8beb70f62b 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,22 +106,26 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+static uint64 pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask);
#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -161,7 +165,10 @@ choose_popcount_functions(void)
pg_popcount_optimized = pg_popcount_fast;
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount_optimized = pg_popcount_fast_or_avx512;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast_or_avx512;
+ }
#endif
}
else
@@ -169,6 +176,7 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
+ pg_popcount_masked_optimized = pg_popcount_masked_slow;
}
}
@@ -193,6 +201,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -291,6 +306,74 @@ pg_popcount_fast_or_avx512(const char *buf, int bytes)
}
#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+/*
+ * pg_popcount_masked_fast
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
+/*
+ * This is a wrapper function for pg_popcount_masked_avx512() that uses
+ * pg_popcount_masked_fast() when there aren't enough bytes to fit in an
+ * AVX-512 register. The compiler should be able to inline
+ * pg_popcount_masked_fast() so that we only take on additional function call
+ * overhead when it's likely to be a better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask)
+{
+ if (bytes < 64)
+ return pg_popcount_masked_fast(buf, bytes, mask);
+ else
+ return pg_popcount_masked_avx512(buf, bytes, mask);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
@@ -390,6 +473,56 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_slow
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -421,4 +554,14 @@ pg_popcount_optimized(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+/*
+ * pg_popcount_masked_optimized
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee..fb9ab3313b 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -46,4 +46,29 @@ pg_popcount_avx512(const char *buf, int bytes)
return popcnt + pg_popcount_fast(buf, bytes);
}
+/*
+ * pg_popcount_masked_avx512
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i vmasked = _mm512_and_si512(val, maskv);
+ const __m512i cnt = _mm512_popcnt_epi64(vmasked);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_masked_fast(buf, bytes, mask);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
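
The masked variants rely on replicating the 8-bit mask into every byte of a wider word before ANDing. A minimal portable sketch of that idea follows, assuming a software popcount in place of the POPCNT or AVX512 instructions. All names here are hypothetical and not part of the patch. The division has to be done in an unsigned type: a plain ~0 is the int value -1, and -1 / 0xFF truncates to 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate an 8-bit mask into each byte of a 64-bit word. */
static uint64_t
broadcast_mask8(uint8_t mask)
{
	/* ~UINT64_C(0) / 0xFF is 0x0101010101010101; multiplying copies the byte */
	return ~UINT64_C(0) / 0xFF * mask;
}

/* Portable popcount (clears the lowest set bit on each iteration). */
static int
popcount64_portable(uint64_t word)
{
	int			n = 0;

	while (word != 0)
	{
		word &= word - 1;
		n++;
	}
	return n;
}

/* Count the 1-bits in buf after ANDing every byte with mask. */
static uint64_t
popcount_masked_sketch(const char *buf, size_t bytes, uint8_t mask)
{
	const uint64_t maskv = broadcast_mask8(mask);
	uint64_t	popcnt = 0;

	/* process eight bytes at a time */
	while (bytes >= sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* alignment-safe load */
		popcnt += popcount64_portable(word & maskv);
		buf += sizeof(word);
		bytes -= sizeof(word);
	}

	/* process any remaining bytes with the original 8-bit mask */
	while (bytes--)
		popcnt += popcount64_portable((unsigned char) *buf++ & mask);

	return popcnt;
}
```

With mask 0x55 this counts the lower bit of each bit pair and with 0xaa the upper bit, which is how visibilitymap_count() uses VISIBLE_MASK8 and FROZEN_MASK8.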
On Tue, Apr 02, 2024 at 05:20:20PM -0500, Nathan Bossart wrote:
> Sorry for the noise. I noticed a couple of silly mistakes immediately
> after sending v21.
Sigh... I missed a line while rebasing these patches, which seems to have
grossly offended cfbot. Apologies again for the noise.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v23-0001-inline-pg_popcount-for-small-numbers-of-bytes.patch (text/x-diff; charset=us-ascii)
From bfe2b3158378fd822c17fb251178df7557065cfd Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 2 Apr 2024 15:54:49 -0500
Subject: [PATCH v23 1/3] inline pg_popcount for small numbers of bytes
---
src/include/port/pg_bitutils.h | 34 ++++++++++++++++++++++++++++++++--
src/port/pg_bitutils.c | 12 ++++++------
2 files changed, 38 insertions(+), 8 deletions(-)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1f487a4bc3 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num)
/* Attempt to use the POPCNT instruction, but perform a runtime check first */
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
#endif /* TRY_POPCNT_FAST */
+/*
+ * Returns the number of 1-bits in buf.
+ *
+ * If there aren't many bytes to process, the function call overhead of the
+ * optimized versions isn't worth taking, so we inline a loop that consults
+ * pg_number_of_ones in that case. If there are many bytes to process, we
+ * accept the function call overhead because the optimized versions are likely
+ * to be faster.
+ */
+static inline uint64
+pg_popcount(const char *buf, int bytes)
+{
+ /*
+ * We use 8 bytes as the threshold because that's where we'll first use
+ * special instructions on 64-bit systems. A threshold of 4 bytes might
+ * make more sense on 32-bit systems, but it seems unlikely to make a
+ * tremendous difference.
+ */
+ if (bytes < 8)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++];
+ return popcnt;
+ }
+
+ return pg_popcount_optimized(buf, bytes);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..6271acea60 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64 (*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -155,13 +155,13 @@ choose_popcount_functions(void)
{
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
- pg_popcount = pg_popcount_fast;
+ pg_popcount_optimized = pg_popcount_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
- pg_popcount = pg_popcount_slow;
+ pg_popcount_optimized = pg_popcount_slow;
}
}
@@ -183,7 +183,7 @@ static uint64
pg_popcount_choose(const char *buf, int bytes)
{
choose_popcount_functions();
- return pg_popcount(buf, bytes);
+ return pg_popcount_optimized(buf, bytes);
}
/*
@@ -387,11 +387,11 @@ pg_popcount64(uint64 word)
}
/*
- * pg_popcount
+ * pg_popcount_optimized
* Returns the number of 1-bits in buf
*/
uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_optimized(const char *buf, int bytes)
{
return pg_popcount_slow(buf, bytes);
}
--
2.25.1
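
The structure introduced by this patch is an inline wrapper that handles short buffers itself and only goes through the function pointer for longer ones. A minimal standalone sketch of that shape follows. The names are hypothetical, and a bit-by-bit loop stands in for the pg_number_of_ones lookup table, so this is an illustration under those assumptions rather than the patch's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Out-of-line implementation, reached only through the function pointer. */
static uint64_t
popcount_out_of_line(const char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

	while (bytes--)
	{
		unsigned char b = (unsigned char) *buf++;

		/* clear the lowest set bit until none remain */
		while (b != 0)
		{
			b &= (unsigned char) (b - 1);
			popcnt++;
		}
	}
	return popcnt;
}

/* Stand-in for the pointer that is chosen at run time in the patch. */
static uint64_t (*popcount_optimized_sketch) (const char *buf, size_t bytes) =
	popcount_out_of_line;

/* Inline wrapper: short inputs never pay for the indirect call. */
static inline uint64_t
popcount_sketch(const char *buf, size_t bytes)
{
	if (bytes < 8)
	{
		uint64_t	popcnt = 0;

		while (bytes--)
		{
			unsigned char b = (unsigned char) *buf++;

			/* add up the bits of this byte one at a time */
			for (; b != 0; b >>= 1)
				popcnt += b & 1;
		}
		return popcnt;
	}

	return popcount_optimized_sketch(buf, bytes);
}
```

Callers keep using a single entry point; whether a given call takes the inline loop or the indirect call depends only on the byte count.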
v23-0002-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From da744d0614021cf002e4d9e292e5c874bd81a84e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v23 2/3] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 15 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 29 ++-
src/port/pg_popcount_avx512.c | 49 ++++++
src/port/pg_popcount_avx512_choose.c | 71 ++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 673 insertions(+), 5 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..2bbd81dfb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..96be29c22b 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 1f487a4bc3..e4e96952b7 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,21 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files. Note that we only build this when TRY_POPCNT_FAST is set so that we
+ * can fall back to pg_popcount_fast() as needed.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 6271acea60..2fa16b54b8 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,10 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -156,6 +159,10 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount_optimized = pg_popcount_fast_or_avx512;
+#endif
}
else
{
@@ -224,7 +231,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -266,6 +273,24 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * This is a wrapper function for pg_popcount_avx512() that uses
+ * pg_popcount_fast() when there aren't enough bytes to fit in an AVX-512
+ * register. The compiler should be able to inline pg_popcount_fast() so that
+ * we only take on additional function call overhead when it's likely to be a
+ * better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_fast_or_avx512(const char *buf, int bytes)
+{
+ if (bytes < 64)
+ return pg_popcount_fast(buf, bytes);
+ else
+ return pg_popcount_avx512(buf, bytes);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..f86558d1ee
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..9e81cd33ad
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
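The shape of the loop in pg_popcount_avx512() (whole 64-byte chunks accumulated per lane, one reduction at the end, then a fallback for the tail) can be tried without AVX512 hardware. The following is a portable sketch under that assumption, not the patch's code: an array of eight uint64_t stands in for the ZMM register, and popcount64_portable(), popcount_tail() and popcount_chunked() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES		8				/* eight 64-bit lanes, like one ZMM register */
#define CHUNK_BYTES ((int) (LANES * sizeof(uint64_t)))	/* 64 bytes */

/* Portable per-word popcount; stands in for the vector popcount per lane. */
static uint64_t
popcount64_portable(uint64_t w)
{
	uint64_t	cnt = 0;

	while (w)
	{
		w &= w - 1;				/* clear the lowest set bit */
		cnt++;
	}
	return cnt;
}

/* Byte-at-a-time fallback for whatever is left after the chunked loop. */
static uint64_t
popcount_tail(const char *buf, int bytes)
{
	uint64_t	cnt = 0;

	while (bytes-- > 0)
		cnt += popcount64_portable((unsigned char) *buf++);
	return cnt;
}

static uint64_t
popcount_chunked(const char *buf, int bytes)
{
	uint64_t	accum[LANES] = {0};
	uint64_t	total = 0;

	assert(bytes >= 0);

	for (; bytes >= CHUNK_BYTES; bytes -= CHUNK_BYTES)
	{
		uint64_t	val[LANES];

		memcpy(val, buf, CHUNK_BYTES);	/* plays the role of the unaligned load */
		for (int i = 0; i < LANES; i++)
			accum[i] += popcount64_portable(val[i]);	/* lane-wise add */
		buf += CHUNK_BYTES;
	}

	/* One horizontal reduction after the loop, not one per chunk. */
	for (int i = 0; i < LANES; i++)
		total += accum[i];

	return total + popcount_tail(buf, bytes);
}
```

Keeping the reduction outside the loop is the main difference from the micro-benchmark at the top of the thread, which reduced on every iteration. A 125-byte buffer of 0x55 bytes exercises both the chunked loop and the tail, and should give the same 500 that the new bit(1000) regression test expects.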
v23-0003-optimize-visibilitymap_count-with-AVX512.patch (text/x-diff; charset=us-ascii)
From fc2d7e02b154df54dc4dcc7788a77e62d1665f7a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sun, 31 Mar 2024 22:22:15 -0500
Subject: [PATCH v23 3/3] optimize visibilitymap_count() with AVX512
---
src/backend/access/heap/visibilitymap.c | 25 +---
src/include/port/pg_bitutils.h | 27 ++++-
src/port/pg_bitutils.c | 144 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 25 ++++
4 files changed, 200 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index e4e96952b7..5d79e629c5 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,9 +303,11 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
-/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+/* Exported for use in the AVX512 implementation. */
extern uint64 pg_popcount_fast(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX512 popcount instruction on some systems.
@@ -317,6 +319,7 @@ extern uint64 pg_popcount_fast(const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -324,6 +327,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
@@ -357,6 +361,27 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+/*
+ * Returns the number of 1-bits in buf after applying the mask to each byte.
+ *
+ * Similar to pg_popcount(), we only take on the function pointer overhead when
+ * it's likely to be faster.
+ */
+static inline uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ if (bytes < 8)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+ return popcnt;
+ }
+
+ return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 2fa16b54b8..d646bce230 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,22 +106,26 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+static uint64 pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask);
#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -159,9 +163,13 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast;
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount_optimized = pg_popcount_fast_or_avx512;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast_or_avx512;
+ }
#endif
}
else
@@ -169,6 +177,7 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
+ pg_popcount_masked_optimized = pg_popcount_masked_slow;
}
}
@@ -193,6 +202,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -291,6 +307,74 @@ pg_popcount_fast_or_avx512(const char *buf, int bytes)
}
#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+/*
+ * pg_popcount_masked_fast
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
+/*
+ * This is a wrapper function for pg_popcount_masked_avx512() that uses
+ * pg_popcount_masked_fast() when there aren't enough bytes to fit in an
+ * AVX-512 register. The compiler should be able to inline
+ * pg_popcount_masked_fast() so that we only take on additional function call
+ * overhead when it's likely to be a better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask)
+{
+ if (bytes < 64)
+ return pg_popcount_masked_fast(buf, bytes, mask);
+ else
+ return pg_popcount_masked_avx512(buf, bytes, mask);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
@@ -390,6 +474,56 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_slow
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -421,4 +555,14 @@ pg_popcount_optimized(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+/*
+ * pg_popcount_masked_optimized
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee..fb9ab3313b 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -46,4 +46,29 @@ pg_popcount_avx512(const char *buf, int bytes)
return popcnt + pg_popcount_fast(buf, bytes);
}
+/*
+ * pg_popcount_masked_avx512
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i vmasked = _mm512_and_si512(val, maskv);
+ const __m512i cnt = _mm512_popcnt_epi64(vmasked);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_masked_fast(buf, bytes, mask);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
I committed v23-0001. Here is a rebased version of the remaining patches.
I intend to test the masking idea from Ants next.
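As a rough illustration of the masking approach (a standalone sketch, not the patch itself): the 8-bit mask is replicated into every byte of a 64-bit word by multiplying it by 0x0101010101010101, which is what an expression such as ~UINT64CONST(0) / 0xFF * mask computes. The helper names below (broadcast_mask8, popcount_masked_bytewise, popcount_masked_words) are hypothetical, the portable bit-clearing loop only stands in for a real popcount instruction, and the 0x55 mask in the usage note is just an example value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy an 8-bit mask into all eight bytes of a word. */
static uint64_t
broadcast_mask8(uint8_t mask)
{
    /* ~0 / 0xFF == 0x0101010101010101, so the product repeats the byte. */
    return ~UINT64_C(0) / 0xFF * mask;
}

/* Reference version: mask and count one byte at a time. */
static uint64_t
popcount_masked_bytewise(const unsigned char *buf, size_t bytes, uint8_t mask)
{
    uint64_t cnt = 0;

    while (bytes--)
    {
        unsigned char b = *buf++ & mask;

        /* Clear the lowest set bit until none remain. */
        while (b)
        {
            b &= (unsigned char) (b - 1);
            cnt++;
        }
    }
    return cnt;
}

/* Word-at-a-time version: mask eight bytes per iteration. */
static uint64_t
popcount_masked_words(const unsigned char *buf, size_t bytes, uint8_t mask)
{
    const uint64_t maskv = broadcast_mask8(mask);
    uint64_t cnt = 0;

    while (bytes >= 8)
    {
        uint64_t w;

        /* memcpy avoids any alignment requirement on buf. */
        memcpy(&w, buf, sizeof(w));
        w &= maskv;
        while (w)
        {
            w &= w - 1;
            cnt++;
        }
        buf += 8;
        bytes -= 8;
    }

    /* Fewer than eight bytes remain; finish byte by byte. */
    return cnt + popcount_masked_bytewise(buf, bytes, mask);
}
```

Calling popcount_masked_words(buf, len, 0x55) counts only the even-numbered bits of each byte. Both functions should agree for any length, which makes the byte-wise version a convenient cross-check when testing a vectorized variant.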
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v24-0001-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From 295b03530de5f42fe876b4489191da2f8dc83194 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v24 1/2] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 15 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 29 ++-
src/port/pg_popcount_avx512.c | 49 ++++++
src/port/pg_popcount_avx512_choose.c | 71 ++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 673 insertions(+), 5 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..2bbd81dfb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..96be29c22b 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index de480da71e..2b35021bc6 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,21 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files. Note that we only build this when TRY_POPCNT_FAST is set so that we
+ * can fall back to pg_popcount_fast() as needed.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 6271acea60..2fa16b54b8 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,10 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -156,6 +159,10 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount_optimized = pg_popcount_fast_or_avx512;
+#endif
}
else
{
@@ -224,7 +231,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -266,6 +273,24 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * This is a wrapper function for pg_popcount_avx512() that uses
+ * pg_popcount_fast() when there aren't enough bytes to fit in an AVX-512
+ * register. The compiler should be able to inline pg_popcount_fast() so that
+ * we only take on additional function call overhead when it's likely to be a
+ * better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_fast_or_avx512(const char *buf, int bytes)
+{
+ if (bytes < 64)
+ return pg_popcount_fast(buf, bytes);
+ else
+ return pg_popcount_avx512(buf, bytes);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..f86558d1ee
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..9e81cd33ad
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
v24-0002-optimize-visibilitymap_count-with-AVX512.patch (text/x-diff; charset=us-ascii)
From 872394c3bf67aaa9db0975be4a4fdf3d8863b105 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sun, 31 Mar 2024 22:22:15 -0500
Subject: [PATCH v24 2/2] optimize visibilitymap_count() with AVX512
---
src/backend/access/heap/visibilitymap.c | 25 +---
src/include/port/pg_bitutils.h | 37 +++++-
src/port/pg_bitutils.c | 144 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 25 ++++
4 files changed, 210 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 2b35021bc6..a0d743d36b 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,9 +303,11 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
-/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+/* Exported for use in the AVX512 implementation. */
extern uint64 pg_popcount_fast(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX512 popcount instruction on some systems.
@@ -317,6 +319,7 @@ extern uint64 pg_popcount_fast(const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -324,6 +327,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
@@ -361,6 +365,37 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+/*
+ * Returns the number of 1-bits in buf after applying the mask to each byte.
+ *
+ * Similar to pg_popcount(), we only take on the function pointer overhead when
+ * it's likely to be faster.
+ */
+static inline uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ /*
+ * We set the threshold to the point at which we'll first use special
+ * instructions in the optimized version.
+ */
+#if SIZEOF_VOID_P >= 8
+ int threshold = 8;
+#else
+ int threshold = 4;
+#endif
+
+ if (bytes < threshold)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+ return popcnt;
+ }
+
+ return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 2fa16b54b8..11b69608fa 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,22 +106,26 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+static uint64 pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask);
#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -159,9 +163,13 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast;
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount_optimized = pg_popcount_fast_or_avx512;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast_or_avx512;
+ }
#endif
}
else
@@ -169,6 +177,7 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
+ pg_popcount_masked_optimized = pg_popcount_masked_slow;
}
}
@@ -193,6 +202,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -291,6 +307,74 @@ pg_popcount_fast_or_avx512(const char *buf, int bytes)
}
#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+/*
+ * pg_popcount_masked_fast
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~0 / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
+/*
+ * This is a wrapper function for pg_popcount_masked_avx512() that uses
+ * pg_popcount_masked_fast() when there aren't enough bytes to fit in an
+ * AVX-512 register. The compiler should be able to inline
+ * pg_popcount_masked_fast() so that we only take on additional function call
+ * overhead when it's likely to be a better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask)
+{
+ if (bytes < 64)
+ return pg_popcount_masked_fast(buf, bytes, mask);
+ else
+ return pg_popcount_masked_avx512(buf, bytes, mask);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
@@ -390,6 +474,56 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_slow
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~0 / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -421,4 +555,14 @@ pg_popcount_optimized(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+/*
+ * pg_popcount_masked_optimized
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee..fb9ab3313b 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -46,4 +46,29 @@ pg_popcount_avx512(const char *buf, int bytes)
return popcnt + pg_popcount_fast(buf, bytes);
}
+/*
+ * pg_popcount_masked_avx512
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i vmasked = _mm512_and_si512(val, maskv);
+ const __m512i cnt = _mm512_popcnt_epi64(vmasked);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_masked_fast(buf, bytes, mask);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
On Wed, Apr 03, 2024 at 12:41:27PM -0500, Nathan Bossart wrote:
> I committed v23-0001. Here is a rebased version of the remaining patches.
> I intend to test the masking idea from Ants next.
0002 was missing a cast that is needed for the 32-bit builds. I've fixed
that in v25.
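
For anyone following along, here is a minimal standalone sketch of why a
cast matters in a byte-broadcast expression of this shape. It assumes the
affected code is the "~0 / 0xFF * mask" initializer in the 32-bit (#else)
branches of v24-0002; the exact cast used in v25 is not shown here, and the
function names below are hypothetical, not part of the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only.
 *
 * Without a cast, ~0 has type int and value -1.  Integer division
 * truncates toward zero, so -1 / 0xFF is 0 and the broadcast mask
 * collapses to 0 no matter what mask is passed in.
 */
uint32_t
broadcast_mask32_uncast(uint8_t mask)
{
	return ~0 / 0xFF * mask;
}

/*
 * With an unsigned 32-bit operand, ~(uint32_t) 0 is 0xFFFFFFFF, and
 * 0xFFFFFFFF / 0xFF is 0x01010101.  Multiplying by the mask copies it
 * into each of the four bytes.
 */
uint32_t
broadcast_mask32_cast(uint8_t mask)
{
	return ~(uint32_t) 0 / 0xFF * mask;
}

/* The 64-bit form works the same way, yielding 0x0101010101010101 * mask. */
uint64_t
broadcast_mask64(uint8_t mask)
{
	return ~UINT64_C(0) / 0xFF * mask;
}
```

With mask = 0x55, the uncast version yields 0 (so every masked word would
count as zero bits), while the cast version yields 0x55555555 and the
64-bit version yields 0x5555555555555555.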
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v25-0001-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From fe001e38b3f209c2fe615a2c4c64109d5e4d3da1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v25 1/2] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 15 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 29 ++-
src/port/pg_popcount_avx512.c | 49 ++++++
src/port/pg_popcount_avx512_choose.c | 71 ++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 673 insertions(+), 5 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..b48ed7f271 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..2bbd81dfb8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 18b5be842e..96be29c22b 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += '-mavx512vpopcntdq'
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index de480da71e..2b35021bc6 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,21 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+extern uint64 pg_popcount_fast(const char *buf, int bytes);
+
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files. Note that we only build this when TRY_POPCNT_FAST is set so that we
+ * can fall back to pg_popcount_fast() as needed.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 6271acea60..2fa16b54b8 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -114,7 +114,10 @@ static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
-static uint64 pg_popcount_fast(const char *buf, int bytes);
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
@@ -156,6 +159,10 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount_optimized = pg_popcount_fast_or_avx512;
+#endif
}
else
{
@@ -224,7 +231,7 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
* pg_popcount_fast
* Returns the number of 1-bits in buf
*/
-static uint64
+uint64
pg_popcount_fast(const char *buf, int bytes)
{
uint64 popcnt = 0;
@@ -266,6 +273,24 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * This is a wrapper function for pg_popcount_avx512() that uses
+ * pg_popcount_fast() when there aren't enough bytes to fit in an AVX-512
+ * register. The compiler should be able to inline pg_popcount_fast() so that
+ * we only take on additional function call overhead when it's likely to be a
+ * better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_fast_or_avx512(const char *buf, int bytes)
+{
+ if (bytes < 64)
+ return pg_popcount_fast(buf, bytes);
+ else
+ return pg_popcount_avx512(buf, bytes);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..f86558d1ee
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_fast(buf, bytes);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..9e81cd33ad
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We rely on
+ * pg_popcount_fast() as our fallback implementation in pg_popcount_avx512().
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
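The runtime check in pg_popcount_avx512_available() above boils down to three bit tests. The following is a minimal sketch of that decision as a pure function over raw register values, so it can be exercised without AVX-512 hardware. The names cpu_regs and avx512_popcnt_usable are hypothetical and not part of the patch. The "strict" mode, which requires all three XCR0 state bits (opmask, ZMM_Hi256, Hi16_ZMM) to be set, is an assumption about what a tighter check could look like; the patch as posted only tests that the masked value is nonzero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions copied from the checks in the patch above. */
#define CPUID7_ECX_AVX512VPOPCNTDQ (UINT32_C(1) << 14)
#define CPUID1_ECX_XSAVE           (UINT32_C(1) << 26)
#define XCR0_AVX512_STATE          UINT64_C(0xe0)	/* opmask, ZMM_Hi256, Hi16_ZMM */

/* Hypothetical container for the register values the check looks at. */
typedef struct
{
	uint32_t	leaf7_ecx;		/* ECX from CPUID leaf 7, subleaf 0 */
	uint32_t	leaf1_ecx;		/* ECX from CPUID leaf 1 */
	uint64_t	xcr0;			/* result of XGETBV(0) */
} cpu_regs;

/*
 * Returns true if the AVX-512 popcount path looks usable.
 * strict = false mirrors the "!= 0" test in the patch;
 * strict = true additionally requires every AVX-512 state bit.
 */
static bool
avx512_popcnt_usable(const cpu_regs *regs, bool strict)
{
	assert(regs != NULL);

	if ((regs->leaf7_ecx & CPUID7_ECX_AVX512VPOPCNTDQ) == 0)
		return false;			/* no VPOPCNTDQ instruction */
	if ((regs->leaf1_ecx & CPUID1_ECX_XSAVE) == 0)
		return false;			/* XGETBV cannot be consulted */
	if (strict)
		return (regs->xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE;
	return (regs->xcr0 & XCR0_AVX512_STATE) != 0;
}
```

With an XCR0 value of 0x20 (only the opmask bit set), the two modes disagree: the non-strict test accepts it and the strict one rejects it.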
v25-0002-optimize-visibilitymap_count-with-AVX512.patch (text/x-diff; charset=us-ascii)
From 6682412342d64268f51b8f98e515d74a97e60309 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sun, 31 Mar 2024 22:22:15 -0500
Subject: [PATCH v25 2/2] optimize visibilitymap_count() with AVX512
---
src/backend/access/heap/visibilitymap.c | 25 +---
src/include/port/pg_bitutils.h | 37 +++++-
src/port/pg_bitutils.c | 144 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 25 ++++
4 files changed, 210 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 2b35021bc6..a0d743d36b 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,9 +303,11 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
-/* Export pg_popcount_fast() for use in the AVX512 implementation. */
+/* Exported for use in the AVX512 implementation. */
extern uint64 pg_popcount_fast(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX512 popcount instruction on some systems.
@@ -317,6 +319,7 @@ extern uint64 pg_popcount_fast(const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -324,6 +327,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
@@ -361,6 +365,37 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+/*
+ * Returns the number of 1-bits in buf after applying the mask to each byte.
+ *
+ * Similar to pg_popcount(), we only take on the function pointer overhead when
+ * it's likely to be faster.
+ */
+static inline uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ /*
+ * We set the threshold to the point at which we'll first use special
+ * instructions in the optimized version.
+ */
+#if SIZEOF_VOID_P >= 8
+ int threshold = 8;
+#else
+ int threshold = 4;
+#endif
+
+ if (bytes < threshold)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+ return popcnt;
+ }
+
+ return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 2fa16b54b8..934da27a62 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,22 +106,26 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
static uint64 pg_popcount_fast_or_avx512(const char *buf, int bytes);
+static uint64 pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask);
#endif
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -159,9 +163,13 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast;
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount_optimized = pg_popcount_fast_or_avx512;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast_or_avx512;
+ }
#endif
}
else
@@ -169,6 +177,7 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
+ pg_popcount_masked_optimized = pg_popcount_masked_slow;
}
}
@@ -193,6 +202,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -291,6 +307,74 @@ pg_popcount_fast_or_avx512(const char *buf, int bytes)
}
#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+/*
+ * pg_popcount_masked_fast
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
+/*
+ * This is a wrapper function for pg_popcount_masked_avx512() that uses
+ * pg_popcount_masked_fast() when there aren't enough bytes to fit in an
+ * AVX-512 register. The compiler should be able to inline
+ * pg_popcount_masked_fast() so that we only take on additional function call
+ * overhead when it's likely to be a better option.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+static uint64
+pg_popcount_masked_fast_or_avx512(const char *buf, int bytes, bits8 mask)
+{
+ if (bytes < 64)
+ return pg_popcount_masked_fast(buf, bytes, mask);
+ else
+ return pg_popcount_masked_avx512(buf, bytes, mask);
+}
+#endif /* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
+
#endif /* TRY_POPCNT_FAST */
@@ -390,6 +474,56 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_slow
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -421,4 +555,14 @@ pg_popcount_optimized(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+/*
+ * pg_popcount_masked_optimized
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee..fb9ab3313b 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -46,4 +46,29 @@ pg_popcount_avx512(const char *buf, int bytes)
return popcnt + pg_popcount_fast(buf, bytes);
}
+/*
+ * pg_popcount_masked_avx512
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+ {
+ const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+ const __m512i vmasked = _mm512_and_si512(val, maskv);
+ const __m512i cnt = _mm512_popcnt_epi64(vmasked);
+
+ accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
+ }
+
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return popcnt + pg_popcount_masked_fast(buf, bytes, mask);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
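The masked variants in the patch above widen the one-byte mask with the expression ~UINT64CONST(0) / 0xFF * mask. As a rough, portable illustration of why that works: dividing an all-ones 64-bit word by 0xFF yields 0x0101010101010101, and multiplying by the mask copies it into every byte. The helper names below (replicate_mask64, popcount64_portable, popcount_masked_portable) are made up for this sketch, and the bit counter is a simple stand-in for pg_popcount64(), not the PostgreSQL implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy an 8-bit mask into each of the 8 bytes of a 64-bit word. */
static uint64_t
replicate_mask64(uint8_t mask)
{
	/* ~0 / 0xFF == 0x0101010101010101; no byte product can carry */
	return ~UINT64_C(0) / 0xFF * mask;
}

/* Portable bit count (clears the lowest set bit each iteration). */
static int
popcount64_portable(uint64_t word)
{
	int			n = 0;

	while (word != 0)
	{
		word &= word - 1;
		n++;
	}
	return n;
}

/* Count the 1-bits in buf after ANDing every byte with mask. */
static uint64_t
popcount_masked_portable(const unsigned char *buf, int bytes, uint8_t mask)
{
	uint64_t	maskv = replicate_mask64(mask);
	uint64_t	popcnt = 0;

	assert(bytes >= 0);

	/* 8 bytes at a time; memcpy avoids any alignment requirement */
	while (bytes >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));
		popcnt += popcount64_portable(word & maskv);
		buf += 8;
		bytes -= 8;
	}

	/* remaining tail bytes */
	while (bytes-- > 0)
		popcnt += popcount64_portable((uint64_t) (*buf++ & mask));

	return popcnt;
}
```

For the visibility map masks, replicate_mask64(0x55) gives 0x5555555555555555 and replicate_mask64(0xaa) gives 0xaaaaaaaaaaaaaaaa, which are the 64-bit constants the patch removes from visibilitymap.c.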
On Tue, Apr 02, 2024 at 11:30:39PM +0300, Ants Aasma wrote:
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked out portions of a load instruction
will not generate an exception. To allow byte level granularity
masking, -mavx512bw is needed. Based on Wikipedia this will only
disable this fast path on Knights Mill (Xeon Phi), in all other cases
VPOPCNTQ implies availability of BW.
Sounds promising. IMHO we should really be sure that these kinds of loads
won't generate segfaults and the like due to the masked-out portions. I
searched around a little bit but haven't found anything that seemed
definitive.
After sleeping on the problem, I think we can avoid this question
altogether while making the code faster by using aligned accesses.
Loads that straddle cache line boundaries run internally as 2 load
operations. Gut feel says that there are enough out-of-order resources
available to make it not matter in most cases. But even so, not doing
the extra work is surely better. Attached is another approach that
does aligned accesses, and thereby avoids going outside bounds.
Would be interesting to see how well that fares in the small use case.
Anything that fits into one aligned cache line should be constant
speed, and there is only one branch, but the mask setup and folding
the separate popcounts together should add up to about 20-ish cycles
of overhead.
I tested your patch in comparison to v25 and saw the following:
bytes    v25         v25+ants
2        1108.205    1033.132
4        1311.227    1289.373
8        1927.954    2360.113
16       2281.091    2365.408
32       3856.992    2390.688
64       3648.72     3242.498
128      4108.549    3607.148
256      4910.076    4496.852
For 2 bytes and 4 bytes, the inlining should take effect, so any difference
there is likely just noise. At 8 bytes, we are calling the function
pointer, and there is a small regression with the masking approach.
However, by 16 bytes, the masking approach is on par with v25, and it wins
for all larger buffers, although the gains seem to taper off a bit.
If we can verify this approach won't cause segfaults and can stomach the
regression between 8 and 16 bytes, I'd happily pivot to this approach so
that we can avoid the function call dance that I have in v25.
Thoughts?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, 4 Apr 2024 at 11:50, Nathan Bossart <nathandbossart@gmail.com> wrote:
If we can verify this approach won't cause segfaults and can stomach the
regression between 8 and 16 bytes, I'd happily pivot to this approach so
that we can avoid the function call dance that I have in v25.
Thoughts?
If we're worried about regressions with some narrow range of byte
values, wouldn't it make more sense to compare that to cc4826dd5~1 at
the latest rather than to some version that's already probably faster
than PG16?
David
On Thu, 4 Apr 2024 at 01:50, Nathan Bossart <nathandbossart@gmail.com> wrote:
If we can verify this approach won't cause segfaults and can stomach the
regression between 8 and 16 bytes, I'd happily pivot to this approach so
that we can avoid the function call dance that I have in v25.
The approach I posted does not rely on masking performing page fault
suppression. All loads are 64 byte aligned and always contain at least
one byte of the buffer and therefore are guaranteed to be within a
valid page.
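To make the aligned-access idea concrete, here is a scalar model of it, not the attached patch itself. It walks the 64-byte-aligned blocks that overlap the buffer and builds, for each block, a 64-bit byte mask selecting only the bytes inside the buffer. In an AVX-512 version that mask would presumably feed a byte-granular masked load; here it just decides which bytes the loop reads. The names block_byte_mask and popcount_aligned_model are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 64				/* bytes in one 512-bit register */

/*
 * Byte mask for the aligned block starting at block_start: bit i is set
 * iff byte i of the block lies inside the buffer range [lo, hi).
 * The caller guarantees block_start <= lo's block and block_start < hi.
 */
static uint64_t
block_byte_mask(uintptr_t block_start, uintptr_t lo, uintptr_t hi)
{
	uint64_t	mask = ~UINT64_C(0);

	if (lo > block_start)		/* first block: drop bytes before buf */
		mask &= ~UINT64_C(0) << (lo - block_start);
	if (hi < block_start + BLOCK)	/* last block: drop bytes past the end */
		mask &= ~UINT64_C(0) >> (block_start + BLOCK - hi);
	return mask;
}

/* Count 1-bits in buf, visiting memory one aligned 64-byte block at a time. */
static uint64_t
popcount_aligned_model(const unsigned char *buf, int bytes)
{
	uintptr_t	lo = (uintptr_t) buf;
	uintptr_t	hi;
	uintptr_t	block = lo & ~(uintptr_t) (BLOCK - 1);	/* round down */
	uint64_t	popcnt = 0;

	assert(bytes >= 0);
	hi = lo + (uintptr_t) bytes;

	for (; block < hi; block += BLOCK)
	{
		uint64_t	mask = block_byte_mask(block, lo, hi);

		for (int i = 0; i < BLOCK; i++)
		{
			if (mask & (UINT64_C(1) << i))
			{
				/* selected bytes are always at or after buf */
				unsigned char b = buf[(block + (uintptr_t) i) - lo];

				for (; b != 0; b &= (unsigned char) (b - 1))
					popcnt++;
			}
		}
	}
	return popcnt;
}
```

Because each block is 64-byte aligned and 64 divides the page size, a block never straddles a page boundary, and every block visited holds at least one byte of the buffer. That is the property the paragraph above relies on.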
I personally don't mind it being slower for the very small cases,
because when performance on those sizes really matters it makes much
more sense to shoot for an inlined version instead.
Speaking of which, what does bumping up the inlined version threshold
to 16 do with and without AVX-512 available? Linearly extrapolating
the 2 and 4 byte numbers it might just come ahead in both cases,
making the choice easy.
Regards,
Ants Aasma
On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote:
On Thu, 4 Apr 2024 at 11:50, Nathan Bossart <nathandbossart@gmail.com> wrote:
If we can verify this approach won't cause segfaults and can stomach the
regression between 8 and 16 bytes, I'd happily pivot to this approach so
that we can avoid the function call dance that I have in v25.
If we're worried about regressions with some narrow range of byte
values, wouldn't it make more sense to compare that to cc4826dd5~1 at
the latest rather than to some version that's already probably faster
than PG16?
Good point. When compared with REL_16_STABLE, Ants's idea still wins:
bytes    v25         v25+ants    REL_16_STABLE
2        1108.205    1033.132    2039.342
4        1311.227    1289.373    3207.217
8        1927.954    2360.113    3200.238
16       2281.091    2365.408    4457.769
32       3856.992    2390.688    6206.689
64       3648.72     3242.498    9619.403
128      4108.549    3607.148    17912.081
256      4910.076    4496.852    33591.385
As before, with 2 and 4 bytes, HEAD is using the inlined approach, but
REL_16_STABLE is doing a function call. For 8 bytes, REL_16_STABLE is
doing a function call as well as a call to a function pointer. At 16
bytes, it's doing a function call and two calls to a function pointer.
With Ants's approach, both 8 and 16 bytes require a single call to a
function pointer, and of course we are using the AVX-512 implementation for
both.
I think this is sufficient to justify switching approaches.
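The call paths compared above can be sketched in a few lines. This is an illustrative model only: small inputs are counted inline, and anything at or above the threshold goes through a function pointer that resolves itself on first use. The threshold of 8 matches the 64-bit threshold in the patch; the names popcount_optimized, popcount_choose, popcount_impl and the pointer_calls counter are hypothetical and exist purely to make the number of indirect calls observable.

```c
#include <assert.h>
#include <stdint.h>

#define INLINE_THRESHOLD 8		/* below this, skip the function pointer */

static uint64_t popcount_choose(const unsigned char *buf, int bytes);

/* Starts out pointing at the chooser, which replaces it on first use. */
static uint64_t (*popcount_optimized) (const unsigned char *buf, int bytes) = popcount_choose;

static int	pointer_calls = 0;	/* instrumentation for this example only */

/* Simple byte-at-a-time bit count. */
static uint64_t
popcount_bytewise(const unsigned char *buf, int bytes)
{
	uint64_t	n = 0;

	assert(bytes >= 0);
	while (bytes-- > 0)
	{
		for (unsigned char b = *buf++; b != 0; b &= (unsigned char) (b - 1))
			n++;
	}
	return n;
}

/* Stand-in for the "fast" or AVX-512 implementation. */
static uint64_t
popcount_impl(const unsigned char *buf, int bytes)
{
	pointer_calls++;
	return popcount_bytewise(buf, bytes);
}

/* A real chooser would probe the CPU here before picking a target. */
static uint64_t
popcount_choose(const unsigned char *buf, int bytes)
{
	popcount_optimized = popcount_impl;
	return popcount_optimized(buf, bytes);
}

/* Entry point: inline for tiny inputs, one indirect call otherwise. */
static inline uint64_t
popcount(const unsigned char *buf, int bytes)
{
	if (bytes < INLINE_THRESHOLD)
		return popcount_bytewise(buf, bytes);
	return popcount_optimized(buf, bytes);
}
```

In this model a 4-byte input never touches the pointer, while 8-byte and 16-byte inputs each cost exactly one indirect call, which is the shape of the v25+ants path described above.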
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Apr 04, 2024 at 04:02:53PM +0300, Ants Aasma wrote:
Speaking of which, what does bumping up the inlined version threshold
to 16 do with and without AVX-512 available? Linearly extrapolating
the 2 and 4 byte numbers it might just come ahead in both cases,
making the choice easy.
IIRC the inlined version starts losing pretty quickly after 8 bytes. As I
noted in my previous message, I think we have enough data to switch to your
approach already, so I think it's a moot point.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Here is an updated patch set. IMHO this is in decent shape and is
approaching committable.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v26-0001-AVX512-popcount-support.patch (text/x-diff; charset=us-ascii)
From df59d3e78604e4530f5096bafc08ac94e13d82d2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v26 1/2] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 11 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 5 +
src/port/pg_popcount_avx512.c | 74 ++++++++
src/port/pg_popcount_avx512_choose.c | 81 +++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 682 insertions(+), 3 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..892b3c9580 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
+# -mavx512bw). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..72d20d3945 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq -mavx512bw"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..f35ce78a3c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 87437960bc..498e39cd3a 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq -mavx512bw',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq -mavx512bw',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'] + ['-mavx512bw'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += ['-mavx512vpopcntdq'] + ['-mavx512bw']
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index de480da71e..1512be1e4a 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,17 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 6271acea60..411be90f73 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -163,6 +163,11 @@ choose_popcount_functions(void)
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
}
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount_optimized = pg_popcount_avx512;
+#endif
}
static int
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..7c33c544e1
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,74 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We piggy-back on
+ * the function pointers that are only used when TRY_POPCNT_FAST is set.
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ __m512i val,
+ cnt;
+ __m512i accum = _mm512_setzero_si512();
+ const char *final;
+ int tail_idx;
+ __mmask64 mask = ~UINT64CONST(0);
+
+ /*
+ * Align buffer down to avoid double load overhead from unaligned access.
+ * Calculate a mask to ignore preceding bytes. Find start offset of final
+ * iteration and number of valid bytes making sure that final iteration is
+ * not empty.
+ */
+ mask <<= ((uintptr_t) buf) % sizeof(__m512i);
+ tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+ final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+
+ /*
+ * Iterate through all but the final iteration. Starting from second
+ * iteration, the start index mask is ignored.
+ */
+ for (; buf < final; buf += sizeof(__m512i))
+ {
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ mask = ~UINT64CONST(0);
+ }
+
+ /* Final iteration needs to ignore bytes that are not within the length */
+ mask &= (~UINT64CONST(0) >> (sizeof(__m512i) - tail_idx));
+
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ return _mm512_reduce_add_epi64(accum);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..4b0f1dc546
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We piggy-back on
+ * the function pointers that are only used when TRY_POPCNT_FAST is set.
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for AVX512-BW? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
v26-0002-optimize-visibilitymap_count-with-AVX512.patch (text/x-diff; charset=us-ascii)
From f9bc3f8389890fabb85441b1a5df392c173e2422 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sun, 31 Mar 2024 22:22:15 -0500
Subject: [PATCH v26 2/2] optimize visibilitymap_count() with AVX512
---
src/backend/access/heap/visibilitymap.c | 25 +----
src/include/port/pg_bitutils.h | 34 +++++++
src/port/pg_bitutils.c | 126 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 52 ++++++++++
4 files changed, 217 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 1512be1e4a..a66723a719 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,6 +303,7 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX512 popcount instruction on some systems.
@@ -313,6 +314,7 @@ extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -320,6 +322,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
@@ -357,6 +360,37 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+/*
+ * Returns the number of 1-bits in buf after applying the mask to each byte.
+ *
+ * Similar to pg_popcount(), we only take on the function pointer overhead when
+ * it's likely to be faster.
+ */
+static inline uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ /*
+ * We set the threshold to the point at which we'll first use special
+ * instructions in the optimized version.
+ */
+#if SIZEOF_VOID_P >= 8
+ int threshold = 8;
+#else
+ int threshold = 4;
+#endif
+
+ if (bytes < threshold)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+ return popcnt;
+ }
+
+ return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 411be90f73..88bc5cdbb1 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,19 +106,23 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
static uint64 pg_popcount_fast(const char *buf, int bytes);
+static uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -156,17 +160,22 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
+ pg_popcount_masked_optimized = pg_popcount_masked_slow;
}
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount_optimized = pg_popcount_avx512;
+ pg_popcount_masked_optimized = pg_popcount_masked_avx512;
+ }
#endif
}
@@ -191,6 +200,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -271,6 +287,56 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_fast
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -370,6 +436,56 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_slow
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -401,4 +517,14 @@ pg_popcount_optimized(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+/*
+ * pg_popcount_masked_optimized
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 7c33c544e1..dacc7553d2 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -71,4 +71,56 @@ pg_popcount_avx512(const char *buf, int bytes)
return _mm512_reduce_add_epi64(accum);
}
+/*
+ * pg_popcount_masked_avx512
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ __m512i val,
+ vmasked,
+ cnt;
+ __m512i accum = _mm512_setzero_si512();
+ const char *final;
+ int tail_idx;
+ __mmask64 bmask = ~UINT64CONST(0);
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ /*
+ * Align buffer down to avoid double load overhead from unaligned access.
+ * Calculate a mask to ignore preceding bytes. Find start offset of final
+ * iteration and number of valid bytes making sure that final iteration is
+ * not empty.
+ */
+ bmask <<= ((uintptr_t) buf) % sizeof(__m512i);
+ tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+ final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+
+ /*
+ * Iterate through all but the final iteration. Starting from second
+ * iteration, the start index mask is ignored.
+ */
+ for (; buf < final; buf += sizeof(__m512i))
+ {
+ val = _mm512_maskz_loadu_epi8(bmask, (const __m512i *) buf);
+ vmasked = _mm512_and_si512(val, maskv);
+ cnt = _mm512_popcnt_epi64(vmasked);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ bmask = ~UINT64CONST(0);
+ }
+
+ /* Final iteration needs to ignore bytes that are not within the length */
+ bmask &= (~UINT64CONST(0) >> (sizeof(__m512i) - tail_idx));
+
+ val = _mm512_maskz_loadu_epi8(bmask, (const __m512i *) buf);
+ vmasked = _mm512_and_si512(val, maskv);
+ cnt = _mm512_popcnt_epi64(vmasked);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ return _mm512_reduce_add_epi64(accum);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
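The masked variants in the patch above widen the 8-bit mask to a full word with `~UINT64CONST(0) / 0xFF * mask` and then count bits word by word. The following is only a minimal portable sketch of that idea, not the patch's code: the helper names (`broadcast_mask8`, `popcount64_portable`, `popcount_masked_bytes`, `popcount_masked_words`) are hypothetical, it uses `memcpy` instead of an alignment check, and it replaces `pg_number_of_ones[]` and the POPCNT instruction with a simple bit-clearing loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: repeat an 8-bit mask in every byte of a 64-bit word. */
static uint64_t
broadcast_mask8(uint8_t mask)
{
	/* ~0 / 0xFF is 0x0101010101010101, so the product copies mask into each byte */
	return ~UINT64_C(0) / 0xFF * mask;
}

/* Portable bit count for one word (clears the lowest set bit per iteration). */
static int
popcount64_portable(uint64_t word)
{
	int			n = 0;

	while (word)
	{
		word &= word - 1;
		n++;
	}
	return n;
}

/* Reference: apply the mask and count one byte at a time. */
static uint64_t
popcount_masked_bytes(const unsigned char *buf, size_t bytes, uint8_t mask)
{
	uint64_t	popcnt = 0;

	while (bytes--)
		popcnt += popcount64_portable((uint64_t) (*buf++ & mask));
	return popcnt;
}

/* Word-at-a-time version: eight bytes per step, then the leftover bytes. */
static uint64_t
popcount_masked_words(const unsigned char *buf, size_t bytes, uint8_t mask)
{
	const uint64_t maskv = broadcast_mask8(mask);
	uint64_t	popcnt = 0;

	while (bytes >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* no alignment assumption */
		popcnt += popcount64_portable(word & maskv);
		buf += 8;
		bytes -= 8;
	}

	/* Process any remaining bytes */
	while (bytes--)
		popcnt += popcount64_portable((uint64_t) (*buf++ & mask));

	return popcnt;
}
```

Because the mask is identical in every byte, the result does not depend on byte order, which is why the word-wise and byte-wise versions agree on any input.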
On Fri, 5 Apr 2024 at 07:15, Nathan Bossart <nathandbossart@gmail.com> wrote:
Here is an updated patch set. IMHO this is in decent shape and is
approaching committable.
I checked the code generation on various gcc and clang versions. It
looks mostly fine starting from versions where avx512 is supported,
gcc-7.1 and clang-5.
The main issue I saw was that clang was able to peel off the first
iteration of the loop and then eliminate the mask assignment and
replace masked load with a memory operand for vpopcnt. I was not able
to convince gcc to do that regardless of optimization options.
Generated code for the inner loop:
clang:
<L2>:
50: add rdx, 64
54: cmp rdx, rdi
57: jae <L1>
59: vpopcntq zmm1, zmmword ptr [rdx]
5f: vpaddq zmm0, zmm1, zmm0
65: jmp <L2>
gcc:
<L1>:
38: kmovq k1, rdx
3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
43: add rax, 64
47: mov rdx, -1
4e: vpopcntq zmm0, zmm0
54: vpaddq zmm0, zmm0, zmm1
5a: vmovdqa64 zmm1, zmm0
60: cmp rax, rsi
63: jb <L1>
I'm not sure how much that matters in practice. Attached is a patch to
do this manually giving essentially the same result in gcc. As most
distro packages are built using gcc I think it would make sense to
have the extra code if it gives a noticeable benefit for large cases.
The visibility map patch has the same issue, otherwise looks good.
Regards,
Ants Aasma
Attachments:
avx512-peel-first-iteration.patch
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index dacc7553d29..f6e718b86e9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -52,13 +52,21 @@ pg_popcount_avx512(const char *buf, int bytes)
* Iterate through all but the final iteration. Starting from second
* iteration, the start index mask is ignored.
*/
- for (; buf < final; buf += sizeof(__m512i))
+ if (buf < final)
{
val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
cnt = _mm512_popcnt_epi64(val);
accum = _mm512_add_epi64(accum, cnt);
+ buf += sizeof(__m512i);
mask = ~UINT64CONST(0);
+
+ for (; buf < final; buf += sizeof(__m512i))
+ {
+ val = _mm512_load_si512((const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ }
}
/* Final iteration needs to ignore bytes that are not within the length */
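For readers without AVX512 hardware, the head mask, tail mask, and peeled first iteration discussed above can be modelled in portable C. This is only a sketch under stated assumptions, not the patch's code: `BLOCK`, `block_popcount`, `popcount_peeled`, and `popcount_naive` are hypothetical names, a 64-bit byte-enable mask stands in for `__mmask64`, the `if` inside `block_popcount` stands in for the zero-masking load, and the buffer is addressed as an offset into an array whose index 0 is treated as a 64-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64				/* stands in for sizeof(__m512i) */

/*
 * Count the 1-bits in the bytes of one block whose byte-enable bit is set.
 * Disabled bytes are never read, mimicking a zero-masking load.
 */
static uint64_t
block_popcount(const unsigned char *block, uint64_t bmask)
{
	uint64_t	n = 0;

	for (int i = 0; i < BLOCK; i++)
	{
		if (bmask & (UINT64_C(1) << i))
		{
			unsigned char b = block[i];

			while (b)
			{
				b &= (unsigned char) (b - 1);
				n++;
			}
		}
	}
	return n;
}

/* Popcount of mem[start .. start + bytes), first iteration peeled. */
static uint64_t
popcount_peeled(const unsigned char *mem, size_t start, size_t bytes)
{
	uint64_t	accum = 0;
	uint64_t	bmask = ~UINT64_C(0);
	size_t		pos;
	size_t		final;
	size_t		tail_idx;

	if (bytes == 0)
		return 0;

	/* Head mask ignores bytes before start; tail_idx counts valid tail bytes */
	bmask <<= start % BLOCK;
	tail_idx = ((start + bytes - 1) % BLOCK) + 1;
	final = (start + bytes - 1) / BLOCK * BLOCK;
	pos = start / BLOCK * BLOCK;

	if (pos < final)
	{
		/* Peeled first iteration: the only one that needs the head mask */
		accum += block_popcount(mem + pos, bmask);
		pos += BLOCK;
		bmask = ~UINT64_C(0);

		/* Middle iterations are full blocks and need no mask at all */
		for (; pos < final; pos += BLOCK)
			accum += block_popcount(mem + pos, ~UINT64_C(0));
	}

	/* Final iteration needs to ignore bytes that are not within the length */
	bmask &= ~UINT64_C(0) >> (BLOCK - tail_idx);
	accum += block_popcount(mem + pos, bmask);

	return accum;
}

/* Reference implementation used only to cross-check the sketch. */
static uint64_t
popcount_naive(const unsigned char *mem, size_t start, size_t bytes)
{
	uint64_t	n = 0;

	for (size_t i = start; i < start + bytes; i++)
		for (unsigned char b = mem[i]; b; b &= (unsigned char) (b - 1))
			n++;
	return n;
}
```

When the whole range fits in one block, `pos == final`, the peeled branch is skipped, and the single final iteration applies the head and tail masks together.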
On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote:
The main issue I saw was that clang was able to peel off the first
iteration of the loop and then eliminate the mask assignment and
replace masked load with a memory operand for vpopcnt. I was not able
to convince gcc to do that regardless of optimization options.
Generated code for the inner loop:
clang:
<L2>:
50: add rdx, 64
54: cmp rdx, rdi
57: jae <L1>
59: vpopcntq zmm1, zmmword ptr [rdx]
5f: vpaddq zmm0, zmm1, zmm0
65: jmp <L2>
gcc:
<L1>:
38: kmovq k1, rdx
3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
43: add rax, 64
47: mov rdx, -1
4e: vpopcntq zmm0, zmm0
54: vpaddq zmm0, zmm0, zmm1
5a: vmovdqa64 zmm1, zmm0
60: cmp rax, rsi
63: jb <L1>
I'm not sure how much that matters in practice. Attached is a patch to
do this manually giving essentially the same result in gcc. As most
distro packages are built using gcc I think it would make sense to
have the extra code if it gives a noticeable benefit for large cases.
Yeah, I did see this, but I also wasn't sure if it was worth further
complicating the code. I can test with and without your fix and see if it
makes any difference in the benchmarks.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Apr 05, 2024 at 07:58:44AM -0500, Nathan Bossart wrote:
Yeah, I did see this, but I also wasn't sure if it was worth further
complicating the code. I can test with and without your fix and see if it
makes any difference in the benchmarks.
This seems to provide a small performance boost, so I've incorporated it
into v27.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v27-0001-AVX512-popcount-support.patch
From 9fc4b7556b72d51fce676db84b446099767efff3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v27 1/2] AVX512 popcount support
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 11 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 5 +
src/port/pg_popcount_avx512.c | 82 +++++++++
src/port/pg_popcount_avx512_choose.c | 81 +++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 690 insertions(+), 3 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..892b3c9580 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
+# -mavx512bw). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..72d20d3945 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq -mavx512bw"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..f35ce78a3c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 87437960bc..498e39cd3a 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX512 popcount without -mavx512vpopcntdq -mavx512bw',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX512 popcount with -mavx512vpopcntdq -mavx512bw',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'] + ['-mavx512bw'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += ['-mavx512vpopcntdq'] + ['-mavx512bw']
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..de067e6182 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index de480da71e..1512be1e4a 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,17 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 6271acea60..411be90f73 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -163,6 +163,11 @@ choose_popcount_functions(void)
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
}
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount_optimized = pg_popcount_avx512;
+#endif
}
static int
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..03a5bfb137
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,82 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We piggy-back on
+ * the function pointers that are only used when TRY_POPCNT_FAST is set.
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ __m512i val,
+ cnt;
+ __m512i accum = _mm512_setzero_si512();
+ const char *final;
+ int tail_idx;
+ __mmask64 mask = ~UINT64CONST(0);
+
+ /*
+ * Align buffer down to avoid double load overhead from unaligned access.
+ * Calculate a mask to ignore preceding bytes. Find start offset of final
+ * iteration and number of valid bytes making sure that final iteration is
+ * not empty.
+ */
+ mask <<= ((uintptr_t) buf) % sizeof(__m512i);
+ tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+ final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+
+ /*
+ * Iterate through all but the final iteration. Starting from second
+ * iteration, the start index mask is ignored.
+ */
+ if (buf < final)
+ {
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ buf += sizeof(__m512i);
+ mask = ~UINT64CONST(0);
+
+ for (; buf < final; buf += sizeof(__m512i))
+ {
+ val = _mm512_load_si512((const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ }
+ }
+
+ /* Final iteration needs to ignore bytes that are not within the length */
+ mask &= (~UINT64CONST(0) >> (sizeof(__m512i) - tail_idx));
+
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ return _mm512_reduce_add_epi64(accum);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..4b0f1dc546
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX512 intrinsics, but we check it anyway to be sure. We piggy-back on
+ * the function pointers that are only used when TRY_POPCNT_FAST is set.
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for AVX512-BW? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#endif
+ if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) != 0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
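The 0001 patch above gates the AVX-512 routine on a runtime probe and then swaps a function pointer so later calls skip the probe. A minimal portable sketch of that pattern follows. The bit positions mirror the ones tested in pg_popcount_avx512_available(); everything else (the cpu_probe struct, popcount_wide_stub, the probed global) is hypothetical and stands in for real CPUID/XGETBV calls and the real AVX-512 code. One deliberate difference, stated as an assumption: the sketch requires all three XCR0 bits in 0xe0 to be set, which is stricter than the "!= 0" test in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Feature bits consulted by the runtime check in the patch. */
#define LEAF7_ECX_AVX512_VPOPCNTDQ (UINT32_C(1) << 14)
#define LEAF7_EBX_AVX512_BW        (UINT32_C(1) << 30)
#define LEAF1_ECX_XSAVE            (UINT32_C(1) << 26)
#define XCR0_ZMM_STATE             UINT64_C(0xe0)	/* opmask, ZMM_Hi256, Hi16_ZMM */

/* Hypothetical container for the register values a real probe would fetch. */
typedef struct
{
	uint32_t	leaf7_ebx;		/* CPUID leaf 7, subleaf 0, EBX */
	uint32_t	leaf7_ecx;		/* CPUID leaf 7, subleaf 0, ECX */
	uint32_t	leaf1_ecx;		/* CPUID leaf 1, ECX */
	uint64_t	xcr0;			/* result of XGETBV(0) */
} cpu_probe;

/* Pure predicate over already-fetched values, so it runs on any machine. */
static bool
avx512_popcnt_usable(const cpu_probe *p)
{
	assert(p != NULL);
	if ((p->leaf7_ecx & LEAF7_ECX_AVX512_VPOPCNTDQ) == 0)
		return false;
	if ((p->leaf7_ebx & LEAF7_EBX_AVX512_BW) == 0)
		return false;
	if ((p->leaf1_ecx & LEAF1_ECX_XSAVE) == 0)
		return false;
	/* Assumption: require every ZMM-related state bit, not just one. */
	return (p->xcr0 & XCR0_ZMM_STATE) == XCR0_ZMM_STATE;
}

/* Portable fallback: clear the lowest set bit until the byte is zero. */
static uint64_t
popcount_portable(const char *buf, int bytes)
{
	uint64_t	n = 0;

	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		for (; b; b &= (unsigned char) (b - 1))
			n++;
	}
	return n;
}

/* Stand-in for the AVX-512 routine; it only reuses the portable code. */
static uint64_t
popcount_wide_stub(const char *buf, int bytes)
{
	return popcount_portable(buf, bytes);
}

static uint64_t popcount_choose(const char *buf, int bytes);

/* The pointer starts at the chooser, so the first call performs the probe. */
static uint64_t (*popcount_impl) (const char *buf, int bytes) = popcount_choose;

/* Hypothetical probe result; all zeroes means "no AVX-512 support". */
static cpu_probe probed = {0, 0, 0, 0};

static uint64_t
popcount_choose(const char *buf, int bytes)
{
	popcount_impl = avx512_popcnt_usable(&probed) ? popcount_wide_stub
		: popcount_portable;
	return popcount_impl(buf, bytes);
}
```

After the first call through popcount_impl, the pointer no longer refers to popcount_choose, so the probe cost is paid once per process.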
v27-0002-optimize-visibilitymap_count-with-AVX512.patch (text/x-diff; charset=us-ascii)
From 299cdad98783b81b3f1418128089f5937e62cc84 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sun, 31 Mar 2024 22:22:15 -0500
Subject: [PATCH v27 2/2] optimize visibilitymap_count() with AVX512
---
src/backend/access/heap/visibilitymap.c | 25 +----
src/include/port/pg_bitutils.h | 34 +++++++
src/port/pg_bitutils.c | 126 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 61 ++++++++++++
4 files changed, 226 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 1512be1e4a..a66723a719 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,6 +303,7 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX512 popcount instruction on some systems.
@@ -313,6 +314,7 @@ extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -320,6 +322,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
@@ -357,6 +360,37 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+/*
+ * Returns the number of 1-bits in buf after applying the mask to each byte.
+ *
+ * Similar to pg_popcount(), we only take on the function pointer overhead when
+ * it's likely to be faster.
+ */
+static inline uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ /*
+ * We set the threshold to the point at which we'll first use special
+ * instructions in the optimized version.
+ */
+#if SIZEOF_VOID_P >= 8
+ int threshold = 8;
+#else
+ int threshold = 4;
+#endif
+
+ if (bytes < threshold)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+ return popcnt;
+ }
+
+ return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 411be90f73..88bc5cdbb1 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,19 +106,23 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
static uint64 pg_popcount_fast(const char *buf, int bytes);
+static uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -156,17 +160,22 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
+ pg_popcount_masked_optimized = pg_popcount_masked_slow;
}
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount_optimized = pg_popcount_avx512;
+ pg_popcount_masked_optimized = pg_popcount_masked_avx512;
+ }
#endif
}
@@ -191,6 +200,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -271,6 +287,56 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_fast
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -370,6 +436,56 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_slow
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -401,4 +517,14 @@ pg_popcount_optimized(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+/*
+ * pg_popcount_masked_optimized
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 03a5bfb137..c05210872c 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -79,4 +79,65 @@ pg_popcount_avx512(const char *buf, int bytes)
return _mm512_reduce_add_epi64(accum);
}
+/*
+ * pg_popcount_masked_avx512
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ __m512i val,
+ vmasked,
+ cnt;
+ __m512i accum = _mm512_setzero_si512();
+ const char *final;
+ int tail_idx;
+ __mmask64 bmask = ~UINT64CONST(0);
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ /*
+ * Align buffer down to avoid double load overhead from unaligned access.
+ * Calculate a mask to ignore preceding bytes. Find start offset of final
+ * iteration and number of valid bytes making sure that final iteration is
+ * not empty.
+ */
+ bmask <<= ((uintptr_t) buf) % sizeof(__m512i);
+ tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+ final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+
+ /*
+ * Iterate through all but the final iteration. Starting from second
+ * iteration, the start index mask is ignored.
+ */
+ if (buf < final)
+ {
+ val = _mm512_maskz_loadu_epi8(bmask, (const __m512i *) buf);
+ vmasked = _mm512_and_si512(val, maskv);
+ cnt = _mm512_popcnt_epi64(vmasked);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ buf += sizeof(__m512i);
+ bmask = ~UINT64CONST(0);
+
+ for (; buf < final; buf += sizeof(__m512i))
+ {
+ val = _mm512_load_si512((const __m512i *) buf);
+ vmasked = _mm512_and_si512(val, maskv);
+ cnt = _mm512_popcnt_epi64(vmasked);
+ accum = _mm512_add_epi64(accum, cnt);
+ }
+ }
+
+ /* Final iteration needs to ignore bytes that are not within the length */
+ bmask &= (~UINT64CONST(0) >> (sizeof(__m512i) - tail_idx));
+
+ val = _mm512_maskz_loadu_epi8(bmask, (const __m512i *) buf);
+ vmasked = _mm512_and_si512(val, maskv);
+ cnt = _mm512_popcnt_epi64(vmasked);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ return _mm512_reduce_add_epi64(accum);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
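The 0002 patch above replaces the 64-bit constants VISIBLE_MASK64 and FROZEN_MASK64 with 8-bit masks and rebuilds the wide mask with the expression ~UINT64CONST(0) / 0xFF * mask. The arithmetic works because 0xFFFFFFFFFFFFFFFF / 0xFF equals 0x0101010101010101, and multiplying that by an 8-bit value copies the value into every byte. The sketch below checks this in portable C; the function names are hypothetical, and memcpy is used for the word loads so the sketch makes no alignment assumptions (the patch instead takes the word path only for aligned buffers).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replicate an 8-bit mask into every byte of a 64-bit word. */
static uint64_t
broadcast_mask8(uint8_t mask)
{
	/* ~0 / 0xFF is 0x0101010101010101; the product repeats mask 8 times. */
	return ~UINT64_C(0) / 0xFF * mask;
}

/* Portable 64-bit popcount: clear the lowest set bit until zero. */
static int
popcount64_portable(uint64_t w)
{
	int			n = 0;

	for (; w; w &= w - 1)
		n++;
	return n;
}

/* Reference: apply the 8-bit mask one byte at a time. */
static uint64_t
masked_popcount_bytes(const unsigned char *buf, int bytes, uint8_t mask)
{
	uint64_t	n = 0;

	assert(bytes >= 0);
	while (bytes-- > 0)
		n += popcount64_portable((uint64_t) (*buf++ & mask));
	return n;
}

/* Word-at-a-time variant using the broadcast mask, then a byte tail. */
static uint64_t
masked_popcount_words(const unsigned char *buf, int bytes, uint8_t mask)
{
	const uint64_t maskv = broadcast_mask8(mask);
	uint64_t	n = 0;

	assert(bytes >= 0);
	while (bytes >= 8)
	{
		uint64_t	w;

		memcpy(&w, buf, sizeof(w));
		n += popcount64_portable(w & maskv);
		buf += 8;
		bytes -= 8;
	}
	return n + masked_popcount_bytes(buf, bytes, mask);
}
```

With mask 0x55 the broadcast yields 0x5555555555555555, the value the removed VISIBLE_MASK64 macro spelled out by hand, and 0xaa yields the old FROZEN_MASK64.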
On Sat, 6 Apr 2024 at 04:38, Nathan Bossart <nathandbossart@gmail.com> wrote:
This seems to provide a small performance boost, so I've incorporated it
into v27.
Won't Valgrind complain about this?
+pg_popcount_avx512(const char *buf, int bytes)
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
David
On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote:
Won't Valgrind complain about this?
+pg_popcount_avx512(const char *buf, int bytes)
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
I haven't been able to generate any complaints, at least with some simple
tests. But I see your point. If this did cause such complaints, ISTM we'd
just want to add it to the suppression file. Otherwise, I think we'd have
to go back to the non-maskz approach (which I really wanted to avoid
because of the weird function overhead juggling) or find another way to do
a partial load into an __m512i.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
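The concern in this exchange is that TYPEALIGN_DOWN moves buf to an address before the caller's buffer, so the 64-byte load nominally covers bytes that do not belong to it. The head and tail masks are what keep those bytes out of the result. The following is a scalar model of that mask arithmetic, a sketch under assumptions rather than the patch's code: BLOCK stands in for sizeof(__m512i), masked_block_popcount is a hypothetical stand-in for the maskz load plus popcount, and it reads a byte only when its mask bit is set, so it never touches memory outside [buf, buf + bytes).

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 64				/* stands in for sizeof(__m512i) */

/* Count the set bits in one byte. */
static int
byte_popcount(unsigned char b)
{
	int			n = 0;

	for (; b; b &= (unsigned char) (b - 1))
		n++;
	return n;
}

/* Model of a zero-masked block load: bytes with a clear mask bit are skipped. */
static uint64_t
masked_block_popcount(const unsigned char *block, uint64_t mask)
{
	uint64_t	n = 0;

	for (int i = 0; i < BLOCK; i++)
	{
		if (mask & (UINT64_C(1) << i))
			n += (uint64_t) byte_popcount(block[i]);
	}
	return n;
}

/* Same head/tail mask arithmetic as the patch, on 64-byte blocks. */
static uint64_t
popcount_masked_blocks(const char *buf, int bytes)
{
	uintptr_t	start = (uintptr_t) buf;
	uintptr_t	last;
	uintptr_t	cur;
	uintptr_t	final;
	uint64_t	mask = ~UINT64_C(0);
	uint64_t	total = 0;
	int			tail_idx;

	assert(bytes > 0);			/* the final block must not be empty */

	last = start + (uintptr_t) bytes - 1;
	mask <<= start % BLOCK;		/* drop the bytes that precede buf */
	tail_idx = (int) (last % BLOCK) + 1;	/* valid bytes in final block */
	final = last - last % BLOCK;	/* start of the final block */
	cur = start - start % BLOCK;	/* buf aligned down */

	/* Every block except the final one; only the first uses the head mask. */
	for (; cur < final; cur += BLOCK)
	{
		total += masked_block_popcount((const unsigned char *) cur, mask);
		mask = ~UINT64_C(0);
	}

	/* Final block: also drop the bytes past the end of the buffer. */
	mask &= ~UINT64_C(0) >> (BLOCK - tail_idx);
	total += masked_block_popcount((const unsigned char *) cur, mask);

	return total;
}
```

When the whole buffer fits in one block, the loop body never runs and the final step combines the head and tail masks, which is why mask is only reset to all ones after a full block has been consumed.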
On Sat, 6 Apr 2024 at 14:17, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote:
Won't Valgrind complain about this?
+pg_popcount_avx512(const char *buf, int bytes)
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
I haven't been able to generate any complaints, at least with some simple
tests. But I see your point. If this did cause such complaints, ISTM we'd
just want to add it to the suppression file. Otherwise, I think we'd have
to go back to the non-maskz approach (which I really wanted to avoid
because of the weird function overhead juggling) or find another way to do
a partial load into an __m512i.
[1] seems to think it's ok. If this is true, then the program below
shouldn't segfault. It seems to run without any issue, and if I change the mask
to 1 it crashes, as you'd expect.
[1]: https://stackoverflow.com/questions/54497141/when-using-a-mask-register-with-avx-512-load-and-stores-is-a-fault-raised-for-i
#include <immintrin.h>
#include <stdio.h>
int main(void)
{
__m512i val;
val = _mm512_maskz_loadu_epi8((__mmask64) 0, NULL);
printf("%llu\n", _mm512_reduce_add_epi64(val));
return 0;
}
gcc avx512.c -o avx512 -O0 -mavx512f -march=native
David
On Sat, Apr 06, 2024 at 02:51:39PM +1300, David Rowley wrote:
On Sat, 6 Apr 2024 at 14:17, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote:
Won't Valgrind complain about this?
+pg_popcount_avx512(const char *buf, int bytes)
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
I haven't been able to generate any complaints, at least with some simple
tests. But I see your point. If this did cause such complaints, ISTM we'd
just want to add it to the suppression file. Otherwise, I think we'd have
to go back to the non-maskz approach (which I really wanted to avoid
because of the weird function overhead juggling) or find another way to do
a partial load into an __m512i.
[1] seems to think it's ok. If this is true then the following
shouldn't segfault:
The following seems to run without any issue and if I change the mask
to 1 it crashes, as you'd expect.
Cool.
Here is what I have staged for commit, which I intend to do shortly. At
some point, I'd like to revisit converting TRY_POPCNT_FAST to a
configure-time check and maybe even moving the "fast" and "slow"
implementations to their own files, but since that's mostly for code
neatness and we are rapidly approaching the v17 deadline, I'm content to
leave that for v18.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v28-0001-Optimize-pg_popcount-with-AVX-512-instructions.patch (text/x-diff; charset=us-ascii)
From 9eea492222555cbd14c7871085e159c9b0b78e92 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v28 1/2] Optimize pg_popcount() with AVX-512 instructions.
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can perform these tasks in 512-bit chunks, which can provide a nice
speedup, especially for larger buffers.
This commit introduces the infrastructure required to detect both
compiler and CPU support for the required AVX-512 intrinsic
functions, and it makes use of that infrastructure in a new
pg_popcount() implementation. If CPU support for this optimized
implementation is detected at runtime, a function pointer is
updated so that it is used for subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
nicely from these instructions, and calls for smaller buffers
should not regress when compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that work is left for a follow-up
commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
---
config/c-compiler.m4 | 58 ++++++
configure | 252 +++++++++++++++++++++++++++
configure.ac | 51 ++++++
meson.build | 87 +++++++++
src/Makefile.global.in | 5 +
src/include/pg_config.h.in | 12 ++
src/include/port/pg_bitutils.h | 11 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 11 ++
src/port/meson.build | 6 +-
src/port/pg_bitutils.c | 5 +
src/port/pg_popcount_avx512.c | 82 +++++++++
src/port/pg_popcount_avx512_choose.c | 87 +++++++++
src/test/regress/expected/bit.out | 24 +++
src/test/regress/sql/bit.sql | 4 +
15 files changed, 696 insertions(+), 3 deletions(-)
create mode 100644 src/port/pg_popcount_avx512.c
create mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..cfff48c1bc 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# ---------------------
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [return _xgetbv(0) & 0xe0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_XSAVE="$1"
+ pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -----------------------------
+# Check if the compiler supports the AVX-512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
+# -mavx512bw). If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_POPCNT="$1"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..cfbd2a096f 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,9 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
+PG_POPCNT_OBJS
+CFLAGS_POPCNT
+CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17404,6 +17407,40 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__get_cpuid_count="yes"
+else
+ pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
$as_echo_n "checking for __cpuid... " >&6; }
if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17475,221 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for __cpuidex... " >&6; }
+if ${pgac_cv__cpuidex+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <intrin.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv__cpuidex="yes"
+else
+ pgac_cv__cpuidex="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
+$as_echo "$pgac_cv__cpuidex" >&6; }
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+
+$as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
+
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
+if ${pgac_cv_xsave_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics_=yes
+else
+ pgac_cv_xsave_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
+$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
+if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
+ CFLAGS_XSAVE=""
+ pgac_xsave_intrinsics=yes
+fi
+
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
+$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
+if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mxsave"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+return _xgetbv(0) & 0xe0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_xsave_intrinsics__mxsave=yes
+else
+ pgac_cv_xsave_intrinsics__mxsave=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
+$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
+if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
+ CFLAGS_XSAVE="-mxsave"
+ pgac_xsave_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+
+$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
+
+fi
+
+
+# Check for AVX-512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics_=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
+ CFLAGS_POPCNT=""
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes
+else
+ pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" = x"yes"; then
+ CFLAGS_POPCNT="-mavx512vpopcntdq -mavx512bw"
+ pgac_avx512_popcnt_intrinsics=yes
+fi
+
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+
+$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+ fi
+fi
+
+
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 57f734879e..67e738d92b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,6 +2052,17 @@ if test x"$pgac_cv__get_cpuid" = x"yes"; then
AC_DEFINE(HAVE__GET_CPUID, 1, [Define to 1 if you have __get_cpuid.])
fi
+AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ ]])],
+ [pgac_cv__get_cpuid_count="yes"],
+ [pgac_cv__get_cpuid_count="no"])])
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+ AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
+fi
+
AC_CACHE_CHECK([for __cpuid], [pgac_cv__cpuid],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
[[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2063,6 +2074,46 @@ if test x"$pgac_cv__cpuid" = x"yes"; then
AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.])
fi
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
+ [[unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ ]])],
+ [pgac_cv__cpuidex="yes"],
+ [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+ AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
+fi
+
+# Check for XSAVE intrinsics
+#
+CFLAGS_XSAVE=""
+PGAC_XSAVE_INTRINSICS([])
+if test x"$pgac_xsave_intrinsics" != x"yes"; then
+ PGAC_XSAVE_INTRINSICS([-mxsave])
+fi
+if test x"$pgac_xsave_intrinsics" = x"yes"; then
+ AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
+fi
+AC_SUBST(CFLAGS_XSAVE)
+
+# Check for AVX-512 popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+if test x"$host_cpu" = x"x86_64"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([])
+ if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
+ PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
+ fi
+ if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
+ PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
+ AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
+ fi
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/meson.build b/meson.build
index 87437960bc..5acf083ce3 100644
--- a/meson.build
+++ b/meson.build
@@ -1783,6 +1783,30 @@ elif cc.links('''
endif
+# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+if cc.links('''
+ #include <cpuid.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+ }
+ ''', name: '__get_cpuid_count',
+ args: test_c_args)
+ cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+ #include <intrin.h>
+ int main(int arg, char **argv)
+ {
+ unsigned int exx[4] = {0, 0, 0, 0};
+ __cpuidex(exx, 7, 0);
+ }
+ ''', name: '__cpuidex',
+ args: test_c_args)
+ cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
# Defend against clang being used on x86-32 without SSE2 enabled. As current
# versions of clang do not understand -fexcess-precision=standard, the use of
# x87 floating point operations leads to problems like isinf possibly returning
@@ -1996,6 +2020,69 @@ int main(void)
endif
+###############################################################
+# Check for the availability of XSAVE intrinsics.
+###############################################################
+
+cflags_xsave = []
+if host_cpu == 'x86' or host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ return _xgetbv(0) & 0xe0;
+}
+'''
+
+ if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
+ args: test_c_args)
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
+ args: test_c_args + ['-mxsave'])
+ cdata.set('HAVE_XSAVE_INTRINSICS', 1)
+ cflags_xsave += '-mxsave'
+ endif
+
+endif
+
+
+###############################################################
+# Check for the availability of AVX-512 popcount intrinsics.
+###############################################################
+
+cflags_popcnt = []
+if host_cpu == 'x86_64'
+
+ prog = '''
+#include <immintrin.h>
+
+int main(void)
+{
+ const char buf[sizeof(__m512i)];
+ INT64 popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ /* return computed value, to prevent the above being optimized away */
+ return popcnt == 0;
+}
+'''
+
+ if cc.links(prog, name: 'AVX-512 popcount without -mavx512vpopcntdq -mavx512bw',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ elif cc.links(prog, name: 'AVX-512 popcount with -mavx512vpopcntdq -mavx512bw',
+ args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'] + ['-mavx512bw'])
+ cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
+ cflags_popcnt += ['-mavx512vpopcntdq'] + ['-mavx512bw']
+ endif
+
+endif
+
###############################################################
# Select CRC-32C implementation.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8b3f8c24e0..36d880d225 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,7 +262,9 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
@@ -758,6 +760,9 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
+
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df..f8d3e3b6b8 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -513,6 +513,9 @@
/* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
#undef HAVE_X86_64_POPCNTQ
+/* Define to 1 if you have XSAVE intrinsics. */
+#undef HAVE_XSAVE_INTRINSICS
+
/* Define to 1 if the system has the type `_Bool'. */
#undef HAVE__BOOL
@@ -555,9 +558,15 @@
/* Define to 1 if you have __cpuid. */
#undef HAVE__CPUID
+/* Define to 1 if you have __cpuidex. */
+#undef HAVE__CPUIDEX
+
/* Define to 1 if you have __get_cpuid. */
#undef HAVE__GET_CPUID
+/* Define to 1 if you have __get_cpuid_count. */
+#undef HAVE__GET_CPUID_COUNT
+
/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT
@@ -680,6 +689,9 @@
/* Define to 1 to build with assertion checks. (--enable-cassert) */
#undef USE_ASSERT_CHECKING
+/* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
+#undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+
/* Define to 1 to build with Bonjour support. (--with-bonjour) */
#undef USE_BONJOUR
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index de480da71e..b453f84d8f 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -304,6 +304,17 @@ extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+/*
+ * We can also try to use the AVX-512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files.
+ */
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+
#else
/* Use a portable implementation -- no need for a function pointer. */
extern int pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index b0f4178b3d..5618050b30 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -100,8 +100,10 @@ pgxs_kv = {
' '.join(cflags_no_decl_after_statement),
'CFLAGS_CRC': ' '.join(cflags_crc),
+ 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
+ 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -177,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..db7c02117b 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
+ $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
@@ -92,6 +93,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
+pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
+pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
+
+# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
+pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..fd9ee199d1 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -84,6 +84,8 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+ ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+ ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -98,8 +100,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
+pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 6271acea60..411be90f73 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -163,6 +163,11 @@ choose_popcount_functions(void)
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
}
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+ if (pg_popcount_avx512_available())
+ pg_popcount_optimized = pg_popcount_avx512;
+#endif
}
static int
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
new file mode 100644
index 0000000000..0040361cf6
--- /dev/null
+++ b/src/port/pg_popcount_avx512.c
@@ -0,0 +1,82 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512.c
+ * Holds the pg_popcount() implementation that uses AVX-512 instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include <immintrin.h>
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on
+ * the function pointers that are only used when TRY_POPCNT_FAST is set.
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * pg_popcount_avx512
+ * Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_avx512(const char *buf, int bytes)
+{
+ __m512i val,
+ cnt;
+ __m512i accum = _mm512_setzero_si512();
+ const char *final;
+ int tail_idx;
+ __mmask64 mask = ~UINT64CONST(0);
+
+ /*
+ * Align buffer down to avoid double load overhead from unaligned access.
+ * Calculate a mask to ignore preceding bytes. Find start offset of final
+ * iteration and number of valid bytes making sure that final iteration is
+ * not empty.
+ */
+ mask <<= ((uintptr_t) buf) % sizeof(__m512i);
+ tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+ final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+
+ /*
+ * Iterate through all but the final iteration. Starting from second
+ * iteration, the start index mask is ignored.
+ */
+ if (buf < final)
+ {
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ buf += sizeof(__m512i);
+ mask = ~UINT64CONST(0);
+
+ for (; buf < final; buf += sizeof(__m512i))
+ {
+ val = _mm512_load_si512((const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ }
+ }
+
+ /* Final iteration needs to ignore bytes that are not within the length */
+ mask &= (~UINT64CONST(0) >> (sizeof(__m512i) - tail_idx));
+
+ val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+ cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ return _mm512_reduce_add_epi64(accum);
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
new file mode 100644
index 0000000000..d54147b88c
--- /dev/null
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_avx512_choose.c
+ * Test whether we can use AVX-512 POPCNT instructions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/port/pg_popcount_avx512_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef HAVE_XSAVE_INTRINSICS
+#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
+
+#include "port/pg_bitutils.h"
+
+/*
+ * It's unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on
+ * the function pointers that are only used when TRY_POPCNT_FAST is set.
+ */
+#ifdef TRY_POPCNT_FAST
+
+/*
+ * Returns true if the CPU supports AVX-512 POPCNT.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+ /* does CPUID say there's support for AVX-512 POPCNT? */
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+ if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */
+ return false;
+
+ /* does CPUID say there's support for AVX-512 BW? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+ if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */
+ return false;
+
+ /* does CPUID say there's support for XGETBV? */
+ memset(exx, 0, sizeof(exx));
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+ if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ return false;
+
+ /* does XGETBV say the ZMM registers are enabled? */
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe0) == 0xe0;
+#else
+ return false;
+#endif
+}
+
+#endif /* TRY_POPCNT_FAST */
diff --git a/src/test/regress/expected/bit.out b/src/test/regress/expected/bit.out
index e17cbf42ca..6a436288bb 100644
--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -740,6 +740,30 @@ SELECT bit_count(B'1111111111'::bit(10));
10
(1 row)
+SELECT bit_count(repeat('0', 100)::bit(100));
+ bit_count
+-----------
+ 0
+(1 row)
+
+SELECT bit_count(repeat('1', 100)::bit(100));
+ bit_count
+-----------
+ 100
+(1 row)
+
+SELECT bit_count(repeat('01', 500)::bit(1000));
+ bit_count
+-----------
+ 500
+(1 row)
+
+SELECT bit_count(repeat('10101', 200)::bit(1000));
+ bit_count
+-----------
+ 600
+(1 row)
+
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
b1 bit(4) DEFAULT '1001',
diff --git a/src/test/regress/sql/bit.sql b/src/test/regress/sql/bit.sql
index 34230b99fb..8ba6facd03 100644
--- a/src/test/regress/sql/bit.sql
+++ b/src/test/regress/sql/bit.sql
@@ -223,6 +223,10 @@ SELECT overlay(B'0101011100' placing '001' from 20);
-- bit_count
SELECT bit_count(B'0101011100'::bit(10));
SELECT bit_count(B'1111111111'::bit(10));
+SELECT bit_count(repeat('0', 100)::bit(100));
+SELECT bit_count(repeat('1', 100)::bit(100));
+SELECT bit_count(repeat('01', 500)::bit(1000));
+SELECT bit_count(repeat('10101', 200)::bit(1000));
-- This table is intentionally left around to exercise pg_dump/pg_upgrade
CREATE TABLE bit_defaults(
--
2.25.1
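
For readers following the mask arithmetic in pg_popcount_avx512(), here is a scalar sketch of the same head/tail handling. It is not part of the patch. The names popcount_naive, block_popcount, popcount_model and BLOCK are hypothetical, alignment is modeled by array index rather than by pointer address, and the zero-masked vector load is emulated by skipping every byte whose mask bit is clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64				/* stands in for sizeof(__m512i) */

/* Reference implementation: count set bits one byte at a time. */
uint64_t
popcount_naive(const unsigned char *buf, int bytes)
{
	uint64_t	n = 0;

	for (int i = 0; i < bytes; i++)
		for (unsigned char b = buf[i]; b; b &= (unsigned char) (b - 1))
			n++;
	return n;
}

/*
 * Count bits in one 64-byte block, reading only bytes whose mask bit is set.
 * This emulates a zero-masked byte load followed by a vector popcount.
 */
static uint64_t
block_popcount(const unsigned char *block, uint64_t mask)
{
	uint64_t	n = 0;

	for (int i = 0; i < BLOCK; i++)
		if (mask & (UINT64_C(1) << i))
			for (unsigned char b = block[i]; b; b &= (unsigned char) (b - 1))
				n++;
	return n;
}

/*
 * Count bits in arena[start .. start + bytes - 1], treating arena[0] as
 * 64-byte aligned.  Assumes bytes > 0, as the tail computation requires.
 */
uint64_t
popcount_model(const unsigned char *arena, size_t start, int bytes)
{
	uint64_t	mask = ~UINT64_C(0);
	uint64_t	total = 0;
	size_t		pos;
	size_t		final;
	int			tail_idx;

	assert(bytes > 0);

	mask <<= start % BLOCK;		/* ignore bytes before the start */
	tail_idx = (int) ((start + bytes - 1) % BLOCK) + 1; /* valid bytes in last block */
	final = (start + bytes - 1) / BLOCK * BLOCK;	/* offset of last block */
	pos = start / BLOCK * BLOCK;	/* start, aligned down */

	/* All blocks but the last; the head mask applies to the first one only. */
	for (; pos < final; pos += BLOCK)
	{
		total += block_popcount(arena + pos, mask);
		mask = ~UINT64_C(0);
	}

	/* Last block: also drop bytes past the end of the range. */
	mask &= ~UINT64_C(0) >> (BLOCK - tail_idx);
	total += block_popcount(arena + pos, mask);

	return total;
}
```

When the whole range fits in one block, the loop body never runs and the head and tail masks are combined into a single mask, which is why the final step uses "&=" rather than plain assignment.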
v28-0002-Optimize-visibilitymap_count-with-AVX-512-instru.patch (text/x-diff; charset=us-ascii)
From 01e8c3fc481fda518b7d92cb6af044c6cda410e3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 6 Apr 2024 11:55:48 -0500
Subject: [PATCH v28 2/2] Optimize visibilitymap_count() with AVX-512
instructions.
Thanks to the infrastructure added by commit XXXXXXXXXX, we can
pretty easily optimize this function with AVX-512 intrinsic
functions. A new pg_popcount_masked() function is introduced that
applies a bitmask to every byte in the buffer prior to calculating
the population count, which is used to filter out the all-visible
or all-frozen bits as needed. Platforms without AVX-512 support
should also see a nice speedup due to the reduced number of calls
to a function pointer.
Co-authored-by: Ants Aasma
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
---
src/backend/access/heap/visibilitymap.c | 25 +----
src/include/port/pg_bitutils.h | 34 +++++++
src/port/pg_bitutils.c | 126 ++++++++++++++++++++++++
src/port/pg_popcount_avx512.c | 61 ++++++++++++
4 files changed, 226 insertions(+), 20 deletions(-)
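
As a rough illustration of why a per-byte mask can replace the old 64-bit masks, the sketch below compares a byte-at-a-time masked count with a word-at-a-time one that replicates the 8-bit mask into every byte of a 64-bit word. It is an assumption-laden stand-in, not the patch's code: popcount_masked_bytes and popcount_masked_words are hypothetical names, uint8_t is used in place of bits8, and the bit counting uses a portable loop instead of POPCNT.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VISIBLE_MASK8	0x55	/* lower bit of each bit pair */
#define FROZEN_MASK8	0xaa	/* upper bit of each bit pair */

/* Apply the mask to each byte and count the surviving 1-bits. */
uint64_t
popcount_masked_bytes(const char *buf, int bytes, uint8_t mask)
{
	uint64_t	n = 0;

	assert(bytes >= 0);
	while (bytes--)
	{
		unsigned char b = (unsigned char) *buf++ & mask;

		for (; b; b &= (unsigned char) (b - 1))
			n++;
	}
	return n;
}

/* Same result, eight bytes at a time, using a replicated 64-bit mask. */
uint64_t
popcount_masked_words(const char *buf, int bytes, uint8_t mask)
{
	/* 0x0101010101010101 * mask copies the mask into all eight bytes */
	uint64_t	mask64 = ~UINT64_C(0) / 0xFF * mask;
	uint64_t	n = 0;

	assert(bytes >= 0);
	while (bytes >= 8)
	{
		uint64_t	w;

		memcpy(&w, buf, sizeof(w));	/* avoids unaligned access */
		for (w &= mask64; w; w &= w - 1)
			n++;
		buf += 8;
		bytes -= 8;
	}

	/* Leftover bytes are handled one at a time. */
	return n + popcount_masked_bytes(buf, bytes, mask);
}
```

Because the replicated mask is identical in every byte, the word-at-a-time result does not depend on byte order. With VISIBLE_MASK8, mask64 works out to 0x5555555555555555, the value of the removed VISIBLE_MASK64.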
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3..8b24e7bc33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -119,10 +119,8 @@
#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Masks for counting subsets of bits in the visibility map. */
-#define VISIBLE_MASK64 UINT64CONST(0x5555555555555555) /* The lower bit of each
- * bit pair */
-#define FROZEN_MASK64 UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
- * bit pair */
+#define VISIBLE_MASK8 (0x55) /* The lower bit of each bit pair */
+#define FROZEN_MASK8 (0xaa) /* The upper bit of each bit pair */
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
@@ -396,7 +394,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
{
Buffer mapBuffer;
uint64 *map;
- int i;
/*
* Read till we fall off the end of the map. We assume that any extra
@@ -414,21 +411,9 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
*/
map = (uint64 *) PageGetContents(BufferGetPage(mapBuffer));
- StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
- "unsupported MAPSIZE");
- if (all_frozen == NULL)
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- }
- else
- {
- for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
- {
- nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
- nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
- }
- }
+ nvisible += pg_popcount_masked((const char *) map, MAPSIZE, VISIBLE_MASK8);
+ if (all_frozen)
+ nfrozen += pg_popcount_masked((const char *) map, MAPSIZE, FROZEN_MASK8);
ReleaseBuffer(mapBuffer);
}
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index b453f84d8f..4d88478c9c 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -303,6 +303,7 @@ pg_ceil_log2_64(uint64 num)
extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
/*
* We can also try to use the AVX-512 popcount instruction on some systems.
@@ -313,6 +314,7 @@ extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
extern bool pg_popcount_avx512_available(void);
extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
#endif
#else
@@ -320,6 +322,7 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
extern int pg_popcount32(uint32 word);
extern int pg_popcount64(uint64 word);
extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
#endif /* TRY_POPCNT_FAST */
@@ -357,6 +360,37 @@ pg_popcount(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+/*
+ * Returns the number of 1-bits in buf after applying the mask to each byte.
+ *
+ * Similar to pg_popcount(), we only take on the function pointer overhead when
+ * it's likely to be faster.
+ */
+static inline uint64
+pg_popcount_masked(const char *buf, int bytes, bits8 mask)
+{
+ /*
+ * We set the threshold to the point at which we'll first use special
+ * instructions in the optimized version.
+ */
+#if SIZEOF_VOID_P >= 8
+ int threshold = 8;
+#else
+ int threshold = 4;
+#endif
+
+ if (bytes < threshold)
+ {
+ uint64 popcnt = 0;
+
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+ return popcnt;
+ }
+
+ return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
/*
* Rotate the bits of "word" to the right/left by n bits.
*/
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 411be90f73..88bc5cdbb1 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -106,19 +106,23 @@ const uint8 pg_number_of_ones[256] = {
static inline int pg_popcount32_slow(uint32 word);
static inline int pg_popcount64_slow(uint64 word);
static uint64 pg_popcount_slow(const char *buf, int bytes);
+static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
#ifdef TRY_POPCNT_FAST
static bool pg_popcount_available(void);
static int pg_popcount32_choose(uint32 word);
static int pg_popcount64_choose(uint64 word);
static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
static inline int pg_popcount32_fast(uint32 word);
static inline int pg_popcount64_fast(uint64 word);
static uint64 pg_popcount_fast(const char *buf, int bytes);
+static uint64 pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask);
int (*pg_popcount32) (uint32 word) = pg_popcount32_choose;
int (*pg_popcount64) (uint64 word) = pg_popcount64_choose;
uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
#endif /* TRY_POPCNT_FAST */
#ifdef TRY_POPCNT_FAST
@@ -156,17 +160,22 @@ choose_popcount_functions(void)
pg_popcount32 = pg_popcount32_fast;
pg_popcount64 = pg_popcount64_fast;
pg_popcount_optimized = pg_popcount_fast;
+ pg_popcount_masked_optimized = pg_popcount_masked_fast;
}
else
{
pg_popcount32 = pg_popcount32_slow;
pg_popcount64 = pg_popcount64_slow;
pg_popcount_optimized = pg_popcount_slow;
+ pg_popcount_masked_optimized = pg_popcount_masked_slow;
}
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
if (pg_popcount_avx512_available())
+ {
pg_popcount_optimized = pg_popcount_avx512;
+ pg_popcount_masked_optimized = pg_popcount_masked_avx512;
+ }
#endif
}
@@ -191,6 +200,13 @@ pg_popcount_choose(const char *buf, int bytes)
return pg_popcount_optimized(buf, bytes);
}
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+ choose_popcount_functions();
+ return pg_popcount_masked(buf, bytes, mask);
+}
+
/*
* pg_popcount32_fast
* Return the number of 1 bits set in word
@@ -271,6 +287,56 @@ pg_popcount_fast(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_fast
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_fast(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_fast(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#endif /* TRY_POPCNT_FAST */
@@ -370,6 +436,56 @@ pg_popcount_slow(const char *buf, int bytes)
return popcnt;
}
+/*
+ * pg_popcount_masked_slow
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+static uint64
+pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
+{
+ uint64 popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+ /* Process in 64-bit chunks if the buffer is aligned */
+ uint64 maskv = ~UINT64CONST(0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(8, buf))
+ {
+ const uint64 *words = (const uint64 *) buf;
+
+ while (bytes >= 8)
+ {
+ popcnt += pg_popcount64_slow(*words++ & maskv);
+ bytes -= 8;
+ }
+
+ buf = (const char *) words;
+ }
+#else
+ /* Process in 32-bit chunks if the buffer is aligned. */
+ uint32 maskv = ~((uint32) 0) / 0xFF * mask;
+
+ if (buf == (const char *) TYPEALIGN(4, buf))
+ {
+ const uint32 *words = (const uint32 *) buf;
+
+ while (bytes >= 4)
+ {
+ popcnt += pg_popcount32_slow(*words++ & maskv);
+ bytes -= 4;
+ }
+
+ buf = (const char *) words;
+ }
+#endif
+
+ /* Process any remaining bytes */
+ while (bytes--)
+ popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+ return popcnt;
+}
+
#ifndef TRY_POPCNT_FAST
/*
@@ -401,4 +517,14 @@ pg_popcount_optimized(const char *buf, int bytes)
return pg_popcount_slow(buf, bytes);
}
+/*
+ * pg_popcount_masked_optimized
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+ return pg_popcount_masked_slow(buf, bytes, mask);
+}
+
#endif /* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 0040361cf6..a52615eb8b 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -79,4 +79,65 @@ pg_popcount_avx512(const char *buf, int bytes)
return _mm512_reduce_add_epi64(accum);
}
+/*
+ * pg_popcount_masked_avx512
+ * Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
+{
+ __m512i val,
+ vmasked,
+ cnt;
+ __m512i accum = _mm512_setzero_si512();
+ const char *final;
+ int tail_idx;
+ __mmask64 bmask = ~UINT64CONST(0);
+ const __m512i maskv = _mm512_set1_epi8(mask);
+
+ /*
+ * Align buffer down to avoid double load overhead from unaligned access.
+ * Calculate a mask to ignore preceding bytes. Find start offset of final
+ * iteration and number of valid bytes making sure that final iteration is
+ * not empty.
+ */
+ bmask <<= ((uintptr_t) buf) % sizeof(__m512i);
+ tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+ final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+ buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+
+ /*
+ * Iterate through all but the final iteration. Starting from second
+ * iteration, the start index mask is ignored.
+ */
+ if (buf < final)
+ {
+ val = _mm512_maskz_loadu_epi8(bmask, (const __m512i *) buf);
+ vmasked = _mm512_and_si512(val, maskv);
+ cnt = _mm512_popcnt_epi64(vmasked);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ buf += sizeof(__m512i);
+ bmask = ~UINT64CONST(0);
+
+ for (; buf < final; buf += sizeof(__m512i))
+ {
+ val = _mm512_load_si512((const __m512i *) buf);
+ vmasked = _mm512_and_si512(val, maskv);
+ cnt = _mm512_popcnt_epi64(vmasked);
+ accum = _mm512_add_epi64(accum, cnt);
+ }
+ }
+
+ /* Final iteration needs to ignore bytes that are not within the length */
+ bmask &= (~UINT64CONST(0) >> (sizeof(__m512i) - tail_idx));
+
+ val = _mm512_maskz_loadu_epi8(bmask, (const __m512i *) buf);
+ vmasked = _mm512_and_si512(val, maskv);
+ cnt = _mm512_popcnt_epi64(vmasked);
+ accum = _mm512_add_epi64(accum, cnt);
+
+ return _mm512_reduce_add_epi64(accum);
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
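The word-at-a-time paths in the patch above rely on replicating the one-byte mask into every byte of a wider word (`~UINT64CONST(0) / 0xFF * mask`). The following is a minimal standalone sketch of that idea, not the PostgreSQL code itself: the names `broadcast_mask`, `popcount64`, `popcount_masked_bytewise` and `popcount_masked_words` are made up for illustration, and the sketch uses memcpy instead of the alignment check so that it works for any buffer address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate an 8-bit mask into all eight bytes of a 64-bit word. */
static uint64_t
broadcast_mask(uint8_t mask)
{
    /* ~0 / 0xFF is 0x0101010101010101; multiplying copies mask into each byte */
    return (~UINT64_C(0) / 0xFF) * mask;
}

/* Portable bit count for one word (clears the lowest set bit per iteration). */
static int
popcount64(uint64_t w)
{
    int n = 0;

    while (w != 0)
    {
        w &= w - 1;
        n++;
    }
    return n;
}

/* Reference version: mask and count one byte at a time. */
static uint64_t
popcount_masked_bytewise(const unsigned char *buf, size_t bytes, uint8_t mask)
{
    uint64_t total = 0;

    assert(buf != NULL || bytes == 0);
    while (bytes--)
        total += (uint64_t) popcount64((uint64_t) (*buf++ & mask));
    return total;
}

/* Word-at-a-time version: one AND with the broadcast mask per 8 bytes. */
static uint64_t
popcount_masked_words(const unsigned char *buf, size_t bytes, uint8_t mask)
{
    const uint64_t maskv = broadcast_mask(mask);
    uint64_t total = 0;

    assert(buf != NULL || bytes == 0);
    while (bytes >= 8)
    {
        uint64_t w;

        memcpy(&w, buf, sizeof(w));     /* avoids any alignment requirement */
        total += (uint64_t) popcount64(w & maskv);
        buf += 8;
        bytes -= 8;
    }

    /* leftover bytes use the plain one-byte mask */
    while (bytes--)
        total += (uint64_t) popcount64((uint64_t) (*buf++ & mask));
    return total;
}
```

Because the same mask is applied to every byte, byte order within the word does not matter, which is why the word-wise and byte-wise counts should always agree.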
On Sat, Apr 06, 2024 at 02:41:01PM -0500, Nathan Bossart wrote:
Here is what I have staged for commit, which I intend to do shortly.
Committed.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
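For readers following the AVX-512 variant in the committed patch, the head and tail byte masks can be checked with ordinary integer arithmetic. This is a sketch under stated assumptions: `block_plan`, `plan_blocks`, `bits_set` and `covered_bytes` are hypothetical names, the address is passed as a plain integer, and `bytes` is assumed to be greater than zero (as the patch's comment about a non-empty final iteration implies).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 64          /* sizeof(__m512i) in the patch */

/* Hypothetical helper type: which bytes count in the first and last block. */
typedef struct
{
    uint64_t    first_mask;     /* bit i set => byte i of the first block counts */
    uint64_t    last_mask;      /* same for the last block */
    size_t      nblocks;        /* aligned 64-byte blocks touched */
} block_plan;

static block_plan
plan_blocks(uintptr_t addr, size_t bytes)
{
    block_plan  plan;
    uintptr_t   last_byte;
    uint64_t    head;
    uint64_t    tail;
    unsigned    tail_idx;

    assert(bytes > 0);          /* the final block must not be empty */

    last_byte = addr + bytes - 1;

    /* ignore the bytes that precede addr inside the first aligned block */
    head = ~UINT64_C(0) << (addr % BLOCK_BYTES);

    /* number of valid bytes in the last block, always in 1..64 */
    tail_idx = (unsigned) (last_byte % BLOCK_BYTES) + 1;
    tail = ~UINT64_C(0) >> (BLOCK_BYTES - tail_idx);

    plan.nblocks = (size_t) (last_byte / BLOCK_BYTES - addr / BLOCK_BYTES) + 1;
    if (plan.nblocks == 1)
    {
        /* a single block needs both restrictions at once */
        plan.first_mask = head & tail;
        plan.last_mask = plan.first_mask;
    }
    else
    {
        plan.first_mask = head;
        plan.last_mask = tail;
    }
    return plan;
}

/* Count set bits in a mask. */
static size_t
bits_set(uint64_t w)
{
    size_t      n = 0;

    while (w != 0)
    {
        w &= w - 1;
        n++;
    }
    return n;
}

/* Total bytes selected by the plan; should equal the requested length. */
static size_t
covered_bytes(block_plan plan)
{
    if (plan.nblocks == 1)
        return bits_set(plan.first_mask);
    return bits_set(plan.first_mask) + bits_set(plan.last_mask) +
        (plan.nblocks - 2) * BLOCK_BYTES;
}
```

Both shift counts stay in the range 0..63, so neither shift is undefined, and the selected byte count matching `bytes` is a cheap sanity check on the mask arithmetic.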
Nathan Bossart <nathandbossart@gmail.com> writes:
Here is what I have staged for commit, which I intend to do shortly.
Today's Coverity run produced this warning, which seemingly was
triggered by one of these commits, but I can't make much sense
of it:
*** CID 1596255: Uninitialized variables (UNINIT)
/usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in _mm256_undefined_si256()
1214 extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
1215 _mm256_undefined_si256 (void)
1216 {
1217 __m256i __Y = __Y;
CID 1596255: Uninitialized variables (UNINIT)
Using uninitialized value "__Y".
1218 return __Y;
1219 }
I see the same code in my local copy of avxintrin.h,
and I quite agree that it looks like either an undefined
value or something that properly ought to be an error.
If we are calling this, why (and from where)?
Anyway, we can certainly just dismiss this warning if it
doesn't correspond to any real problem in our code.
But I thought I'd raise the question.
regards, tom lane
On Sun, Apr 07, 2024 at 08:42:12PM -0400, Tom Lane wrote:
Today's Coverity run produced this warning, which seemingly was
triggered by one of these commits, but I can't make much sense
of it:
*** CID 1596255: Uninitialized variables (UNINIT)
/usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in _mm256_undefined_si256()
1214 extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
1215 _mm256_undefined_si256 (void)
1216 {
1217 __m256i __Y = __Y;
CID 1596255: Uninitialized variables (UNINIT)
Using uninitialized value "__Y".
1218 return __Y;
1219 }
I see the same code in my local copy of avxintrin.h,
and I quite agree that it looks like either an undefined
value or something that properly ought to be an error.
If we are calling this, why (and from where)?
Nothing in these commits uses this, or even uses the 256-bit registers.
avxintrin.h is included by immintrin.h, which is probably why this is
showing up. I believe you're supposed to use immintrin.h for the
intrinsics used in these commits, so I don't immediately see a great way to
avoid this. The Intel documentation for _mm256_undefined_si256() [0]
indicates that it is intended to return "undefined elements," so it seems
like the use of an uninitialized variable might be intentional.
Anyway, we can certainly just dismiss this warning if it
doesn't correspond to any real problem in our code.
But I thought I'd raise the question.
That's probably the right thing to do, unless there's some action we can
take to suppress this warning.
[0]: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_undefined_si256&ig_expand=6943
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote:
The Intel documentation for _mm256_undefined_si256() [0]
indicates that it is intended to return "undefined elements," so it seems
like the use of an uninitialized variable might be intentional.
See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote:
The Intel documentation for _mm256_undefined_si256() [0]
indicates that it is intended to return "undefined elements," so it seems
like the use of an uninitialized variable might be intentional.
See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122.
Ah, interesting. That hasn't propagated to stable distros yet,
evidently (and even when it does, I wonder how soon Coverity
will understand it). Anyway, that does establish that it's
gcc's problem not ours. Thanks for digging!
regards, tom lane
It was brought to my attention [0] that we probably should be checking for
the OSXSAVE bit instead of the XSAVE bit when determining whether there's
support for the XGETBV instruction. IIUC that should indicate that both
the OS and the processor have XGETBV support (not just the processor).
I've attached a one-line patch to fix this.
[0]: https://github.com/pgvector/pgvector/pull/519#issuecomment-2062804463
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
osxsave.patch (text/x-diff; charset=us-ascii)
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
index ae3fa3d306..cc3e89e096 100644
--- a/src/port/pg_popcount_avx512_choose.c
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -74,7 +74,7 @@ pg_popcount_avx512_available(void)
#else
#error cpuid instruction not available
#endif
- if ((exx[2] & (1 << 26)) == 0) /* xsave */
+ if ((exx[2] & (1 << 27)) == 0) /* osxsave */
return false;
/* Does XGETBV say the ZMM registers are enabled? */
It was brought to my attention [0] that we probably should be checking for the OSXSAVE bit instead of the XSAVE bit when determining whether there's support for the XGETBV instruction. IIUC that should indicate that both the OS and the processor have XGETBV support (not just the processor).
I've attached a one-line patch to fix this.
[0] https://github.com/pgvector/pgvector/pull/519#issuecomment-2062804463
Good find. I confirmed this after speaking with an Intel expert, and from the Intel AVX-512 manual [0], section 14.3, which recommends checking bit 27. From the manual:
"Prior to using Intel AVX, the application must identify that the operating system supports the XGETBV instruction,
the YMM register state, in addition to processor's support for YMM state management using XSAVE/XRSTOR and
AVX instructions. The following simplified sequence accomplishes both and is strongly recommended.
1) Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application use).
2) Issue XGETBV and verify that XCR0[2:1] = '11b' (XMM state and YMM state are enabled by OS).
3) Detect CPUID.1:ECX.AVX[bit 28] = 1 (AVX instructions supported).
(Step 3 can be done in any order relative to 1 and 2.)"
It also seems that step 1 and step 2 need to be done prior to the CPUID OSXSAVE check in the popcount code.
[0]: https://cdrdv2.intel.com/v1/dl/getContent/671200
- Akash Shankaran
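The three-step sequence quoted from the manual can be written as pure functions over the register values, which makes the XSAVE (bit 26) versus OSXSAVE (bit 27) distinction easy to exercise without running CPUID. This is only a sketch: the macro and function names are hypothetical, and real code must not execute XGETBV to obtain `xcr0` until step 1 has passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, ECX feature bits */
#define CPUID1_ECX_XSAVE    (UINT32_C(1) << 26)     /* processor supports XSAVE */
#define CPUID1_ECX_OSXSAVE  (UINT32_C(1) << 27)     /* OS enabled XSAVE/XGETBV */
#define CPUID1_ECX_AVX      (UINT32_C(1) << 28)     /* AVX instructions */

/* XCR0 bits 1 and 2: XMM (SSE) state and YMM (AVX) state */
#define XCR0_SSE_AVX        UINT64_C(0x6)

/* Step 1: is XGETBV enabled for application use? */
static bool
xgetbv_usable(uint32_t cpuid1_ecx)
{
    return (cpuid1_ecx & CPUID1_ECX_OSXSAVE) != 0;
}

/*
 * Steps 1-3 combined.  xcr0 is passed in so the logic can be tested; in real
 * code it would only be read with XGETBV after xgetbv_usable() returned true.
 */
static bool
avx_usable(uint32_t cpuid1_ecx, uint64_t xcr0)
{
    if (!xgetbv_usable(cpuid1_ecx))
        return false;                                   /* step 1 */
    if ((xcr0 & XCR0_SSE_AVX) != XCR0_SSE_AVX)
        return false;                                   /* step 2 */
    return (cpuid1_ecx & CPUID1_ECX_AVX) != 0;          /* step 3 */
}

/* A few illustrative cases. */
void
avx_detection_examples(void)
{
    /* processor has XSAVE, but the OS has not enabled it: not usable */
    assert(!avx_usable(CPUID1_ECX_XSAVE | CPUID1_ECX_AVX, 0x7));

    /* everything enabled */
    assert(avx_usable(CPUID1_ECX_XSAVE | CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX, 0x7));
}
```

The first example is the case the one-line patch is meant to cover: checking bit 26 alone would have accepted it.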
On Thu, Apr 18, 2024 at 06:12:22PM +0000, Shankaran, Akash wrote:
Good find. I confirmed this after speaking with an Intel expert, and from the Intel AVX-512 manual [0], section 14.3, which recommends checking bit 27. From the manual:
"Prior to using Intel AVX, the application must identify that the operating system supports the XGETBV instruction,
the YMM register state, in addition to processor's support for YMM state management using XSAVE/XRSTOR and
AVX instructions. The following simplified sequence accomplishes both and is strongly recommended.
1) Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application use).
2) Issue XGETBV and verify that XCR0[2:1] = '11b' (XMM state and YMM state are enabled by OS).
3) Detect CPUID.1:ECX.AVX[bit 28] = 1 (AVX instructions supported).
(Step 3 can be done in any order relative to 1 and 2.)"
Thanks for confirming. IIUC my patch should be sufficient, then.
It also seems that step 1 and step 2 need to be done prior to the CPUID OSXSAVE check in the popcount code.
This seems to contradict the note about doing step 3 at any point, and
given step 1 is the OSXSAVE check, I'm not following what this means,
anyway.
I'm also wondering if we need to check that (_xgetbv(0) & 0xe6) == 0xe6
instead of just (_xgetbv(0) & 0xe0) != 0, as the status of the lower half
of some of the ZMM registers is stored in the SSE and AVX state [0]. I
don't know how likely it is that 0xe0 would succeed but 0xe6 wouldn't, but
we might as well make it correct.
[0]: https://en.wikipedia.org/wiki/Control_register#cite_ref-23
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Apr 18, 2024 at 08:24:03PM +0000, Devulapalli, Raghuveer wrote:
This seems to contradict the note about doing step 3 at any point, and
given step 1 is the OSXSAVE check, I'm not following what this means,
anyway.
It is recommended that you run the xgetbv code before you check for cpu
features avx512-popcnt and avx512-bw. The way it is written now is the
opposite order. I would also recommend splitting the cpuid feature check
for avx512popcnt/avx512bw and xgetbv section into separate functions to
make them modular. Something like:
static inline
int check_os_avx512_support(void)
{
    // (1) run cpuid leaf 1 to check for xgetbv instruction support:
    unsigned int exx[4] = {0, 0, 0, 0};
    __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
    if ((exx[2] & (1 << 27)) == 0) /* osxsave */
        return false;
    /* Does XGETBV say the ZMM/YMM/XMM registers are enabled? */
    return (_xgetbv(0) & 0xe0) == 0xe0;
}
I'm also wondering if we need to check that (_xgetbv(0) & 0xe6) == 0xe6
instead of just (_xgetbv(0) & 0xe0) != 0, as the status of the lower
half of some of the ZMM registers is stored in the SSE and AVX state
[0]. I don't know how likely it is that 0xe0 would succeed but 0xe6
wouldn't, but we might as well make it correct.
This is correct. It needs to check all the 3 bits (XMM/YMM and ZMM). The
way it is written now is incorrect.
Thanks for the feedback. I've attached an updated patch.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-osxsave.patch (text/x-diff; charset=us-ascii)
From d20b19804a17d9f6eab1d40de7e9fb10488ac6b0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 18 Apr 2024 15:57:56 -0500
Subject: [PATCH v2 1/1] osxsave
---
src/port/pg_popcount_avx512_choose.c | 89 +++++++++++++++++++---------
1 file changed, 60 insertions(+), 29 deletions(-)
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
index ae3fa3d306..009f94909a 100644
--- a/src/port/pg_popcount_avx512_choose.c
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -34,27 +34,47 @@
#ifdef TRY_POPCNT_FAST
/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
+ * Does CPUID say there's support for XSAVE instructions?
*/
-bool
-pg_popcount_avx512_available(void)
+static inline bool
+xsave_available(void)
{
unsigned int exx[4] = {0, 0, 0, 0};
- /* Does CPUID say there's support for AVX-512 popcount instructions? */
-#if defined(HAVE__GET_CPUID_COUNT)
- __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
- __cpuidex(exx, 7, 0);
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
#else
#error cpuid instruction not available
#endif
- if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */
- return false;
+ return (exx[2] & (1 << 27)) != 0; /* osxsave */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that xsave_available() returns true
+ * before calling this.
+ */
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe6) != 0xe6;
+#else
+ return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount instructions?
+ */
+static inline bool
+avx512_popcnt_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
- /* Does CPUID say there's support for AVX-512 byte and word instructions? */
- memset(exx, 0, sizeof(exx));
#if defined(HAVE__GET_CPUID_COUNT)
__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUIDEX)
@@ -62,27 +82,38 @@ pg_popcount_avx512_available(void)
#else
#error cpuid instruction not available
#endif
- if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */
- return false;
+ return (exx[2] & (1 << 14)) != 0; /* avx512-vpopcntdq */
+}
- /* Does CPUID say there's support for XSAVE instructions? */
- memset(exx, 0, sizeof(exx));
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
+/*
+ * Does CPUID say there's support for AVX-512 byte and word instructions?
+ */
+static inline bool
+avx512_bw_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
#else
#error cpuid instruction not available
#endif
- if ((exx[2] & (1 << 26)) == 0) /* xsave */
- return false;
+ return (exx[1] & (1 << 30)) != 0; /* avx512-bw */
+}
- /* Does XGETBV say the ZMM registers are enabled? */
-#ifdef HAVE_XSAVE_INTRINSICS
- return (_xgetbv(0) & 0xe0) != 0;
-#else
- return false;
-#endif
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ return xsave_available() &&
+ zmm_regs_available() &&
+ avx512_popcnt_available() &&
+ avx512_bw_available();
}
#endif /* TRY_POPCNT_FAST */
--
2.25.1
Thanks for the feedback. I've attached an updated patch.
(1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise zmm_regs_available() will return false.
(2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the same cpuid leaf. You could combine them into one to avoid running cpuid twice. My apologies, I should have mentioned this before.
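To make point (1) concrete, here is a small sketch comparing the three XCR0 tests that appear in this thread: the original `& 0xe0) != 0`, the v2 `& 0xe6) != 0xe6`, and the proposed `& 0xe6) == 0xe6`. The function and enum names are hypothetical; the bit assignments (SSE = bit 1, AVX = bit 2, opmask = bit 5, ZMM_Hi256 = bit 6, Hi16_ZMM = bit 7) are the architectural XCR0 layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* XCR0 state-component bits relevant to AVX-512 */
enum
{
    XCR0_SSE = 1 << 1,          /* XMM registers */
    XCR0_AVX = 1 << 2,          /* upper halves of YMM registers */
    XCR0_OPMASK = 1 << 5,       /* k0-k7 mask registers */
    XCR0_ZMM_HI256 = 1 << 6,    /* upper halves of ZMM0-ZMM15 */
    XCR0_HI16_ZMM = 1 << 7      /* ZMM16-ZMM31 */
};

/* All five bits together: 0xe6 */
#define XCR0_AVX512_STATE \
    (XCR0_SSE | XCR0_AVX | XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/* Original check: true if any of the three upper bits is set. */
static bool
zmm_check_original(uint64_t xcr0)
{
    return (xcr0 & 0xe0) != 0;
}

/* v2 check as posted: inverted, so a fully enabled system is rejected. */
static bool
zmm_check_v2(uint64_t xcr0)
{
    return (xcr0 & 0xe6) != 0xe6;
}

/* Proposed check: every required state component must be enabled. */
static bool
zmm_check_fixed(uint64_t xcr0)
{
    return (xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE;
}

/* Illustrative XCR0 values. */
void
zmm_check_examples(void)
{
    /* x87 + SSE + AVX + all AVX-512 state enabled */
    assert(zmm_check_original(0xe7));
    assert(!zmm_check_v2(0xe7));
    assert(zmm_check_fixed(0xe7));

    /* only x87 + SSE + AVX enabled: v2 wrongly reports success */
    assert(!zmm_check_original(0x07));
    assert(zmm_check_v2(0x07));
    assert(!zmm_check_fixed(0x07));
}
```

The value 0x27 (opmask state enabled, ZMM state not) is an artificial case, but it shows why `!= 0` on 0xe0 is weaker than requiring all bits.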
On Thu, Apr 18, 2024 at 09:29:55PM +0000, Devulapalli, Raghuveer wrote:
(1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise
zmm_regs_available() will return false.
Yes, that's a mistake. I fixed that in v3.
(2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the
same cpuid leaf. You could combine them into one to avoid running cpuid
twice. My apologies, I should have mentioned this before.
Good call. The byte-and-word instructions were a late addition to the
patch, so I missed this originally.
On that note, is it necessary to also check for avx512f? At the moment, we
are assuming that's supported if the other AVX-512 instructions are
available.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-osxsave.patch (text/x-diff; charset=us-ascii)
From e04c348eb389c6aa1597ac35d57b5e7ae7075381 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 18 Apr 2024 15:57:56 -0500
Subject: [PATCH v3 1/1] osxsave
---
src/port/pg_popcount_avx512_choose.c | 80 ++++++++++++++++------------
1 file changed, 47 insertions(+), 33 deletions(-)
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
index ae3fa3d306..b37107803a 100644
--- a/src/port/pg_popcount_avx512_choose.c
+++ b/src/port/pg_popcount_avx512_choose.c
@@ -34,39 +34,13 @@
#ifdef TRY_POPCNT_FAST
/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
+ * Does CPUID say there's support for XSAVE instructions?
*/
-bool
-pg_popcount_avx512_available(void)
+static inline bool
+xsave_available(void)
{
unsigned int exx[4] = {0, 0, 0, 0};
- /* Does CPUID say there's support for AVX-512 popcount instructions? */
-#if defined(HAVE__GET_CPUID_COUNT)
- __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
- __cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
- if ((exx[2] & (1 << 14)) == 0) /* avx512-vpopcntdq */
- return false;
-
- /* Does CPUID say there's support for AVX-512 byte and word instructions? */
- memset(exx, 0, sizeof(exx));
-#if defined(HAVE__GET_CPUID_COUNT)
- __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
- __cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
- if ((exx[1] & (1 << 30)) == 0) /* avx512-bw */
- return false;
-
- /* Does CPUID say there's support for XSAVE instructions? */
- memset(exx, 0, sizeof(exx));
#if defined(HAVE__GET_CPUID)
__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
@@ -74,15 +48,55 @@ pg_popcount_avx512_available(void)
#else
#error cpuid instruction not available
#endif
- if ((exx[2] & (1 << 26)) == 0) /* xsave */
- return false;
+ return (exx[2] & (1 << 27)) != 0; /* osxsave */
+}
- /* Does XGETBV say the ZMM registers are enabled? */
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that xsave_available() returns true
+ * before calling this.
+ */
+static inline bool
+zmm_regs_available(void)
+{
#ifdef HAVE_XSAVE_INTRINSICS
- return (_xgetbv(0) & 0xe0) != 0;
+ return (_xgetbv(0) & 0xe6) == 0xe6;
#else
return false;
#endif
}
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+static inline bool
+avx512_popcnt_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+ return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
+ (exx[1] & (1 << 30)) != 0; /* avx512-bw */
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ return xsave_available() &&
+ zmm_regs_available() &&
+ avx512_popcnt_available();
+}
+
#endif /* TRY_POPCNT_FAST */
--
2.25.1
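The combined leaf-7 check in v3 reads two different registers from one CPUID call: ECX bit 14 for avx512-vpopcntdq and EBX bit 30 for avx512-bw. A sketch of that logic as a testable pure function follows; `avx512_popcnt_and_bw` and `probe_avx512_popcnt_and_bw` are hypothetical names, and the probe assumes a GCC or Clang toolchain on x86 where `<cpuid.h>` provides `__get_cpuid_count()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID_PROBE 1
#endif

#define LEAF7_EBX_AVX512BW          (UINT32_C(1) << 30)
#define LEAF7_ECX_AVX512VPOPCNTDQ   (UINT32_C(1) << 14)

/* Both features must be reported by CPUID leaf 7, subleaf 0. */
static bool
avx512_popcnt_and_bw(uint32_t leaf7_ebx, uint32_t leaf7_ecx)
{
    return (leaf7_ecx & LEAF7_ECX_AVX512VPOPCNTDQ) != 0 &&
        (leaf7_ebx & LEAF7_EBX_AVX512BW) != 0;
}

#ifdef HAVE_CPUID_PROBE
/* One CPUID invocation feeds both bit tests. */
bool
probe_avx512_popcnt_and_bw(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;           /* leaf 7 not supported */
    return avx512_popcnt_and_bw(ebx, ecx);
}
#endif

/* Illustrative register values. */
void
leaf7_examples(void)
{
    /* both bits set */
    assert(avx512_popcnt_and_bw(LEAF7_EBX_AVX512BW, LEAF7_ECX_AVX512VPOPCNTDQ));

    /* vpopcntdq without bw: must be rejected */
    assert(!avx512_popcnt_and_bw(0, LEAF7_ECX_AVX512VPOPCNTDQ));
}
```

Keeping the bit tests separate from the CPUID call is a design choice for this sketch only; it lets the "popcnt without bw" case be exercised on any machine.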
On that note, is it necessary to also check for avx512f? At the moment, we are assuming that's supported if the other AVX-512 instructions are available.
No, it's not needed. There are no CPUs with avx512bw/avx512popcnt without avx512f. Unfortunately though, avx512popcnt does not mean avx512bw (I think the deprecated Xeon Phi processors fall in this category) which is why we need both.
On Thu, Apr 18, 2024 at 10:11:08PM +0000, Devulapalli, Raghuveer wrote:
On that note, is it necessary to also check for avx512f? At the moment,
we are assuming that's supported if the other AVX-512 instructions are
available.
No, it's not needed. There are no CPUs with avx512bw/avx512popcnt
without avx512f. Unfortunately though, avx512popcnt does not mean
avx512bw (I think the deprecated Xeon Phi processors fall in this
category) which is why we need both.
Makes sense, thanks. I'm planning to commit this fix sometime early next
week.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote:
Makes sense, thanks. I'm planning to commit this fix sometime early next
week.
Committed.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
On 2024-04-23 11:02:07 -0500, Nathan Bossart wrote:
On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote:
Makes sense, thanks. I'm planning to commit this fix sometime early next
week.
Committed.
I've noticed that the configure probes for this are quite slow - pretty much
the slowest step in a meson setup (and autoconf is similar). While looking
into this, I also noticed that afaict the tests don't do the right thing for
msvc.
...
[6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES
[6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES
[6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES
[7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO
[8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES
[8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO
[9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES
[9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : links: NO
[9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: YES
[9.367] Checking if "x86_64: popcntq instruction" compiles: YES
[9.382] Has header "atomic.h" : NO
...
(the times here are a bit exaggerated, enabling them in meson also turns on
python profiling, which makes everything a bit slower)
Looks like this is largely the fault of including immintrin.h:
echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time gcc -mxsave -xc - -o /dev/null
0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 94184maxresident)k
echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc - -o /dev/null
0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata 86004maxresident)k
Do we really need to link the generated programs? If we instead were able to
just rely on the preprocessor, it'd be vastly faster.
The __sync* and __atomic* checks actually need to link, as the compiler ends
up generating calls to unimplemented functions if the compilation target
doesn't support some operation natively - but I don't think that's true for
the xsave/avx512 stuff.
Afaict we could just check for predefined preprocessor macros:
echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
#define __AVX512BW__ 1
#define __AVX512VPOPCNTDQ__ 1
#define __XSAVE__ 1
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 13292maxresident)k
echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 10972maxresident)k
Now, a reasonable counter-argument would be that only some of these macros are
defined for msvc ([1]). However, as it turns out, the test is broken
today, as msvc doesn't error out when using an intrinsic that's not
"available" by the target architecture, it seems to assume that the caller did
a cpuid check ahead of time.
Check out [2], it shows the various predefined macros for gcc, clang and msvc.
ISTM that the msvc checks for xsave/avx512 being broken should be an open
item?
Greetings,
Andres
[1]: https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170
[2]: https://godbolt.org/z/c8Kj8r3PK
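The preprocessor-only idea can also be expressed inside a C file rather than through `gcc -dM -E`. The sketch below simply reports whether the three macros named above are defined for the current compile flags; the function names are hypothetical, and per the caveat above it is an assumption that a given compiler defines these macros at all.

```c
#include <assert.h>

/* 1 if the compiler was invoked with XSAVE support enabled (e.g. -mxsave). */
static int
have_xsave_macro(void)
{
#ifdef __XSAVE__
    return 1;
#else
    return 0;
#endif
}

/* 1 if AVX-512 byte-and-word instructions are enabled (e.g. -mavx512bw). */
static int
have_avx512bw_macro(void)
{
#ifdef __AVX512BW__
    return 1;
#else
    return 0;
#endif
}

/* 1 if AVX-512 popcount is enabled (e.g. -mavx512vpopcntdq). */
static int
have_avx512vpopcntdq_macro(void)
{
#ifdef __AVX512VPOPCNTDQ__
    return 1;
#else
    return 0;
#endif
}

/* All three are needed before the AVX-512 popcount code can be compiled. */
int
avx512_popcount_buildable(void)
{
    int         ok = have_xsave_macro() &&
        have_avx512bw_macro() &&
        have_avx512vpopcntdq_macro();

    assert(ok == 0 || ok == 1);     /* && always yields 0 or 1 */
    return ok;
}
```

No header beyond `<assert.h>` is included, so this avoids the cost of parsing immintrin.h; it only tells you what the compiler was asked to target, not what the CPU running the binary supports.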
On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote:
I've noticed that the configure probes for this are quite slow - pretty much
the slowest step in a meson setup (and autoconf is similar). While looking
into this, I also noticed that afaict the tests don't do the right thing for
msvc.
...
[6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES
[6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES
[6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES
[7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO
[8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES
[8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO
[9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES
[9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : links: NO
[9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: YES
[9.367] Checking if "x86_64: popcntq instruction" compiles: YES
[9.382] Has header "atomic.h" : NO
...
(the times here are a bit exaggerated, enabling them in meson also turns on
python profiling, which makes everything a bit slower)
Looks like this is largely the fault of including immintrin.h:
echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time gcc -mxsave -xc - -o /dev/null
0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 94184maxresident)k
echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc - -o /dev/null
0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata 86004maxresident)k
Interesting. Thanks for bringing this to my attention.
Do we really need to link the generated programs? If we instead were able to
just rely on the preprocessor, it'd be vastly faster.
The __sync* and __atomic* checks actually need to link, as the compiler ends
up generating calls to unimplemented functions if the compilation target
doesn't support some operation natively - but I don't think that's true for
the xsave/avx512 stuff
Afaict we could just check for predefined preprocessor macros:
echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
#define __AVX512BW__ 1
#define __AVX512VPOPCNTDQ__ 1
#define __XSAVE__ 1
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 13292maxresident)k
echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 10972maxresident)k
Seems promising. I can't think of a reason that wouldn't work.
Now, a reasonable counter-argument would be that only some of these macros are
defined for msvc ([1]). However, as it turns out, the test is broken
today, as msvc doesn't error out when using an intrinsic that's not
"available" by the target architecture, it seems to assume that the caller did
a cpuid check ahead of time.
Check out [2], it shows the various predefined macros for gcc, clang and msvc.
ISTM that the msvc checks for xsave/avx512 being broken should be an open
item?
I'm not following this one. At the moment, we always do a runtime check
for the AVX-512 stuff, so in the worst case we'd check CPUID at startup and
set the function pointers appropriately, right? We could, of course, still
fix it, though.
--
nathan
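For illustration, a minimal sketch of what a preprocessor-only check amounts to on the C side, assuming a gcc- or clang-like compiler that predefines __AVX512BW__ and __AVX512VPOPCNTDQ__ when the corresponding -m flags (or a suitable -march) are in effect. The helper name is hypothetical and not part of PostgreSQL; as mentioned above, msvc only defines some of these macros, so this would only cover gcc-like compilers.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: reports whether this translation unit was compiled
 * with the AVX-512 popcount features enabled, using only predefined
 * preprocessor macros (no compile or link test needed).
 */
static inline bool
compiled_with_avx512_popcnt(void)
{
#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512BW__)
	return true;
#else
	return false;
#endif
}
```

Note that this only says what the compiler is allowed to emit for this translation unit; it says nothing about the CPU the binary later runs on, which is why a runtime CPUID check is still needed unless the build targets such a CPU explicitly.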
On Tue, Jul 30, 2024 at 04:32:07PM -0500, Nathan Bossart wrote:
On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote:
Afaict we could just check for predefined preprocessor macros:
echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
#define __AVX512BW__ 1
#define __AVX512VPOPCNTDQ__ 1
#define __XSAVE__ 1
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 13292maxresident)k
echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 10972maxresident)k
Seems promising. I can't think of a reason that wouldn't work.
Now, a reasonable counter-argument would be that only some of these macros are
defined for msvc ([1]). However, as it turns out, the test is broken
today, as msvc doesn't error out when using an intrinsic that's not
"available" by the target architecture, it seems to assume that the caller did
a cpuid check ahead of time.
Hm. Upon further inspection, I see that MSVC appears to be missing
__XSAVE__ and __AVX512VPOPCNTDQ__, which is unfortunate. Still, I think
the worst case scenario is that the CPUID check fails and we don't use
AVX-512 instructions. AFAICT we aren't adding new function pointers in any
builds that don't already have them, just compiling some extra unused code.
--
nathan
Hi,
On 2024-07-30 16:32:07 -0500, Nathan Bossart wrote:
On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote:
Now, a reasonable counter-argument would be that only some of these macros are
defined for msvc ([1]). However, as it turns out, the test is broken
today, as msvc doesn't error out when using an intrinsic that's not
"available" by the target architecture, it seems to assume that the caller did
a cpuid check ahead of time.
Check out [2], it shows the various predefined macros for gcc, clang and msvc.
ISTM that the msvc checks for xsave/avx512 being broken should be an open
item?
I'm not following this one. At the moment, we always do a runtime check
for the AVX-512 stuff, so in the worst case we'd check CPUID at startup and
set the function pointers appropriately, right? We could, of course, still
fix it, though.
Ah, I somehow thought we'd avoid the runtime check in case we determine at
compile time we don't need any extra flags to enable the AVX512 stuff (similar
to how we deal with crc32). But it looks like that's not the case - which
seems pretty odd to me:
This turns something that can be a single instruction into an indirect
function call, even if we could know that it's guaranteed to be available for
the compilation target, due to -march=....
It's one thing for the avx512 path to have that overhead, but it's
particularly absurd for pg_popcount32/pg_popcount64, where
a) The function call overhead is a larger proportion of the cost.
b) the instruction is almost universally available, including in the
architecture baseline x86-64-v2, which several distros are using as the
x86-64 baseline.
Why are we actually checking for xsave? We're not using xsave itself and I
couldn't find a comment in 792752af4eb5 explaining what we're using it as a
proxy for? Is that just to know if _xgetbv() exists? Is it actually possible
that xsave isn't available when avx512 is?
Greetings,
Andres Freund
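To make the overhead argument concrete, here is a small sketch (not the PostgreSQL implementation) contrasting the two shapes. It assumes a gcc- or clang-like compiler, where __POPCNT__ is predefined whenever the compilation target guarantees the instruction (e.g. -mpopcnt or -march=x86-64-v2) and __builtin_popcount() is available. All sketch_* names are hypothetical.

```c
#include <stdint.h>

/* Portable fallback: clears the lowest set bit until none remain. */
static int
sketch_popcount32_portable(uint32_t word)
{
	int			count = 0;

	while (word)
	{
		word &= word - 1;
		count++;
	}
	return count;
}

#ifdef __POPCNT__
/* The target guarantees POPCNT: this can compile to a single instruction. */
#define sketch_popcount32(w) __builtin_popcount(w)
#else
/*
 * Otherwise every call goes through a function pointer.  A real
 * implementation would repoint this after a CPUID check; here it simply
 * stays on the portable version.
 */
static int	(*sketch_popcount32_ptr) (uint32_t) = sketch_popcount32_portable;
#define sketch_popcount32(w) sketch_popcount32_ptr(w)
#endif
```

With the first branch the call site can be inlined; with the second, each call pays for an indirect call, which is proportionally expensive for a one-word popcount.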
On Wed, Jul 31, 2024 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
It's one thing for the avx512 path to have that overhead, but it's
particularly absurd for pg_popcount32/pg_popcount64, where
a) The function call overhead is a larger proportion of the cost.
b) the instruction is almost universally available, including in the
architecture baseline x86-64-v2, which several distros are using as the
x86-64 baseline.
FWIW, another recent thread about that:
/messages/by-id/CA+hUKGKS64zJezV9y9mPcB-J0i+fLGiv3FAdwSH_3SCaVdrjyQ@mail.gmail.com
On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote:
Ah, I somehow thought we'd avoid the runtime check in case we determine at
compile time we don't need any extra flags to enable the AVX512 stuff (similar
to how we deal with crc32). But it looks like that's not the case - which
seems pretty odd to me:
This turns something that can be a single instruction into an indirect
function call, even if we could know that it's guaranteed to be available for
the compilation target, due to -march=....
It's one thing for the avx512 path to have that overhead, but it's
particularly absurd for pg_popcount32/pg_popcount64, where
a) The function call overhead is a larger proportion of the cost.
b) the instruction is almost universally available, including in the
architecture baseline x86-64-v2, which several distros are using as the
x86-64 baseline.
Yeah, pg_popcount32/64 have been doing this since v12 (02a6a54). Until v17
(cc4826d), pg_popcount() repeatedly calls these function pointers, too. I
think it'd be awesome if we could start requiring some of these "almost
universally available" instructions, but AFAICT that brings its own
complexity [0].
Why are we actually checking for xsave? We're not using xsave itself and I
couldn't find a comment in 792752af4eb5 explaining what we're using it as a
proxy for? Is that just to know if _xgetbv() exists? Is it actually possible
that xsave isn't available when avx512 is?
Yes, it's to verify we have XGETBV, which IIUC requires support from both
the processor and the OS (see 598e011 and upthread discussion). AFAIK the
way we are detecting AVX-512 support is quite literally by-the-book unless
I've gotten something wrong.
[0]: /messages/by-id/ZmpG2ZzT30Q75BZO@nathan
--
nathan
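As a rough sketch of the "by-the-book" sequence described above (this is not the code from 792752a), assuming gcc or clang on x86-64 with <cpuid.h> and a compiler new enough to accept target("xsave"). The function names are hypothetical, and the 0xe6 XCR0 mask is my reading of which state bits matter: SSE, AVX, opmask, and the two ZMM halves.

```c
#include <stdbool.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#include <immintrin.h>

/* xsave is enabled for this function only, so _xgetbv() works without -mxsave */
__attribute__((target("xsave")))
static bool
sketch_zmm_enabled_by_os(void)
{
	/* XCR0 bits 1-2 (SSE/AVX state) and 5-7 (opmask and ZMM state) */
	return (_xgetbv(0) & 0xe6) == 0xe6;
}

bool
sketch_avx512_popcnt_available(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 1, ECX bit 27 (OSXSAVE): the OS lets us execute XGETBV */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
		return false;

	/* Has the OS enabled saving/restoring of the ZMM registers? */
	if (!sketch_zmm_enabled_by_os())
		return false;

	/* CPUID leaf 7, subleaf 0: AVX512F, AVX512BW, AVX512VPOPCNTDQ */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return (ebx & (1u << 16)) != 0 &&
		(ebx & (1u << 30)) != 0 &&
		(ecx & (1u << 14)) != 0;
}
#else
/* Not x86-64 or not a gcc-like compiler: report the feature as unavailable. */
bool
sketch_avx512_popcnt_available(void)
{
	return false;
}
#endif
```

The order matters: XGETBV is only executed once CPUID has reported OSXSAVE, which is the processor-plus-OS dependency mentioned above.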
Hi,
On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote:
On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote:
Ah, I somehow thought we'd avoid the runtime check in case we determine at
compile time we don't need any extra flags to enable the AVX512 stuff (similar
to how we deal with crc32). But it looks like that's not the case - which
seems pretty odd to me:
This turns something that can be a single instruction into an indirect
function call, even if we could know that it's guaranteed to be available for
the compilation target, due to -march=....
It's one thing for the avx512 path to have that overhead, but it's
particularly absurd for pg_popcount32/pg_popcount64, where
a) The function call overhead is a larger proportion of the cost.
b) the instruction is almost universally available, including in the
architecture baseline x86-64-v2, which several distros are using as the
x86-64 baseline.
Yeah, pg_popcount32/64 have been doing this since v12 (02a6a54). Until v17
(cc4826d), pg_popcount() repeatedly calls these function pointers, too. I
think it'd be awesome if we could start requiring some of these "almost
universally available" instructions, but AFAICT that brings its own
complexity [0].
I'll respond there...
Why are we actually checking for xsave? We're not using xsave itself and I
couldn't find a comment in 792752af4eb5 explaining what we're using it as a
proxy for? Is that just to know if _xgetbv() exists? Is it actually possible
that xsave isn't available when avx512 is?
Yes, it's to verify we have XGETBV, which IIUC requires support from both
the processor and the OS (see 598e011 and upthread discussion). AFAIK the
way we are detecting AVX-512 support is quite literally by-the-book unless
I've gotten something wrong.
I'm basically wondering whether we need to check for compiler (not OS support)
support for xsave if we also check for -mavx512vpopcntdq -mavx512bw
support. Afaict the latter implies support for xsave.
andres@alap6:~$ echo|gcc -c - -march=x86-64 -xc -dM -E - -o -|grep '__XSAVE__'
andres@alap6:~$ echo|gcc -c - -march=x86-64 -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep '__XSAVE__'
#define __XSAVE__ 1
#define __XSAVE__ 1
Greetings,
Andres Freund
On Tue, Jul 30, 2024 at 06:46:51PM -0700, Andres Freund wrote:
On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote:
On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote:
Why are we actually checking for xsave? We're not using xsave itself and I
couldn't find a comment in 792752af4eb5 explaining what we're using it as a
proxy for? Is that just to know if _xgetbv() exists? Is it actually possible
that xsave isn't available when avx512 is?
Yes, it's to verify we have XGETBV, which IIUC requires support from both
the processor and the OS (see 598e011 and upthread discussion). AFAIK the
way we are detecting AVX-512 support is quite literally by-the-book unless
I've gotten something wrong.
I'm basically wondering whether we need to check for compiler (not OS support)
support for xsave if we also check for -mavx512vpopcntdq -mavx512bw
support. Afaict the latter implies support for xsave.
The main purpose of the XSAVE compiler check is to determine whether we
need to add -mxsave in order to use _xgetbv() [0]. If that wasn't a
factor, we could probably skip it. Earlier versions of the patch used
inline assembly in the non-MSVC path to call XGETBV, which I was trying to
avoid.
[0]: /messages/by-id/20240330032209.GA2018686@nathanxps13
--
nathan
Hi,
On 2024-07-30 21:01:31 -0500, Nathan Bossart wrote:
On Tue, Jul 30, 2024 at 06:46:51PM -0700, Andres Freund wrote:
On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote:
On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote:
Why are we actually checking for xsave? We're not using xsave itself and I
couldn't find a comment in 792752af4eb5 explaining what we're using it as a
proxy for? Is that just to know if _xgetbv() exists? Is it actually possible
that xsave isn't available when avx512 is?
Yes, it's to verify we have XGETBV, which IIUC requires support from both
the processor and the OS (see 598e011 and upthread discussion). AFAIK the
way we are detecting AVX-512 support is quite literally by-the-book unless
I've gotten something wrong.
I'm basically wondering whether we need to check for compiler (not OS support)
support for xsave if we also check for -mavx512vpopcntdq -mavx512bw
support. Afaict the latter implies support for xsave.
The main purpose of the XSAVE compiler check is to determine whether we
need to add -mxsave in order to use _xgetbv() [0]. If that wasn't a
factor, we could probably skip it. Earlier versions of the patch used
inline assembly in the non-MSVC path to call XGETBV, which I was trying to
avoid.
My point is that _xgetbv() is made available by -mavx512vpopcntdq -mavx512bw
alone, without needing -mxsave:
echo -e '#include <immintrin.h>\nint main() { return _xgetbv(0) & 0xe0; }'|time gcc -march=x86-64 -c -xc - -o /dev/null
-> fails
echo -e '#include <immintrin.h>\nint main() { return _xgetbv(0) & 0xe0;}'|time gcc -march=x86-64 -mavx512vpopcntdq -mavx512bw -c -xc - -o /dev/null
-> succeeds
Greetings,
Andres Freund
On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
On 2024-07-30 21:01:31 -0500, Nathan Bossart wrote:
The main purpose of the XSAVE compiler check is to determine whether we
need to add -mxsave in order to use _xgetbv() [0]. If that wasn't a
factor, we could probably skip it. Earlier versions of the patch used
inline assembly in the non-MSVC path to call XGETBV, which I was trying to
avoid.
My point is that _xgetbv() is made available by -mavx512vpopcntdq -mavx512bw
alone, without needing -mxsave:
Oh, I see. I'll work on a patch to remove that compiler check, then...
--
nathan
On Tue, Jul 30, 2024 at 10:01:50PM -0500, Nathan Bossart wrote:
On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
My point is that _xgetbv() is made available by -mavx512vpopcntdq -mavx512bw
alone, without needing -mxsave:
Oh, I see. I'll work on a patch to remove that compiler check, then...
As I started on this, I remembered why I needed it. The file
pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order
to avoid inadvertently issuing any AVX-512 instructions before determining
we have support. If that's not a concern, we could still probably remove
the XSAVE check.
--
nathan
Hi,
On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote:
On Tue, Jul 30, 2024 at 10:01:50PM -0500, Nathan Bossart wrote:
On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote:
My point is that _xgetbv() is made available by -mavx512vpopcntdq -mavx512bw
alone, without needing -mxsave:
Oh, I see. I'll work on a patch to remove that compiler check, then...
As I started on this, I remembered why I needed it. The file
pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order
to avoid inadvertently issuing any AVX-512 instructions before determining
we have support. If that's not a concern, we could still probably remove
the XSAVE check.
I think it's a valid concern - but isn't that theoretically also an issue with
xsave itself? I guess practically the compiler won't do that, because there's
no practical reason to emit any instructions enabled by -mxsave (in contrast
to e.g. -mavx, which does trigger gcc to emit different instructions even for
basic math).
I think this is one of the few instances where msvc has the right approach -
if I use intrinsics to emit a specific instruction, the intrinsic should do
so, regardless of whether the compiler is allowed to do so on its own.
I think enabling options like these on a per-translation-unit basis isn't
really a scalable approach. To actually be safe there could only be a single
function in each TU and that function could only be called after a cpuid check
performed in a separate TU. That a) ends up pretty unreadable b) requires
functions to be implemented in .c files, which we really don't want for some
of this.
I think we'd be better off enabling architectural features on a per-function
basis, roughly like this:
https://godbolt.org/z/a4q9Gc6Ez
For posterity, in the unlikely case anybody reads this after godbolt shuts
down:
I'm thinking we'd have an attribute like this:
/*
* GCC like compilers don't support intrinsics without those intrinsics explicitly
* having been enabled. We can't just add these options more widely, as that allows the
* compiler to emit such instructions more widely, even if we gate reaching the code using
* intrinsics. So we just enable the relevant support for individual functions.
*
* In contrast to this, msvc allows use of intrinsics independent of what the compiler
* otherwise is allowed to emit.
*/
#ifdef __GNUC__
#define pg_enable_target(foo) __attribute__ ((__target__ (foo)))
#else
#define pg_enable_target(foo)
#endif
and then use that selectively for some functions:
/* FIXME: Should be gated by configure check of -mavx512vpopcntdq -mavx512bw support */
pg_enable_target("avx512vpopcntdq,avx512bw")
uint64_t
pg_popcount_avx512(const char *buf, int bytes)
...
Greetings,
Andres Freund
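A fuller sketch of how the per-function approach could look end to end, under some assumptions: gcc or clang on x86-64, and __builtin_cpu_supports() standing in for a hand-written CPUID/XGETBV check. The pg_enable_target macro follows the definition above; every *_sketch name is hypothetical, and the loop is a simplified stand-in for the real pg_popcount_avx512().

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>
#define pg_enable_target(foo) __attribute__ ((__target__ (foo)))
#define SKETCH_HAVE_AVX512_POPCNT 1
#else
#define pg_enable_target(foo)
#endif

/* Portable byte-at-a-time fallback, also used for the unaligned tail. */
static uint64_t
popcount_portable_sketch(const char *buf, size_t bytes)
{
	uint64_t	count = 0;

	while (bytes--)
	{
		unsigned char b = (unsigned char) *buf++;

		while (b)
		{
			b &= (unsigned char) (b - 1);
			count++;
		}
	}
	return count;
}

#ifdef SKETCH_HAVE_AVX512_POPCNT
/* Only this function may contain AVX-512 instructions. */
pg_enable_target("avx512vpopcntdq,avx512bw")
static uint64_t
popcount_avx512_sketch(const char *buf, size_t bytes)
{
	__m512i		accum = _mm512_setzero_si512();

	while (bytes >= sizeof(__m512i))
	{
		const __m512i val = _mm512_loadu_si512((const void *) buf);

		accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(val));
		buf += sizeof(__m512i);
		bytes -= sizeof(__m512i);
	}
	return (uint64_t) _mm512_reduce_add_epi64(accum) +
		popcount_portable_sketch(buf, bytes);
}
#endif

/* Compiled without any AVX-512 flags, so it is safe to run on any CPU. */
uint64_t
popcount_dispatch_sketch(const char *buf, size_t bytes)
{
#ifdef SKETCH_HAVE_AVX512_POPCNT
	if (__builtin_cpu_supports("avx512vpopcntdq") &&
		__builtin_cpu_supports("avx512bw"))
		return popcount_avx512_sketch(buf, bytes);
#endif
	return popcount_portable_sketch(buf, bytes);
}
```

The point of the design is that the dispatcher and the AVX-512 function can live in the same file: the target attribute confines the extra instructions to the one function that is only reached after the runtime check.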
On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote:
As I started on this, I remembered why I needed it. The file
pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order
to avoid inadvertently issuing any AVX-512 instructions before determining
we have support. If that's not a concern, we could still probably remove
the XSAVE check.
I think it's a valid concern - but isn't that theoretically also an issue with
xsave itself? I guess practically the compiler won't do that, because there's
no practical reason to emit any instructions enabled by -mxsave (in contrast
to e.g. -mavx, which does trigger gcc to emit different instructions even for
basic math).
Yeah, this crossed my mind. It's certainly not the sturdiest of
assumptions...
I think enabling options like these on a per-translation-unit basis isn't
really a scalable approach. To actually be safe there could only be a single
function in each TU and that function could only be called after a cpuid check
performed in a separate TU. That a) ends up pretty unreadable b) requires
functions to be implemented in .c files, which we really don't want for some
of this.
Agreed.
I think we'd be better off enabling architectural features on a per-function
basis, roughly like this:
[...]
/* FIXME: Should be gated by configure check of -mavx512vpopcntdq -mavx512bw support */
pg_enable_target("avx512vpopcntdq,avx512bw")
uint64_t
pg_popcount_avx512(const char *buf, int bytes)
...
I remember wondering why the CRC-32C code wasn't already doing something
like this (old compiler versions? non-gcc-like compilers?), and I'm not
sure I ever discovered the reason, so out of an abundance of caution I used
the same approach for AVX-512. If we can convince ourselves that
__attribute__((target("..."))) is standard enough at this point, +1 for
moving to that.
--
nathan
On Wed, Jul 31, 2024 at 04:43:02PM -0500, Nathan Bossart wrote:
On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
I think we'd be better off enabling architectural features on a per-function
basis, roughly like this:
[...]
/* FIXME: Should be gated by configure check of -mavx512vpopcntdq -mavx512bw support */
pg_enable_target("avx512vpopcntdq,avx512bw")
uint64_t
pg_popcount_avx512(const char *buf, int bytes)
...
I remember wondering why the CRC-32C code wasn't already doing something
like this (old compiler versions? non-gcc-like compilers?), and I'm not
sure I ever discovered the reason, so out of an abundance of caution I used
the same approach for AVX-512. If we can convince ourselves that
__attribute__((target("..."))) is standard enough at this point, +1 for
moving to that.
I looked into this some more, and found the following:
* We added SSE 4.2 CRC support in April 2015 (commit 3dc2d62). gcc support
for __attribute__((target("sse4.2"))) was added in 4.9.0 (April 2014).
clang support was added in 3.8 (March 2016).
* We added ARMv8 CRC support in April 2018 (commit f044d71). gcc support
for __attribute__((target("+crc"))) was added in 6.3 (December 2016). I
didn't find precisely when clang support was added, but until 16.0.0
(March 2023), including arm_acle.h requires the -march flag [0], and you
had to use "crc" (plus sign omitted) as the target [1].
* We added AVX-512 support in April 2024 (commit 792752a). gcc support for
__attribute__((target("avx512vpopcntdq,avx512bw"))) was added in 7.1 (May
2017). clang support was added in 5.0.0 (September 2017). However, the
"xsave" target was not supported until 9.1 for gcc (May 2019) and 9.0.0
for clang (September 2019), and we need that for our AVX-512 code, too.
So, at least for the CRC code, __attribute__((target("..."))) was probably
not widely available enough yet when it was first added. Unfortunately,
the ARMv8 CRC target support (without -march) is still pretty new, but it
might be possible to switch the others to a per-function approach in v18.
[0]: https://github.com/llvm/llvm-project/commit/30b67c6
[1]: https://releases.llvm.org/16.0.0/tools/clang/docs/ReleaseNotes.html#arm-and-aarch64-support
--
nathan
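Purely as an illustration of the version cutoffs listed above for the "xsave" target (gcc 9.1, clang 9.0.0), a compile-time gate could look like the following. The macro name is hypothetical, and vendor-patched compilers (Apple clang, for instance, uses its own version numbering) may not match upstream versions, so a configure-time probe like the one in the attached patch is the more reliable route.

```c
/*
 * Hypothetical gate: 1 if the compiler is assumed to accept
 * __attribute__((target("xsave"))), 0 otherwise.  clang must be tested
 * first because it also defines __GNUC__.
 */
#if defined(__clang__)
#define SKETCH_HAVE_TARGET_XSAVE (__clang_major__ >= 9)
#elif defined(__GNUC__)
#define SKETCH_HAVE_TARGET_XSAVE \
	(__GNUC__ > 9 || (__GNUC__ == 9 && __GNUC_MINOR__ >= 1))
#else
#define SKETCH_HAVE_TARGET_XSAVE 0
#endif
```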
On Tue, Oct 08, 2024 at 09:36:03PM -0500, Nathan Bossart wrote:
On Wed, Jul 31, 2024 at 04:43:02PM -0500, Nathan Bossart wrote:
On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote:
I think we'd be better off enabling architectural features on a per-function
basis, roughly like this:
[...]
/* FIXME: Should be gated by configure check of -mavx512vpopcntdq -mavx512bw support */
pg_enable_target("avx512vpopcntdq,avx512bw")
uint64_t
pg_popcount_avx512(const char *buf, int bytes)
...
I remember wondering why the CRC-32C code wasn't already doing something
like this (old compiler versions? non-gcc-like compilers?), and I'm not
sure I ever discovered the reason, so out of an abundance of caution I used
the same approach for AVX-512. If we can convince ourselves that
__attribute__((target("..."))) is standard enough at this point, +1 for
moving to that.
[...]
So, at least for the CRC code, __attribute__((target("..."))) was probably
not widely available enough yet when it was first added. Unfortunately,
the ARMv8 CRC target support (without -march) is still pretty new, but it
might be possible to switch the others to a per-function approach in v18.
Here is a first attempt at using __attribute__((target("..."))) for the
AVX-512 stuff. Besides allowing us to consolidate the code into a single
file, this simplifies the build file changes quite a bit.
--
nathan
Attachments:
v1-0001-use-__attribute__-target-.-for-AVX-512-stuff.patch (text/plain; charset=us-ascii)
From c97e25e56347c90f169a5ce069a9ea06c873915b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v1 1/1] use __attribute__((target(...))) for AVX-512 stuff
---
config/c-compiler.m4 | 60 +++++-----
configure | 163 ++++++---------------------
configure.ac | 17 +--
meson.build | 17 +--
src/Makefile.global.in | 5 -
src/include/c.h | 10 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 12 +-
src/port/meson.build | 7 +-
src/port/pg_popcount_avx512.c | 86 +++++++++++++-
src/port/pg_popcount_avx512_choose.c | 102 -----------------
11 files changed, 171 insertions(+), 312 deletions(-)
delete mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..aa90f8ef33 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -700,20 +700,20 @@ undefine([Ac_cachevar])dnl
# Check if the compiler supports the XSAVE instructions using the _xgetbv
# intrinsic function.
#
-# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
-# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+# If the intrinsics are supported, sets pgac_xsave_intrinsics.
AC_DEFUN([PGAC_XSAVE_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
- [return _xgetbv(0) & 0xe0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl
+AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+ __attribute__((target("xsave")))
+ static int xsave_test(void)
+ {
+ return _xgetbv(0) & 0xe0;
+ }],
+ [return xsave_test();])],
[Ac_cachevar=yes],
- [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+ [Ac_cachevar=no])])
if test x"$Ac_cachevar" = x"yes"; then
- CFLAGS_XSAVE="$1"
pgac_xsave_intrinsics=yes
fi
undefine([Ac_cachevar])dnl
@@ -725,29 +725,27 @@ undefine([Ac_cachevar])dnl
# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
#
-# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
-# -mavx512bw). If the intrinsics are supported, sets
-# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics.
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
- [const char buf@<:@sizeof(__m512i)@:>@;
- PG_INT64_TYPE popcnt = 0;
- __m512i accum = _mm512_setzero_si512();
- const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
- accum = _mm512_add_epi64(accum, cnt);
- popcnt = _mm512_reduce_add_epi64(accum);
- /* return computed value, to prevent the above being optimized away */
- return popcnt == 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+ __attribute__((target("avx512vpopcntdq","avx512bw")))
+ static int popcount_test(void)
+ {
+ const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return (int) popcnt;
+ }],
+ [return popcount_test();])],
[Ac_cachevar=yes],
- [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+ [Ac_cachevar=no])])
if test x"$Ac_cachevar" = x"yes"; then
- CFLAGS_POPCNT="$1"
pgac_avx512_popcnt_intrinsics=yes
fi
undefine([Ac_cachevar])dnl
diff --git a/configure b/configure
index 3a577e463b..028f448a29 100755
--- a/configure
+++ b/configure
@@ -647,9 +647,6 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
-PG_POPCNT_OBJS
-CFLAGS_POPCNT
-CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17270,185 +17267,99 @@ fi
# Check for XSAVE intrinsics
#
-CFLAGS_XSAVE=""
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
-if ${pgac_cv_xsave_intrinsics_+:} false; then :
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv" >&5
+$as_echo_n "checking for _xgetbv... " >&6; }
+if ${pgac_cv_xsave_intrinsics+:} false; then :
$as_echo_n "(cached) " >&6
else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h. */
-#include <immintrin.h>
-int
-main ()
-{
-return _xgetbv(0) & 0xe0;
- ;
- return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_xsave_intrinsics_=yes
-else
- pgac_cv_xsave_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
- conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
-$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
-if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
- CFLAGS_XSAVE=""
- pgac_xsave_intrinsics=yes
-fi
-
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
-if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
- $as_echo_n "(cached) " >&6
-else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mxsave"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
#include <immintrin.h>
+ __attribute__((target("xsave")))
+ static int xsave_test(void)
+ {
+ return _xgetbv(0) & 0xe0;
+ }
int
main ()
{
-return _xgetbv(0) & 0xe0;
+return xsave_test();
;
return 0;
}
_ACEOF
if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_xsave_intrinsics__mxsave=yes
+ pgac_cv_xsave_intrinsics=yes
else
- pgac_cv_xsave_intrinsics__mxsave=no
+ pgac_cv_xsave_intrinsics=no
fi
rm -f core conftest.err conftest.$ac_objext \
conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
-$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
-if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
- CFLAGS_XSAVE="-mxsave"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics" >&5
+$as_echo "$pgac_cv_xsave_intrinsics" >&6; }
+if test x"$pgac_cv_xsave_intrinsics" = x"yes"; then
pgac_xsave_intrinsics=yes
fi
-fi
if test x"$pgac_xsave_intrinsics" = x"yes"; then
$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
fi
-
# Check for AVX-512 popcount intrinsics
#
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
if test x"$host_cpu" = x"x86_64"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics+:} false; then :
$as_echo_n "(cached) " >&6
else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h. */
-#include <immintrin.h>
-int
-main ()
-{
-const char buf[sizeof(__m512i)];
- PG_INT64_TYPE popcnt = 0;
- __m512i accum = _mm512_setzero_si512();
- const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
- accum = _mm512_add_epi64(accum, cnt);
- popcnt = _mm512_reduce_add_epi64(accum);
- /* return computed value, to prevent the above being optimized away */
- return popcnt == 0;
- ;
- return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_avx512_popcnt_intrinsics_=yes
-else
- pgac_cv_avx512_popcnt_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
- conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
- CFLAGS_POPCNT=""
- pgac_avx512_popcnt_intrinsics=yes
-fi
-
- if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then :
- $as_echo_n "(cached) " >&6
-else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
#include <immintrin.h>
+ __attribute__((target("avx512vpopcntdq","avx512bw")))
+ static int popcount_test(void)
+ {
+ const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return (int) popcnt;
+ }
int
main ()
{
-const char buf[sizeof(__m512i)];
- PG_INT64_TYPE popcnt = 0;
- __m512i accum = _mm512_setzero_si512();
- const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
- accum = _mm512_add_epi64(accum, cnt);
- popcnt = _mm512_reduce_add_epi64(accum);
- /* return computed value, to prevent the above being optimized away */
- return popcnt == 0;
+return popcount_test();
;
return 0;
}
_ACEOF
if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes
+ pgac_cv_avx512_popcnt_intrinsics=yes
else
- pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no
+ pgac_cv_avx512_popcnt_intrinsics=no
fi
rm -f core conftest.err conftest.$ac_objext \
conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" = x"yes"; then
- CFLAGS_POPCNT="-mavx512vpopcntdq -mavx512bw"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics" = x"yes"; then
pgac_avx512_popcnt_intrinsics=yes
fi
- fi
if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
- PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
fi
fi
-
-
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 55f6c46d33..8bdb9c576a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2052,32 +2052,19 @@ fi
# Check for XSAVE intrinsics
#
-CFLAGS_XSAVE=""
-PGAC_XSAVE_INTRINSICS([])
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
- PGAC_XSAVE_INTRINSICS([-mxsave])
-fi
+PGAC_XSAVE_INTRINSICS()
if test x"$pgac_xsave_intrinsics" = x"yes"; then
AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
fi
-AC_SUBST(CFLAGS_XSAVE)
# Check for AVX-512 popcount intrinsics
#
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
if test x"$host_cpu" = x"x86_64"; then
- PGAC_AVX512_POPCNT_INTRINSICS([])
- if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
- PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
- fi
+ PGAC_AVX512_POPCNT_INTRINSICS()
if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
- PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
fi
fi
-AC_SUBST(CFLAGS_POPCNT)
-AC_SUBST(PG_POPCNT_OBJS)
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
diff --git a/meson.build b/meson.build
index 58e67975e8..9492f9d52e 100644
--- a/meson.build
+++ b/meson.build
@@ -2157,25 +2157,20 @@ endforeach
# Check for the availability of XSAVE intrinsics.
###############################################################
-cflags_xsave = []
if host_cpu == 'x86' or host_cpu == 'x86_64'
prog = '''
#include <immintrin.h>
+__attribute__((target("xsave")))
int main(void)
{
return _xgetbv(0) & 0xe0;
}
'''
- if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
- args: test_c_args)
+ if cc.links(prog, name: 'XSAVE intrinsics', args: test_c_args)
cdata.set('HAVE_XSAVE_INTRINSICS', 1)
- elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
- args: test_c_args + ['-mxsave'])
- cdata.set('HAVE_XSAVE_INTRINSICS', 1)
- cflags_xsave += '-mxsave'
endif
endif
@@ -2185,12 +2180,12 @@ endif
# Check for the availability of AVX-512 popcount intrinsics.
###############################################################
-cflags_popcnt = []
if host_cpu == 'x86_64'
prog = '''
#include <immintrin.h>
+__attribute__((target("avx512vpopcntdq","avx512bw")))
int main(void)
{
const char buf[sizeof(__m512i)];
@@ -2205,13 +2200,9 @@ int main(void)
}
'''
- if cc.links(prog, name: 'AVX-512 popcount without -mavx512vpopcntdq -mavx512bw',
+ if cc.links(prog, name: 'AVX-512 popcount',
args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
- elif cc.links(prog, name: 'AVX-512 popcount with -mavx512vpopcntdq -mavx512bw',
- args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'] + ['-mavx512bw'])
- cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
- cflags_popcnt += ['-mavx512vpopcntdq'] + ['-mavx512bw']
endif
endif
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 42f50b4976..45696247e9 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,9 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
-CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
-CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
PERMIT_MISSING_VARIABLE_DECLARATIONS = @PERMIT_MISSING_VARIABLE_DECLARATIONS@
CXXFLAGS = @CXXFLAGS@
@@ -762,9 +760,6 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
-# files needed for the chosen popcount implementation
-PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
-
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/c.h b/src/include/c.h
index 55dec71a6d..6f5ca25542 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -174,6 +174,16 @@
#define pg_attribute_nonnull(...)
#endif
+/*
+ * pg_attribute_target allows specifying different target options that the
+ * function should be compiled with (e.g., for using special CPU instructions).
+ */
+#if __has_attribute (target)
+#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
+#else
+#define pg_attribute_target(...)
+#endif
+
/*
* Append PG_USED_FOR_ASSERTS_ONLY to definitions of variables that are only
* used in assert-enabled builds, to avoid compiler warnings about unused
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 850e927584..479aa08420 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -102,10 +102,8 @@ pgxs_kv = {
' '.join(cflags_no_missing_var_decls),
'CFLAGS_CRC': ' '.join(cflags_crc),
- 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
- 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -181,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index 9324ec2d9f..366c814bd9 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,13 +38,13 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
- $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
@@ -92,16 +92,6 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
-# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
-pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
-
-# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
-pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
-
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 1150966ab7..83a0632520 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -84,8 +85,6 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
- ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
- ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -100,8 +99,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
-pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
+pgport_cflags = {'crc': cflags_crc}
+pgport_sources_cflags = {'crc': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 9d3149e2d0..b598e86554 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -12,7 +12,17 @@
*/
#include "c.h"
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
#include "port/pg_bitutils.h"
@@ -21,12 +31,82 @@
* use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on
* the function pointers that are only used when TRY_POPCNT_FAST is set.
*/
-#ifdef TRY_POPCNT_FAST
+#if defined(TRY_POPCNT_FAST) && defined(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK)
+
+/*
+ * Does CPUID say there's support for XSAVE instructions?
+ */
+static inline bool
+xsave_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+ return (exx[2] & (1 << 27)) != 0; /* osxsave */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that xsave_available() returns true
+ * before calling this.
+ */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+ return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+static inline bool
+avx512_popcnt_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+ return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
+ (exx[1] & (1 << 30)) != 0; /* avx512-bw */
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ return xsave_available() &&
+ zmm_regs_available() &&
+ avx512_popcnt_available();
+}
/*
* pg_popcount_avx512
* Returns the number of 1-bits in buf
*/
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
uint64
pg_popcount_avx512(const char *buf, int bytes)
{
@@ -82,6 +162,7 @@ pg_popcount_avx512(const char *buf, int bytes)
* pg_popcount_masked_avx512
* Returns the number of 1-bits in buf after applying the mask to each byte
*/
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
uint64
pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
{
@@ -138,4 +219,5 @@ pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
return _mm512_reduce_add_epi64(accum);
}
-#endif /* TRY_POPCNT_FAST */
+#endif /* TRY_POPCNT_FAST &&
+ * USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
deleted file mode 100644
index b37107803a..0000000000
--- a/src/port/pg_popcount_avx512_choose.c
+++ /dev/null
@@ -1,102 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_popcount_avx512_choose.c
- * Test whether we can use the AVX-512 pg_popcount() implementation.
- *
- * Copyright (c) 2024, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- * src/port/pg_popcount_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-#include "c.h"
-
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE_XSAVE_INTRINSICS
-#include <immintrin.h>
-#endif
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
-#include "port/pg_bitutils.h"
-
-/*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
- * use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
- */
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
- return (exx[2] & (1 << 27)) != 0; /* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
- return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
- return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
- __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
- __cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
- return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
- (exx[1] & (1 << 30)) != 0; /* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
- return xsave_available() &&
- zmm_regs_available() &&
- avx512_popcnt_available();
-}
-
-#endif /* TRY_POPCNT_FAST */
--
2.39.5 (Apple Git-154)
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, failed
Spec compliant: tested, failed
Documentation: tested, failed
Changes LGTM. Makes the Makefile look clean. Built and ran tests with `make check` with gcc-13 on an ICX and gcc-11 on SKX. I built on top of this patch and converted SSE4.2 and AVX-512 CRC32C to use function attributes too.
The new status of this patch is: Ready for Committer
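The CRC32C conversion mentioned in the review is not shown in this thread. As a rough illustration of the same idea, here is a minimal sketch of an SSE 4.2 CRC32C routine that gets its instruction set from a function attribute instead of per-file CFLAGS, with a runtime check and a portable fallback. The names crc32c_sse42, crc32c_sw, crc32c, and HAVE_SSE42_TARGET are hypothetical, and __builtin_cpu_supports() (GCC/Clang) is only a stand-in for PostgreSQL's own CPUID-based checks; this is not the code from the follow-up patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <nmmintrin.h>
#define HAVE_SSE42_TARGET 1		/* hypothetical symbol, for this sketch only */

/* Only this function is compiled with SSE 4.2 enabled; no -msse4.2 needed. */
__attribute__((target("sse4.2")))
static uint32_t
crc32c_sse42(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
		crc = _mm_crc32_u8(crc, *p++);
	return crc;
}
#endif

/* Portable bitwise fallback (reflected CRC-32C polynomial 0x82F63B78). */
static uint32_t
crc32c_sw(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Picks an implementation at run time, then applies the final inversion. */
static uint32_t
crc32c(const void *data, size_t len)
{
	const unsigned char *p = (const unsigned char *) data;
	uint32_t	crc = 0xFFFFFFFFu;

#ifdef HAVE_SSE42_TARGET
	/* Assumption: the compiler builtin replaces a hand-written CPUID check. */
	if (__builtin_cpu_supports("sse4.2"))
		return crc32c_sse42(crc, p, len) ^ 0xFFFFFFFFu;
#endif
	return crc32c_sw(crc, p, len) ^ 0xFFFFFFFFu;
}
```

Both paths should agree on the standard CRC-32C check value: crc32c("123456789", 9) is 0xE3069283. Because the attribute is attached to a single function, the rest of the file is still built for the baseline architecture, which is what makes the runtime check meaningful.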
BTW, I just realized function attributes for xsave and avx512 don't work on MSVC (see https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630). Not sure if you care about it. It's an easy fix (see https://gcc.godbolt.org/z/Pebdj3vMx).
On Wed, Oct 30, 2024 at 08:53:10PM +0000, Raghuveer Devulapalli wrote:
BTW, I just realized function attributes for xsave and avx512 don't work
on MSVC (see
https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630).
Not sure if you care about it. Its an easy fix (see
https://gcc.godbolt.org/z/Pebdj3vMx).
Oh, good catch. IIUC we only need to check for #ifndef _MSC_VER in the
configure programs for meson. pg_attribute_target will be empty on MSVC,
and I believe we only support meson builds there.
--
nathan
Oh, good catch. IIUC we only need to check for #ifndef _MSC_VER in the
configure programs for meson. pg_attribute_target will be empty on MSVC, and I
believe we only support meson builds there.
Right. __has_attribute (target) produces a compiler warning on MSVC: https://gcc.godbolt.org/z/EfWGxbvj3. Might need to guard that with #if defined(__has_attribute) to get rid of it.
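A minimal sketch of the guard suggested above, assuming the macro keeps the name pg_attribute_target from the patch. The nested #if matters: on a compiler that does not define __has_attribute, writing "#if defined(__has_attribute) && __has_attribute(target)" on one line can still fail to parse. The count_bits function and the choice of the "popcnt" target are hypothetical, used only to show the macro applied to a function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Define pg_attribute_target only when the compiler both provides
 * __has_attribute and reports support for the target attribute.
 */
#if defined(__has_attribute)
#if __has_attribute(target)
#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
#endif
#endif

/* Fallback for compilers such as MSVC: the macro expands to nothing. */
#ifndef pg_attribute_target
#define pg_attribute_target(...)
#endif

/*
 * Hypothetical example function.  The attribute is only applied on x86,
 * since "popcnt" is not a valid target string on other architectures.
 */
#if defined(__x86_64__) || defined(__i386__)
pg_attribute_target("popcnt")
#endif
static int
count_bits(uint64_t word)
{
	int			n = 0;

	/* Clear the lowest set bit until none remain. */
	while (word)
	{
		word &= word - 1;
		n++;
	}
	return n;
}
```

With this shape, a compiler without __has_attribute never evaluates __has_attribute (target), so the warning reported for MSVC should not be triggered, and the function still compiles there as plain C.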
On Wed, Oct 30, 2024 at 04:10:10PM -0500, Nathan Bossart wrote:
On Wed, Oct 30, 2024 at 08:53:10PM +0000, Raghuveer Devulapalli wrote:
BTW, I just realized function attributes for xsave and avx512 don't work
on MSVC (see
https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630).
Not sure if you care about it. Its an easy fix (see
https://gcc.godbolt.org/z/Pebdj3vMx).
Oh, good catch. IIUC we only need to check for #ifndef _MSC_VER in the
configure programs for meson. pg_attribute_target will be empty on MSVC,
and I believe we only support meson builds there.
Here is an updated patch with this change.
--
nathan
Attachments:
v2-0001-use-__attribute__-target-.-for-AVX-512-stuff.patch (text/plain; charset=us-ascii)
From 8cf7c08220a9c0a1dec809794af2dfb719981923 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v2 1/1] use __attribute__((target(...))) for AVX-512 stuff
---
config/c-compiler.m4 | 60 +++++-----
configure | 163 ++++++---------------------
configure.ac | 17 +--
meson.build | 21 ++--
src/Makefile.global.in | 5 -
src/include/c.h | 10 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 12 +-
src/port/meson.build | 7 +-
src/port/pg_popcount_avx512.c | 86 +++++++++++++-
src/port/pg_popcount_avx512_choose.c | 102 -----------------
11 files changed, 175 insertions(+), 312 deletions(-)
delete mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..aa90f8ef33 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -700,20 +700,20 @@ undefine([Ac_cachevar])dnl
# Check if the compiler supports the XSAVE instructions using the _xgetbv
# intrinsic function.
#
-# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
-# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+# If the intrinsics are supported, sets pgac_xsave_intrinsics.
AC_DEFUN([PGAC_XSAVE_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
- [return _xgetbv(0) & 0xe0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl
+AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+ __attribute__((target("xsave")))
+ static int xsave_test(void)
+ {
+ return _xgetbv(0) & 0xe0;
+ }],
+ [return xsave_test();])],
[Ac_cachevar=yes],
- [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+ [Ac_cachevar=no])])
if test x"$Ac_cachevar" = x"yes"; then
- CFLAGS_XSAVE="$1"
pgac_xsave_intrinsics=yes
fi
undefine([Ac_cachevar])dnl
@@ -725,29 +725,27 @@ undefine([Ac_cachevar])dnl
# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
#
-# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
-# -mavx512bw). If the intrinsics are supported, sets
-# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics.
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
- [const char buf@<:@sizeof(__m512i)@:>@;
- PG_INT64_TYPE popcnt = 0;
- __m512i accum = _mm512_setzero_si512();
- const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
- accum = _mm512_add_epi64(accum, cnt);
- popcnt = _mm512_reduce_add_epi64(accum);
- /* return computed value, to prevent the above being optimized away */
- return popcnt == 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+ __attribute__((target("avx512vpopcntdq","avx512bw")))
+ static int popcount_test(void)
+ {
+ const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return (int) popcnt;
+ }],
+ [return popcount_test();])],
[Ac_cachevar=yes],
- [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+ [Ac_cachevar=no])])
if test x"$Ac_cachevar" = x"yes"; then
- CFLAGS_POPCNT="$1"
pgac_avx512_popcnt_intrinsics=yes
fi
undefine([Ac_cachevar])dnl
diff --git a/configure b/configure
index 268ac94ae6..354c8740e1 100755
--- a/configure
+++ b/configure
@@ -647,9 +647,6 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
-PG_POPCNT_OBJS
-CFLAGS_POPCNT
-CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17266,185 +17263,99 @@ fi
# Check for XSAVE intrinsics
#
-CFLAGS_XSAVE=""
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
-if ${pgac_cv_xsave_intrinsics_+:} false; then :
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv" >&5
+$as_echo_n "checking for _xgetbv... " >&6; }
+if ${pgac_cv_xsave_intrinsics+:} false; then :
$as_echo_n "(cached) " >&6
else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h. */
-#include <immintrin.h>
-int
-main ()
-{
-return _xgetbv(0) & 0xe0;
- ;
- return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_xsave_intrinsics_=yes
-else
- pgac_cv_xsave_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
- conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
-$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
-if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
- CFLAGS_XSAVE=""
- pgac_xsave_intrinsics=yes
-fi
-
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
-if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
- $as_echo_n "(cached) " >&6
-else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mxsave"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
#include <immintrin.h>
+ __attribute__((target("xsave")))
+ static int xsave_test(void)
+ {
+ return _xgetbv(0) & 0xe0;
+ }
int
main ()
{
-return _xgetbv(0) & 0xe0;
+return xsave_test();
;
return 0;
}
_ACEOF
if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_xsave_intrinsics__mxsave=yes
+ pgac_cv_xsave_intrinsics=yes
else
- pgac_cv_xsave_intrinsics__mxsave=no
+ pgac_cv_xsave_intrinsics=no
fi
rm -f core conftest.err conftest.$ac_objext \
conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
-$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
-if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
- CFLAGS_XSAVE="-mxsave"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics" >&5
+$as_echo "$pgac_cv_xsave_intrinsics" >&6; }
+if test x"$pgac_cv_xsave_intrinsics" = x"yes"; then
pgac_xsave_intrinsics=yes
fi
-fi
if test x"$pgac_xsave_intrinsics" = x"yes"; then
$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
fi
-
# Check for AVX-512 popcount intrinsics
#
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
if test x"$host_cpu" = x"x86_64"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics+:} false; then :
$as_echo_n "(cached) " >&6
else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h. */
-#include <immintrin.h>
-int
-main ()
-{
-const char buf[sizeof(__m512i)];
- PG_INT64_TYPE popcnt = 0;
- __m512i accum = _mm512_setzero_si512();
- const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
- accum = _mm512_add_epi64(accum, cnt);
- popcnt = _mm512_reduce_add_epi64(accum);
- /* return computed value, to prevent the above being optimized away */
- return popcnt == 0;
- ;
- return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_avx512_popcnt_intrinsics_=yes
-else
- pgac_cv_avx512_popcnt_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
- conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
- CFLAGS_POPCNT=""
- pgac_avx512_popcnt_intrinsics=yes
-fi
-
- if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then :
- $as_echo_n "(cached) " >&6
-else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
#include <immintrin.h>
+ __attribute__((target("avx512vpopcntdq","avx512bw")))
+ static int popcount_test(void)
+ {
+ const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return (int) popcnt;
+ }
int
main ()
{
-const char buf[sizeof(__m512i)];
- PG_INT64_TYPE popcnt = 0;
- __m512i accum = _mm512_setzero_si512();
- const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
- accum = _mm512_add_epi64(accum, cnt);
- popcnt = _mm512_reduce_add_epi64(accum);
- /* return computed value, to prevent the above being optimized away */
- return popcnt == 0;
+return popcount_test();
;
return 0;
}
_ACEOF
if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes
+ pgac_cv_avx512_popcnt_intrinsics=yes
else
- pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no
+ pgac_cv_avx512_popcnt_intrinsics=no
fi
rm -f core conftest.err conftest.$ac_objext \
conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" = x"yes"; then
- CFLAGS_POPCNT="-mavx512vpopcntdq -mavx512bw"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics" = x"yes"; then
pgac_avx512_popcnt_intrinsics=yes
fi
- fi
if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
- PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
fi
fi
-
-
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 3c89b54bf1..d2ee63ab03 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2048,32 +2048,19 @@ fi
# Check for XSAVE intrinsics
#
-CFLAGS_XSAVE=""
-PGAC_XSAVE_INTRINSICS([])
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
- PGAC_XSAVE_INTRINSICS([-mxsave])
-fi
+PGAC_XSAVE_INTRINSICS()
if test x"$pgac_xsave_intrinsics" = x"yes"; then
AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
fi
-AC_SUBST(CFLAGS_XSAVE)
# Check for AVX-512 popcount intrinsics
#
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
if test x"$host_cpu" = x"x86_64"; then
- PGAC_AVX512_POPCNT_INTRINSICS([])
- if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
- PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
- fi
+ PGAC_AVX512_POPCNT_INTRINSICS()
if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
- PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
fi
fi
-AC_SUBST(CFLAGS_POPCNT)
-AC_SUBST(PG_POPCNT_OBJS)
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
diff --git a/meson.build b/meson.build
index bb9d7f5a8e..ef162bd7a7 100644
--- a/meson.build
+++ b/meson.build
@@ -2153,25 +2153,22 @@ endforeach
# Check for the availability of XSAVE intrinsics.
###############################################################
-cflags_xsave = []
if host_cpu == 'x86' or host_cpu == 'x86_64'
prog = '''
#include <immintrin.h>
+#ifndef _MSC_VER
+__attribute__((target("xsave")))
+#endif
int main(void)
{
return _xgetbv(0) & 0xe0;
}
'''
- if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
- args: test_c_args)
- cdata.set('HAVE_XSAVE_INTRINSICS', 1)
- elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
- args: test_c_args + ['-mxsave'])
+ if cc.links(prog, name: 'XSAVE intrinsics', args: test_c_args)
cdata.set('HAVE_XSAVE_INTRINSICS', 1)
- cflags_xsave += '-mxsave'
endif
endif
@@ -2181,12 +2178,14 @@ endif
# Check for the availability of AVX-512 popcount intrinsics.
###############################################################
-cflags_popcnt = []
if host_cpu == 'x86_64'
prog = '''
#include <immintrin.h>
+#ifndef _MSC_VER
+__attribute__((target("avx512vpopcntdq","avx512bw")))
+#endif
int main(void)
{
const char buf[sizeof(__m512i)];
@@ -2201,13 +2200,9 @@ int main(void)
}
'''
- if cc.links(prog, name: 'AVX-512 popcount without -mavx512vpopcntdq -mavx512bw',
+ if cc.links(prog, name: 'AVX-512 popcount',
args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
- elif cc.links(prog, name: 'AVX-512 popcount with -mavx512vpopcntdq -mavx512bw',
- args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'] + ['-mavx512bw'])
- cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
- cflags_popcnt += ['-mavx512vpopcntdq'] + ['-mavx512bw']
endif
endif
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 42f50b4976..45696247e9 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,9 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
-CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
-CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
PERMIT_MISSING_VARIABLE_DECLARATIONS = @PERMIT_MISSING_VARIABLE_DECLARATIONS@
CXXFLAGS = @CXXFLAGS@
@@ -762,9 +760,6 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
-# files needed for the chosen popcount implementation
-PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
-
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/c.h b/src/include/c.h
index 55dec71a6d..6f5ca25542 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -174,6 +174,16 @@
#define pg_attribute_nonnull(...)
#endif
+/*
+ * pg_attribute_target allows specifying different target options that the
+ * function should be compiled with (e.g., for using special CPU instructions).
+ */
+#if __has_attribute (target)
+#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
+#else
+#define pg_attribute_target(...)
+#endif
+
/*
* Append PG_USED_FOR_ASSERTS_ONLY to definitions of variables that are only
* used in assert-enabled builds, to avoid compiler warnings about unused
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 850e927584..479aa08420 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -102,10 +102,8 @@ pgxs_kv = {
' '.join(cflags_no_missing_var_decls),
'CFLAGS_CRC': ' '.join(cflags_crc),
- 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
- 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -181,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
'DTRACEFLAGS', # only server has dtrace probes
'perl_archlibexp', 'perl_embed_ccflags', 'perl_embed_ldflags', 'perl_includespec', 'perl_privlibexp',
diff --git a/src/port/Makefile b/src/port/Makefile
index 9324ec2d9f..366c814bd9 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,13 +38,13 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
- $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
@@ -92,16 +92,6 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
-# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
-pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
-
-# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
-pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
-
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 1150966ab7..83a0632520 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -84,8 +85,6 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
- ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
- ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -100,8 +99,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
-pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
+pgport_cflags = {'crc': cflags_crc}
+pgport_sources_cflags = {'crc': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 9d3149e2d0..b598e86554 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -12,7 +12,17 @@
*/
#include "c.h"
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
#include "port/pg_bitutils.h"
@@ -21,12 +31,82 @@
* use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on
* the function pointers that are only used when TRY_POPCNT_FAST is set.
*/
-#ifdef TRY_POPCNT_FAST
+#if defined(TRY_POPCNT_FAST) && defined(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK)
+
+/*
+ * Does CPUID say there's support for XSAVE instructions?
+ */
+static inline bool
+xsave_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+ return (exx[2] & (1 << 27)) != 0; /* osxsave */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that xsave_available() returns true
+ * before calling this.
+ */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+ return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+static inline bool
+avx512_popcnt_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+ return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
+ (exx[1] & (1 << 30)) != 0; /* avx512-bw */
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ return xsave_available() &&
+ zmm_regs_available() &&
+ avx512_popcnt_available();
+}
/*
* pg_popcount_avx512
* Returns the number of 1-bits in buf
*/
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
uint64
pg_popcount_avx512(const char *buf, int bytes)
{
@@ -82,6 +162,7 @@ pg_popcount_avx512(const char *buf, int bytes)
* pg_popcount_masked_avx512
* Returns the number of 1-bits in buf after applying the mask to each byte
*/
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
uint64
pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
{
@@ -138,4 +219,5 @@ pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
return _mm512_reduce_add_epi64(accum);
}
-#endif /* TRY_POPCNT_FAST */
+#endif /* TRY_POPCNT_FAST &&
+ * USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
deleted file mode 100644
index b37107803a..0000000000
--- a/src/port/pg_popcount_avx512_choose.c
+++ /dev/null
@@ -1,102 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_popcount_avx512_choose.c
- * Test whether we can use the AVX-512 pg_popcount() implementation.
- *
- * Copyright (c) 2024, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- * src/port/pg_popcount_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-#include "c.h"
-
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE_XSAVE_INTRINSICS
-#include <immintrin.h>
-#endif
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
-#include "port/pg_bitutils.h"
-
-/*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
- * use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
- */
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
- return (exx[2] & (1 << 27)) != 0; /* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
- return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
- return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
- __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
- __cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
- return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
- (exx[1] & (1 << 30)) != 0; /* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
- return xsave_available() &&
- zmm_regs_available() &&
- avx512_popcnt_available();
-}
-
-#endif /* TRY_POPCNT_FAST */
--
2.39.5 (Apple Git-154)
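For readers skimming the patch, here is a minimal sketch of the technique it relies on: marking a single function with a target attribute so the rest of the file needs no extra -m flags, then choosing it at run time. This is not the patch's code. The function names are hypothetical, the attribute is written as one comma-separated string, and __builtin_cpu_supports() is a GCC/Clang builtin swapped in for the patch's explicit CPUID/XGETBV checks.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX512_SKETCH 1

/*
 * Only this function is compiled with AVX-512 enabled; no -mavx512* flags
 * are needed on the command line.  (Hypothetical name, illustration only.)
 */
__attribute__((target("avx512vpopcntdq,avx512bw")))
static uint64_t
popcount_avx512_sketch(const char *buf, size_t bytes)
{
    __m512i accum = _mm512_setzero_si512();

    /* Full 64-byte blocks, using unaligned loads. */
    for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
    {
        const __m512i val = _mm512_loadu_si512((const void *) buf);

        accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(val));
        buf += sizeof(__m512i);
    }

    /* Tail of 1..63 bytes: a masked load zeroes the lanes past the end. */
    if (bytes > 0)
    {
        const __mmask64 mask = (__mmask64) (~UINT64_C(0) >> (64 - bytes));
        const __m512i val = _mm512_maskz_loadu_epi8(mask, (const void *) buf);

        accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(val));
    }

    return (uint64_t) _mm512_reduce_add_epi64(accum);
}
#endif

/* Portable fallback: clear the lowest set bit until the byte is zero. */
static uint64_t
popcount_portable_sketch(const char *buf, size_t bytes)
{
    uint64_t    n = 0;

    while (bytes--)
    {
        unsigned char b = (unsigned char) *buf++;

        while (b)
        {
            b &= (unsigned char) (b - 1);
            n++;
        }
    }
    return n;
}

/* Runtime dispatch; the builtin stands in for the patch's CPUID checks. */
static uint64_t
popcount_sketch(const char *buf, size_t bytes)
{
#ifdef HAVE_AVX512_SKETCH
    if (__builtin_cpu_supports("avx512vpopcntdq") &&
        __builtin_cpu_supports("avx512bw"))
        return popcount_avx512_sketch(buf, bytes);
#endif
    return popcount_portable_sketch(buf, bytes);
}
```

Calling popcount_sketch() on a buffer returns the same count on any machine; only the code path differs. The AVX-512 function must never be reached on a CPU that lacks the instructions, which is why the check comes before the call.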
On Thu, Oct 31, 2024 at 07:58:06PM +0000, Devulapalli, Raghuveer wrote:
> LGTM.
Thanks. Barring additional feedback, I plan to commit this soon.
--
nathan
rebased
--
nathan
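A side note on the 0xe6 constant that zmm_regs_available() compares against the _xgetbv(0) result: it is the set of XCR0 state-component bits the OS must have enabled before ZMM registers are safe to use. The sketch below spells that mask out. The macro and function names are hypothetical and chosen for illustration; the bit positions are the ones documented for XCR0 in the Intel SDM.

```c
#include <stdbool.h>
#include <stdint.h>

/* XCR0 state-component bits (hypothetical macro names). */
#define XCR0_SSE        (UINT64_C(1) << 1)   /* XMM registers */
#define XCR0_AVX        (UINT64_C(1) << 2)   /* upper halves of YMM0-15 */
#define XCR0_OPMASK     (UINT64_C(1) << 5)   /* k0-k7 mask registers */
#define XCR0_ZMM_HI256  (UINT64_C(1) << 6)   /* upper halves of ZMM0-15 */
#define XCR0_HI16_ZMM   (UINT64_C(1) << 7)   /* ZMM16-31 */

/* All of the above together; evaluates to 0xe6. */
#define XCR0_AVX512_STATE \
    (XCR0_SSE | XCR0_AVX | XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/*
 * Given an XCR0 value (as returned by _xgetbv(0)), report whether the OS
 * saves and restores all register state that AVX-512 code touches.
 */
static bool
zmm_state_enabled(uint64_t xcr0)
{
    return (xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE;
}
```

Every bit has to be set, not just any of them, which is why the check is written as an equality against the mask rather than a simple non-zero test.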
Attachments:
v3-0001-use-__attribute__-target-.-for-AVX-512-stuff.patch (text/plain; charset=us-ascii)
From d0fb7e0e375f7b76d4df90910c21e9448dd3b380 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v3 1/1] use __attribute__((target(...))) for AVX-512 stuff
---
config/c-compiler.m4 | 64 +++++-----
configure | 167 +++++++--------------------
configure.ac | 17 +--
meson.build | 21 ++--
src/Makefile.global.in | 5 -
src/include/c.h | 10 ++
src/makefiles/meson.build | 4 +-
src/port/Makefile | 12 +-
src/port/meson.build | 7 +-
src/port/pg_popcount_avx512.c | 86 +++++++++++++-
src/port/pg_popcount_avx512_choose.c | 102 ----------------
11 files changed, 183 insertions(+), 312 deletions(-)
delete mode 100644 src/port/pg_popcount_avx512_choose.c
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a..c7eb896f14 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -700,20 +700,22 @@ undefine([Ac_cachevar])dnl
# Check if the compiler supports the XSAVE instructions using the _xgetbv
# intrinsic function.
#
-# An optional compiler flag can be passed as argument (e.g., -mxsave). If the
-# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+# If the intrinsics are supported, sets pgac_xsave_intrinsics.
AC_DEFUN([PGAC_XSAVE_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
- [return _xgetbv(0) & 0xe0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl
+AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+ #if defined(__has_attribute) && __has_attribute (target)
+ __attribute__((target("xsave")))
+ #endif
+ static int xsave_test(void)
+ {
+ return _xgetbv(0) & 0xe0;
+ }],
+ [return xsave_test();])],
[Ac_cachevar=yes],
- [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+ [Ac_cachevar=no])])
if test x"$Ac_cachevar" = x"yes"; then
- CFLAGS_XSAVE="$1"
pgac_xsave_intrinsics=yes
fi
undefine([Ac_cachevar])dnl
@@ -725,29 +727,29 @@ undefine([Ac_cachevar])dnl
# _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64,
# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
#
-# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq
-# -mavx512bw). If the intrinsics are supported, sets
-# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics.
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS $1"
-AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
- [const char buf@<:@sizeof(__m512i)@:>@;
- PG_INT64_TYPE popcnt = 0;
- __m512i accum = _mm512_setzero_si512();
- const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
- accum = _mm512_add_epi64(accum, cnt);
- popcnt = _mm512_reduce_add_epi64(accum);
- /* return computed value, to prevent the above being optimized away */
- return popcnt == 0;])],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
+ #if defined(__has_attribute) && __has_attribute (target)
+ __attribute__((target("avx512vpopcntdq","avx512bw")))
+ #endif
+ static int popcount_test(void)
+ {
+ const char buf@<:@sizeof(__m512i)@:>@;
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return (int) popcnt;
+ }],
+ [return popcount_test();])],
[Ac_cachevar=yes],
- [Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
+ [Ac_cachevar=no])])
if test x"$Ac_cachevar" = x"yes"; then
- CFLAGS_POPCNT="$1"
pgac_avx512_popcnt_intrinsics=yes
fi
undefine([Ac_cachevar])dnl
diff --git a/configure b/configure
index 6e256b417b..3a7332f834 100755
--- a/configure
+++ b/configure
@@ -647,9 +647,6 @@ MSGFMT_FLAGS
MSGFMT
PG_CRC32C_OBJS
CFLAGS_CRC
-PG_POPCNT_OBJS
-CFLAGS_POPCNT
-CFLAGS_XSAVE
LIBOBJS
OPENSSL
ZSTD
@@ -17272,185 +17269,103 @@ fi
# Check for XSAVE intrinsics
#
-CFLAGS_XSAVE=""
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=... " >&6; }
-if ${pgac_cv_xsave_intrinsics_+:} false; then :
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv" >&5
+$as_echo_n "checking for _xgetbv... " >&6; }
+if ${pgac_cv_xsave_intrinsics+:} false; then :
$as_echo_n "(cached) " >&6
else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h. */
-#include <immintrin.h>
-int
-main ()
-{
-return _xgetbv(0) & 0xe0;
- ;
- return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_xsave_intrinsics_=yes
-else
- pgac_cv_xsave_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
- conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics_" >&5
-$as_echo "$pgac_cv_xsave_intrinsics_" >&6; }
-if test x"$pgac_cv_xsave_intrinsics_" = x"yes"; then
- CFLAGS_XSAVE=""
- pgac_xsave_intrinsics=yes
-fi
-
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv with CFLAGS=-mxsave" >&5
-$as_echo_n "checking for _xgetbv with CFLAGS=-mxsave... " >&6; }
-if ${pgac_cv_xsave_intrinsics__mxsave+:} false; then :
- $as_echo_n "(cached) " >&6
-else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mxsave"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
#include <immintrin.h>
+ #if defined(__has_attribute) && __has_attribute (target)
+ __attribute__((target("xsave")))
+ #endif
+ static int xsave_test(void)
+ {
+ return _xgetbv(0) & 0xe0;
+ }
int
main ()
{
-return _xgetbv(0) & 0xe0;
+return xsave_test();
;
return 0;
}
_ACEOF
if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_xsave_intrinsics__mxsave=yes
+ pgac_cv_xsave_intrinsics=yes
else
- pgac_cv_xsave_intrinsics__mxsave=no
+ pgac_cv_xsave_intrinsics=no
fi
rm -f core conftest.err conftest.$ac_objext \
conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics__mxsave" >&5
-$as_echo "$pgac_cv_xsave_intrinsics__mxsave" >&6; }
-if test x"$pgac_cv_xsave_intrinsics__mxsave" = x"yes"; then
- CFLAGS_XSAVE="-mxsave"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_xsave_intrinsics" >&5
+$as_echo "$pgac_cv_xsave_intrinsics" >&6; }
+if test x"$pgac_cv_xsave_intrinsics" = x"yes"; then
pgac_xsave_intrinsics=yes
fi
-fi
if test x"$pgac_xsave_intrinsics" = x"yes"; then
$as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h
fi
-
# Check for AVX-512 popcount intrinsics
#
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
if test x"$host_cpu" = x"x86_64"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics_+:} false; then :
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64" >&5
+$as_echo_n "checking for _mm512_popcnt_epi64... " >&6; }
+if ${pgac_cv_avx512_popcnt_intrinsics+:} false; then :
$as_echo_n "(cached) " >&6
else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS "
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h. */
-#include <immintrin.h>
-int
-main ()
-{
-const char buf[sizeof(__m512i)];
- PG_INT64_TYPE popcnt = 0;
- __m512i accum = _mm512_setzero_si512();
- const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
- accum = _mm512_add_epi64(accum, cnt);
- popcnt = _mm512_reduce_add_epi64(accum);
- /* return computed value, to prevent the above being optimized away */
- return popcnt == 0;
- ;
- return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_avx512_popcnt_intrinsics_=yes
-else
- pgac_cv_avx512_popcnt_intrinsics_=no
-fi
-rm -f core conftest.err conftest.$ac_objext \
- conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics_" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics_" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics_" = x"yes"; then
- CFLAGS_POPCNT=""
- pgac_avx512_popcnt_intrinsics=yes
-fi
-
- if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw" >&5
-$as_echo_n "checking for _mm512_popcnt_epi64 with CFLAGS=-mavx512vpopcntdq -mavx512bw... " >&6; }
-if ${pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw+:} false; then :
- $as_echo_n "(cached) " >&6
-else
- pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512bw"
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
#include <immintrin.h>
+ #if defined(__has_attribute) && __has_attribute (target)
+ __attribute__((target("avx512vpopcntdq","avx512bw")))
+ #endif
+ static int popcount_test(void)
+ {
+ const char buf[sizeof(__m512i)];
+ PG_INT64_TYPE popcnt = 0;
+ __m512i accum = _mm512_setzero_si512();
+ const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
+ const __m512i cnt = _mm512_popcnt_epi64(val);
+ accum = _mm512_add_epi64(accum, cnt);
+ popcnt = _mm512_reduce_add_epi64(accum);
+ return (int) popcnt;
+ }
int
main ()
{
-const char buf[sizeof(__m512i)];
- PG_INT64_TYPE popcnt = 0;
- __m512i accum = _mm512_setzero_si512();
- const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf);
- const __m512i cnt = _mm512_popcnt_epi64(val);
- accum = _mm512_add_epi64(accum, cnt);
- popcnt = _mm512_reduce_add_epi64(accum);
- /* return computed value, to prevent the above being optimized away */
- return popcnt == 0;
+return popcount_test();
;
return 0;
}
_ACEOF
if ac_fn_c_try_link "$LINENO"; then :
- pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=yes
+ pgac_cv_avx512_popcnt_intrinsics=yes
else
- pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw=no
+ pgac_cv_avx512_popcnt_intrinsics=no
fi
rm -f core conftest.err conftest.$ac_objext \
conftest$ac_exeext conftest.$ac_ext
-CFLAGS="$pgac_save_CFLAGS"
fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&5
-$as_echo "$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" >&6; }
-if test x"$pgac_cv_avx512_popcnt_intrinsics__mavx512vpopcntdq__mavx512bw" = x"yes"; then
- CFLAGS_POPCNT="-mavx512vpopcntdq -mavx512bw"
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx512_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_avx512_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_avx512_popcnt_intrinsics" = x"yes"; then
pgac_avx512_popcnt_intrinsics=yes
fi
- fi
if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
- PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
$as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
fi
fi
-
-
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used
diff --git a/configure.ac b/configure.ac
index 3992694dac..e7f4f0fc22 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2050,32 +2050,19 @@ fi
# Check for XSAVE intrinsics
#
-CFLAGS_XSAVE=""
-PGAC_XSAVE_INTRINSICS([])
-if test x"$pgac_xsave_intrinsics" != x"yes"; then
- PGAC_XSAVE_INTRINSICS([-mxsave])
-fi
+PGAC_XSAVE_INTRINSICS()
if test x"$pgac_xsave_intrinsics" = x"yes"; then
AC_DEFINE(HAVE_XSAVE_INTRINSICS, 1, [Define to 1 if you have XSAVE intrinsics.])
fi
-AC_SUBST(CFLAGS_XSAVE)
# Check for AVX-512 popcount intrinsics
#
-CFLAGS_POPCNT=""
-PG_POPCNT_OBJS=""
if test x"$host_cpu" = x"x86_64"; then
- PGAC_AVX512_POPCNT_INTRINSICS([])
- if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then
- PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512bw])
- fi
+ PGAC_AVX512_POPCNT_INTRINSICS()
if test x"$pgac_avx512_popcnt_intrinsics" = x"yes"; then
- PG_POPCNT_OBJS="pg_popcount_avx512.o pg_popcount_avx512_choose.o"
AC_DEFINE(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX-512 popcount instructions with a runtime check.])
fi
fi
-AC_SUBST(CFLAGS_POPCNT)
-AC_SUBST(PG_POPCNT_OBJS)
# Check for Intel SSE 4.2 intrinsics to do CRC calculations.
#
diff --git a/meson.build b/meson.build
index 9a98f0c86a..9eddd72a27 100644
--- a/meson.build
+++ b/meson.build
@@ -2153,25 +2153,22 @@ endforeach
# Check for the availability of XSAVE intrinsics.
###############################################################
-cflags_xsave = []
if host_cpu == 'x86' or host_cpu == 'x86_64'
prog = '''
#include <immintrin.h>
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("xsave")))
+#endif
int main(void)
{
return _xgetbv(0) & 0xe0;
}
'''
- if cc.links(prog, name: 'XSAVE intrinsics without -mxsave',
- args: test_c_args)
- cdata.set('HAVE_XSAVE_INTRINSICS', 1)
- elif cc.links(prog, name: 'XSAVE intrinsics with -mxsave',
- args: test_c_args + ['-mxsave'])
+ if cc.links(prog, name: 'XSAVE intrinsics', args: test_c_args)
cdata.set('HAVE_XSAVE_INTRINSICS', 1)
- cflags_xsave += '-mxsave'
endif
endif
@@ -2181,12 +2178,14 @@ endif
# Check for the availability of AVX-512 popcount intrinsics.
###############################################################
-cflags_popcnt = []
if host_cpu == 'x86_64'
prog = '''
#include <immintrin.h>
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("avx512vpopcntdq","avx512bw")))
+#endif
int main(void)
{
const char buf[sizeof(__m512i)];
@@ -2201,13 +2200,9 @@ int main(void)
}
'''
- if cc.links(prog, name: 'AVX-512 popcount without -mavx512vpopcntdq -mavx512bw',
+ if cc.links(prog, name: 'AVX-512 popcount',
args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))])
cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
- elif cc.links(prog, name: 'AVX-512 popcount with -mavx512vpopcntdq -mavx512bw',
- args: test_c_args + ['-DINT64=@0@'.format(cdata.get('PG_INT64_TYPE'))] + ['-mavx512vpopcntdq'] + ['-mavx512bw'])
- cdata.set('USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 1)
- cflags_popcnt += ['-mavx512vpopcntdq'] + ['-mavx512bw']
endif
endif
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 4859343153..0f38d712d1 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,9 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
-CFLAGS_POPCNT = @CFLAGS_POPCNT@
CFLAGS_CRC = @CFLAGS_CRC@
-CFLAGS_XSAVE = @CFLAGS_XSAVE@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
PERMIT_MISSING_VARIABLE_DECLARATIONS = @PERMIT_MISSING_VARIABLE_DECLARATIONS@
CXXFLAGS = @CXXFLAGS@
@@ -772,9 +770,6 @@ LIBOBJS = @LIBOBJS@
# files needed for the chosen CRC-32C implementation
PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
-# files needed for the chosen popcount implementation
-PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
-
LIBS := -lpgcommon -lpgport $(LIBS)
# to make ws2_32.lib the last library
diff --git a/src/include/c.h b/src/include/c.h
index 55dec71a6d..6f5ca25542 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -174,6 +174,16 @@
#define pg_attribute_nonnull(...)
#endif
+/*
+ * pg_attribute_target allows specifying different target options that the
+ * function should be compiled with (e.g., for using special CPU instructions).
+ */
+#if __has_attribute (target)
+#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
+#else
+#define pg_attribute_target(...)
+#endif
+
/*
* Append PG_USED_FOR_ASSERTS_ONLY to definitions of variables that are only
* used in assert-enabled builds, to avoid compiler warnings about unused
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index e13938fe8a..aba7411a1b 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -102,10 +102,8 @@ pgxs_kv = {
' '.join(cflags_no_missing_var_decls),
'CFLAGS_CRC': ' '.join(cflags_crc),
- 'CFLAGS_POPCNT': ' '.join(cflags_popcnt),
'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
- 'CFLAGS_XSAVE': ' '.join(cflags_xsave),
'LDFLAGS': var_ldflags,
'LDFLAGS_EX': var_ldflags_ex,
@@ -181,7 +179,7 @@ pgxs_empty = [
'WANTED_LANGUAGES',
# Not needed because we don't build the server / PLs with the generated makefile
- 'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'TAS',
+ 'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
'PG_TEST_EXTRA',
'DTRACEFLAGS', # only server has dtrace probes
diff --git a/src/port/Makefile b/src/port/Makefile
index 9324ec2d9f..366c814bd9 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,13 +38,13 @@ LIBS += $(PTHREAD_LIBS)
OBJS = \
$(LIBOBJS) \
$(PG_CRC32C_OBJS) \
- $(PG_POPCNT_OBJS) \
bsearch_arg.o \
chklocale.o \
inet_net_ntop.o \
noblock.o \
path.o \
pg_bitutils.o \
+ pg_popcount_avx512.o \
pg_strong_random.o \
pgcheckdir.o \
pgmkdirp.o \
@@ -92,16 +92,6 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
-# all versions of pg_popcount_avx512_choose.o need CFLAGS_XSAVE
-pg_popcount_avx512_choose.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_shlib.o: CFLAGS+=$(CFLAGS_XSAVE)
-pg_popcount_avx512_choose_srv.o: CFLAGS+=$(CFLAGS_XSAVE)
-
-# all versions of pg_popcount_avx512.o need CFLAGS_POPCNT
-pg_popcount_avx512.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
-pg_popcount_avx512_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
-
#
# Shared library versions of object files
#
diff --git a/src/port/meson.build b/src/port/meson.build
index 1150966ab7..83a0632520 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
'noblock.c',
'path.c',
'pg_bitutils.c',
+ 'pg_popcount_avx512.c',
'pg_strong_random.c',
'pgcheckdir.c',
'pgmkdirp.c',
@@ -84,8 +85,6 @@ replace_funcs_pos = [
['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
- ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
- ['pg_popcount_avx512_choose', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'xsave'],
# arm / aarch64
['pg_crc32c_armv8', 'USE_ARMV8_CRC32C'],
@@ -100,8 +99,8 @@ replace_funcs_pos = [
['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
]
-pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt, 'xsave': cflags_xsave}
-pgport_sources_cflags = {'crc': [], 'popcnt': [], 'xsave': []}
+pgport_cflags = {'crc': cflags_crc}
+pgport_sources_cflags = {'crc': []}
foreach f : replace_funcs_neg
func = f.get(0)
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 9d3149e2d0..b598e86554 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -12,7 +12,17 @@
*/
#include "c.h"
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#include <cpuid.h>
+#endif
+
+#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
#include <immintrin.h>
+#endif
+
+#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#include <intrin.h>
+#endif
#include "port/pg_bitutils.h"
@@ -21,12 +31,82 @@
* use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on
* the function pointers that are only used when TRY_POPCNT_FAST is set.
*/
-#ifdef TRY_POPCNT_FAST
+#if defined(TRY_POPCNT_FAST) && defined(USE_AVX512_POPCNT_WITH_RUNTIME_CHECK)
+
+/*
+ * Does CPUID say there's support for XSAVE instructions?
+ */
+static inline bool
+xsave_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(exx, 1);
+#else
+#error cpuid instruction not available
+#endif
+ return (exx[2] & (1 << 27)) != 0; /* osxsave */
+}
+
+/*
+ * Does XGETBV say the ZMM registers are enabled?
+ *
+ * NB: Caller is responsible for verifying that xsave_available() returns true
+ * before calling this.
+ */
+#ifdef HAVE_XSAVE_INTRINSICS
+pg_attribute_target("xsave")
+#endif
+static inline bool
+zmm_regs_available(void)
+{
+#ifdef HAVE_XSAVE_INTRINSICS
+ return (_xgetbv(0) & 0xe6) == 0xe6;
+#else
+ return false;
+#endif
+}
+
+/*
+ * Does CPUID say there's support for AVX-512 popcount and byte-and-word
+ * instructions?
+ */
+static inline bool
+avx512_popcnt_available(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, 7, 0);
+#else
+#error cpuid instruction not available
+#endif
+ return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
+ (exx[1] & (1 << 30)) != 0; /* avx512-bw */
+}
+
+/*
+ * Returns true if the CPU supports the instructions required for the AVX-512
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_avx512_available(void)
+{
+ return xsave_available() &&
+ zmm_regs_available() &&
+ avx512_popcnt_available();
+}
/*
* pg_popcount_avx512
* Returns the number of 1-bits in buf
*/
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
uint64
pg_popcount_avx512(const char *buf, int bytes)
{
@@ -82,6 +162,7 @@ pg_popcount_avx512(const char *buf, int bytes)
* pg_popcount_masked_avx512
* Returns the number of 1-bits in buf after applying the mask to each byte
*/
+pg_attribute_target("avx512vpopcntdq", "avx512bw")
uint64
pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
{
@@ -138,4 +219,5 @@ pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
return _mm512_reduce_add_epi64(accum);
}
-#endif /* TRY_POPCNT_FAST */
+#endif /* TRY_POPCNT_FAST &&
+ * USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
diff --git a/src/port/pg_popcount_avx512_choose.c b/src/port/pg_popcount_avx512_choose.c
deleted file mode 100644
index b37107803a..0000000000
--- a/src/port/pg_popcount_avx512_choose.c
+++ /dev/null
@@ -1,102 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pg_popcount_avx512_choose.c
- * Test whether we can use the AVX-512 pg_popcount() implementation.
- *
- * Copyright (c) 2024, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- * src/port/pg_popcount_avx512_choose.c
- *
- *-------------------------------------------------------------------------
- */
-#include "c.h"
-
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
-#include <cpuid.h>
-#endif
-
-#ifdef HAVE_XSAVE_INTRINSICS
-#include <immintrin.h>
-#endif
-
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
-#include <intrin.h>
-#endif
-
-#include "port/pg_bitutils.h"
-
-/*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
- * use AVX-512 intrinsics, but we check it anyway to be sure. We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
- */
-#ifdef TRY_POPCNT_FAST
-
-/*
- * Does CPUID say there's support for XSAVE instructions?
- */
-static inline bool
-xsave_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID)
- __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUID)
- __cpuid(exx, 1);
-#else
-#error cpuid instruction not available
-#endif
- return (exx[2] & (1 << 27)) != 0; /* osxsave */
-}
-
-/*
- * Does XGETBV say the ZMM registers are enabled?
- *
- * NB: Caller is responsible for verifying that xsave_available() returns true
- * before calling this.
- */
-static inline bool
-zmm_regs_available(void)
-{
-#ifdef HAVE_XSAVE_INTRINSICS
- return (_xgetbv(0) & 0xe6) == 0xe6;
-#else
- return false;
-#endif
-}
-
-/*
- * Does CPUID say there's support for AVX-512 popcount and byte-and-word
- * instructions?
- */
-static inline bool
-avx512_popcnt_available(void)
-{
- unsigned int exx[4] = {0, 0, 0, 0};
-
-#if defined(HAVE__GET_CPUID_COUNT)
- __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
-#elif defined(HAVE__CPUIDEX)
- __cpuidex(exx, 7, 0);
-#else
-#error cpuid instruction not available
-#endif
- return (exx[2] & (1 << 14)) != 0 && /* avx512-vpopcntdq */
- (exx[1] & (1 << 30)) != 0; /* avx512-bw */
-}
-
-/*
- * Returns true if the CPU supports the instructions required for the AVX-512
- * pg_popcount() implementation.
- */
-bool
-pg_popcount_avx512_available(void)
-{
- return xsave_available() &&
- zmm_regs_available() &&
- avx512_popcnt_available();
-}
-
-#endif /* TRY_POPCNT_FAST */
--
2.39.5 (Apple Git-154)
Hi,
On 2024-11-06 20:26:47 -0600, Nathan Bossart wrote:
From d0fb7e0e375f7b76d4df90910c21e9448dd3b380 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 16 Oct 2024 15:57:55 -0500
Subject: [PATCH v3 1/1] use __attribute__((target(...))) for AVX-512 stuff
One thing that I'd like to see this being used for is to elide the indirection
when the current target platform *already* supports the necessary
intrinsics. Adding a bunch of indirection for short & common operations is
decidedly not great. It doesn't have to be part of the same commit, but it
seems like it's worth doing as part of the same series, as I think it'll lead
to rather different looking configure checks.
diff --git a/src/include/c.h b/src/include/c.h
index 55dec71a6d..6f5ca25542 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -174,6 +174,16 @@
#define pg_attribute_nonnull(...)
#endif
+/*
+ * pg_attribute_target allows specifying different target options that the
+ * function should be compiled with (e.g., for using special CPU instructions).
+ */
+#if __has_attribute (target)
+#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
+#else
+#define pg_attribute_target(...)
+#endif
I think it'd be good to mention that there still needs to be a configure check to
verify that the specific target attribute is understood by the compiler.
Greetings,
Andres Freund
On Thu, Nov 07, 2024 at 11:12:37AM -0500, Andres Freund wrote:
One thing that I'd like to see this being used for is to elide the indirection
when the current target platform *already* supports the necessary
intrinsics. Adding a bunch of indirection for short & common operations is
decidedly not great. It doesn't have to be part of the same commit, but it
seems like it's worth doing as part of the same series, as I think it'll lead
to rather different looking configure checks.
The main hurdle, at least for AVX-512, is that we still need to check (at
runtime) whether the OS supports XGETBV and whether the ZMM registers are
fully enabled. We might be able to skip those checks in limited cases
(e.g., you are building on the target machine and can perhaps just check it
once at build time), but that probably won't help packagers.
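To make the XGETBV part of that check concrete, here is a minimal sketch of how the 0xe6 mask from the patch decomposes into XCR0 state-component bits. The constant names and xcr0_enables_zmm() are hypothetical and not part of PostgreSQL. The function takes the XCR0 value as a parameter (in the patch it comes from _xgetbv(0), and only after CPUID reports OSXSAVE), so the logic can be tried on any machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* XCR0 state-component bits (names are illustrative, not PostgreSQL's). */
#define XCR0_SSE        (UINT64_C(1) << 1)	/* XMM registers */
#define XCR0_AVX        (UINT64_C(1) << 2)	/* upper halves of YMM registers */
#define XCR0_OPMASK     (UINT64_C(1) << 5)	/* k0-k7 mask registers */
#define XCR0_ZMM_HI256  (UINT64_C(1) << 6)	/* upper halves of ZMM0-ZMM15 */
#define XCR0_HI16_ZMM   (UINT64_C(1) << 7)	/* ZMM16-ZMM31 */

#define XCR0_AVX512_REQUIRED \
	(XCR0_SSE | XCR0_AVX | XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/* The named bits add up to the literal mask used in the patch. */
static_assert(XCR0_AVX512_REQUIRED == 0xe6, "mask must equal 0xe6");

/*
 * Returns true only if the OS has enabled every register state that
 * AVX-512 code touches.  A CPU can advertise AVX-512 in CPUID while the
 * OS leaves some of these bits clear, which is why this cannot be
 * decided at build time for binaries that run on other machines.
 */
bool
xcr0_enables_zmm(uint64_t xcr0)
{
	return (xcr0 & XCR0_AVX512_REQUIRED) == XCR0_AVX512_REQUIRED;
}
```

With this decomposition, an XCR0 value of 0x07 (x87, SSE, and AVX state only) fails the test even if CPUID advertises avx512-vpopcntdq.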
+/*
+ * pg_attribute_target allows specifying different target options that the
+ * function should be compiled with (e.g., for using special CPU instructions).
+ */
+#if __has_attribute (target)
+#define pg_attribute_target(...) __attribute__((target(__VA_ARGS__)))
+#else
+#define pg_attribute_target(...)
+#endif
I think it'd be good to mention that there still needs to be a configure check to
verify that the specific target attribute is understood by the compiler.
Will do.
--
nathan
On Thu, Nov 07, 2024 at 02:03:04PM -0600, Nathan Bossart wrote:
Committed.
Of course, as soon as I committed this, I noticed that it's broken. It
seems that compilers are rather picky about how multiple target options are
specified. AFAICT the commonly supported syntax is to put the entire list
within one pair of quotes and to use only commas as delimiters, i.e., no
spaces. I plan to commit the attached shortly once I've had a chance to
verify it fixes the problem on cfbot.
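As a standalone illustration of the syntax described above, the following is a rough sketch and not the committed code. The names popcount_avx512_sketch() and popcount_sketch() are hypothetical, and the runtime dispatch uses the GCC/Clang builtin __builtin_cpu_supports() as a stand-in for the CPUID and XGETBV checks in pg_popcount_avx512.c. The point of interest is the single quoted, comma-separated target string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX512 1

/*
 * All target options go in ONE string, separated by commas with no spaces.
 * The two-string form target("avx512vpopcntdq", "avx512bw") is the one
 * that Clang complained about in this thread.
 */
__attribute__((target("avx512vpopcntdq,avx512bw")))
static uint64_t
popcount_avx512_sketch(const char *buf, size_t bytes)
{
	__m512i		accum = _mm512_setzero_si512();

	/* Count 64 bytes at a time using unaligned loads. */
	while (bytes >= sizeof(__m512i))
	{
		__m512i		val = _mm512_loadu_si512((const void *) buf);

		accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(val));
		buf += sizeof(__m512i);
		bytes -= sizeof(__m512i);
	}

	/* A masked byte load (AVX-512BW) covers the remaining 1..63 bytes. */
	if (bytes > 0)
	{
		__mmask64	mask = ~UINT64_C(0) >> (64 - bytes);
		__m512i		val = _mm512_maskz_loadu_epi8(mask, (const void *) buf);

		accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(val));
	}

	return (uint64_t) _mm512_reduce_add_epi64(accum);
}
#endif

/* Hypothetical entry point: AVX-512 when usable, portable loop otherwise. */
uint64_t
popcount_sketch(const char *buf, size_t bytes)
{
	uint64_t	count = 0;

	assert(buf != NULL || bytes == 0);

#ifdef SKETCH_HAVE_AVX512
	if (__builtin_cpu_supports("avx512vpopcntdq") &&
		__builtin_cpu_supports("avx512bw"))
		return popcount_avx512_sketch(buf, bytes);
#endif

	/* Portable fallback: clear the lowest set bit until none remain. */
	for (size_t i = 0; i < bytes; i++)
	{
		unsigned char b = (unsigned char) buf[i];

		for (; b != 0; b &= (unsigned char) (b - 1))
			count++;
	}
	return count;
}
```

Because only the attributed function is compiled with the AVX-512 options, the rest of the file builds with the default flags, which is the property the pg_attribute_target() approach is after.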
--
nathan
Attachments:
0001-fix-__attribute__-target-.-usage.patch (text/plain; charset=us-ascii)
From a5ade18a867377fa424347465bbc5f631eff4f96 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 7 Nov 2024 14:28:57 -0600
Subject: [PATCH 1/1] fix __attribute__((target(...))) usage
---
config/c-compiler.m4 | 2 +-
configure | 2 +-
meson.build | 2 +-
src/port/pg_popcount_avx512.c | 4 ++--
4 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index c7eb896f14..a129edb88e 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -733,7 +733,7 @@ AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar],
[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>
#if defined(__has_attribute) && __has_attribute (target)
- __attribute__((target("avx512vpopcntdq","avx512bw")))
+ __attribute__((target("avx512vpopcntdq,avx512bw")))
#endif
static int popcount_test(void)
{
diff --git a/configure b/configure
index 3a7332f834..4b01b682b1 100755
--- a/configure
+++ b/configure
@@ -17324,7 +17324,7 @@ else
/* end confdefs.h. */
#include <immintrin.h>
#if defined(__has_attribute) && __has_attribute (target)
- __attribute__((target("avx512vpopcntdq","avx512bw")))
+ __attribute__((target("avx512vpopcntdq,avx512bw")))
#endif
static int popcount_test(void)
{
diff --git a/meson.build b/meson.build
index 9eddd72a27..5b0510cef7 100644
--- a/meson.build
+++ b/meson.build
@@ -2184,7 +2184,7 @@ if host_cpu == 'x86_64'
#include <immintrin.h>
#if defined(__has_attribute) && __has_attribute (target)
-__attribute__((target("avx512vpopcntdq","avx512bw")))
+__attribute__((target("avx512vpopcntdq,avx512bw")))
#endif
int main(void)
{
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index b598e86554..1ab2847bf2 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -106,7 +106,7 @@ pg_popcount_avx512_available(void)
* pg_popcount_avx512
* Returns the number of 1-bits in buf
*/
-pg_attribute_target("avx512vpopcntdq", "avx512bw")
+pg_attribute_target("avx512vpopcntdq,avx512bw")
uint64
pg_popcount_avx512(const char *buf, int bytes)
{
@@ -162,7 +162,7 @@ pg_popcount_avx512(const char *buf, int bytes)
* pg_popcount_masked_avx512
* Returns the number of 1-bits in buf after applying the mask to each byte
*/
-pg_attribute_target("avx512vpopcntdq", "avx512bw")
+pg_attribute_target("avx512vpopcntdq,avx512bw")
uint64
pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
{
--
2.39.5 (Apple Git-154)
Of course, as soon as I committed this, I noticed that it's broken. It seems that
compilers are rather picky about how multiple target options are specified.
Just curious, which compiler complained?
Raghuveer
On Thu, Nov 07, 2024 at 08:38:21PM +0000, Devulapalli, Raghuveer wrote:
Of course, as soon as I committed this, I noticed that it's broken. It seems that
compilers are rather picky about how multiple target options are specified.
Just curious, which compiler complained?
Clang.
--
nathan