Popcount optimization using AVX512

Started by Amonson, Paul D over 2 years ago · 173 messages
#1 Amonson, Paul D
paul.d.amonson@intel.com

This proposal showcases the speed-up provided to the popcount feature when using AVX512 registers. The intent is to share the preliminary results with the community and get feedback on adding AVX512 support for popcount.
 
Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on PostgreSQL's current pg_popcount() implementations for x86_64, using the newer AVX512 intrinsics. This implementation has improved performance by up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.
 
My setup:
 
Machine: AWS EC2 m7i - 16 vCPU, 64 GB RAM
OS: Ubuntu 22.04
GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".

1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
a. software-only version, and
b. 64-bit SSE version
2. I created an implementation using the following AVX512 intrinsics:
a. _mm512_popcnt_epi64()
b. _mm512_reduce_add_epi64()
3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])
4. I tested 5 seeds for each input buffer size and averaged 100 runs each (5*5*100=2500 pg_popcount() calls on a single thread)
5. Data: <See Attached picture.>
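For reference, steps 3-4 above could be realized with a harness along these lines (a sketch only: the original is a C++ project using std::mt19937_64; here splitmix64 stands in as the seeded RNG, and bench_popcount/count_bits are hypothetical names):

```c
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* splitmix64 PRNG: stand-in for the std::mt19937_64 used in the
 * original C++ harness; deterministic for a given seed. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Reference bit count over a word buffer (pg_popcount stand-in). */
static uint64_t count_bits(const uint64_t *buf, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t) __builtin_popcountll(buf[i]);
    return total;
}

/* Fill a seeded buffer, run `runs` timed repetitions, and return elapsed
 * seconds plus a checksum (so the calls cannot be optimized away). */
double bench_popcount(uint64_t seed, size_t nwords, int runs, uint64_t *checksum)
{
    static uint64_t buf[1u << 14];      /* 128 KiB scratch for the sketch */
    if (nwords > (1u << 14))
        nwords = 1u << 14;
    uint64_t state = seed;
    for (size_t i = 0; i < nwords; i++)
        buf[i] = splitmix64(&state);

    clock_t start = clock();
    uint64_t total = 0;
    for (int r = 0; r < runs; r++)
        total += count_bits(buf, nwords);
    *checksum = total / (uint64_t) runs;
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Averaging multiple seeds and runs, as in step 4, then just wraps this in two loops.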

The code I wrote falls back to the 64-bit or software implementation for memory that is not aligned to a 512-bit boundary:
 
///////////////////////////////////////////////////////////////////////
// 512-bit intrinsic implementation (AVX512VPOPCNTDQ + AVX512F)
uint64_t popcount_512_impl(const char *bytes, uint64_t byteCount) {
#ifdef __AVX__
    uint64_t result = 0;
    // Use the 64-bit path for leading bytes up to the next 64-byte
    // (512-bit) boundary.
    uint64_t remainder = (64 - ((uintptr_t) bytes) % 64) % 64;
    if (remainder > byteCount)
        remainder = byteCount;
    result += popcount_64_impl(bytes, remainder);
    byteCount -= remainder;
    bytes += remainder;
    uint64_t vectorCount = byteCount / 64;
    remainder = byteCount % 64;
    const __m512i *vectors = (const __m512i *) bytes;
    __m512i rv;
    while (vectorCount--) {
        rv = _mm512_popcnt_epi64(*(vectors++));
        result += _mm512_reduce_add_epi64(rv);
    }
    // Use the 64-bit path for any trailing bytes.
    bytes = (const char *) vectors;
    result += popcount_64_impl(bytes, remainder);
    return result;
#else
    return popcount_64_impl(bytes, byteCount);
#endif
}
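The 64-bit helper popcount_64_impl() referenced above is not included in the post; a minimal portable reconstruction (mine, not the author's code) could look like:

```c
#include <stdint.h>
#include <string.h>

/* Reconstruction for illustration only. Counts bits one 64-bit word at
 * a time, then falls back to byte-at-a-time for the tail. */
uint64_t popcount_64_impl(const char *bytes, uint64_t byteCount)
{
    uint64_t result = 0;
    uint64_t word;
    while (byteCount >= 8) {
        memcpy(&word, bytes, 8);   /* safe for any alignment */
        result += (uint64_t) __builtin_popcountll(word);
        bytes += 8;
        byteCount -= 8;
    }
    while (byteCount--)            /* trailing bytes */
        result += (uint64_t) __builtin_popcountll((unsigned char) *bytes++);
    return result;
}
```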
 
There are further optimizations that can be applied here, but for demonstration I guarded the code with the __AVX__ macro; when it is not defined, the code falls back to the original PostgreSQL implementations.
 
The 46% improvement in popcount seems worthy of discussion, considering the existing 64-bit SSE and software implementations.
 
 Thanks,
Paul Amonson

Attachments:

AVX512 Popcount Benefits.png (image/png)
#2 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Amonson, Paul D (#1)
Re: Popcount optimization using AVX512

On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:

This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share the preliminary results with the community and get feedback for adding avx512 support for popcount.

Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the pg_popcount() in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this implementation has improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.

How does this compare to older CPUs, and more mixed workloads? IIRC,
the use of AVX512 (which I believe this instruction to be included in)
has significant implications for core clock frequency when those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

My setup:

Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
OS : Ubuntu 22.04
GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".

1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
a. Software only and
b. SSE 64 bit version
2. I created an implementation using the following AVX512 intrinsics:
a. _mm512_popcnt_epi64()
b. _mm512_reduce_add_epi64()
3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])

Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#3 Nathan Bossart
nathandbossart@gmail.com
In reply to: Matthias van de Meent (#2)
Re: Popcount optimization using AVX512

On Fri, Nov 03, 2023 at 12:16:05PM +0100, Matthias van de Meent wrote:

On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:

This proposal showcases the speed-up provided to popcount feature when
using AVX512 registers. The intent is to share the preliminary results
with the community and get feedback for adding avx512 support for
popcount.

Revisiting the previous discussion/improvements around this feature, I
have created a micro-benchmark based on the pg_popcount() in
PostgreSQL's current implementations for x86_64 using the newer AVX512
intrinsics. Playing with this implementation has improved performance up
to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will
benefit scenarios relying on popcount.

Nice. I've been testing out AVX2 support in src/include/port/simd.h, and
the results look promising there, too. I intend to start a new thread for
that (hopefully soon), but one open question I don't have a great answer
for yet is how to detect support for newer intrinsics. So far, we've been
able to use function pointers (e.g., popcount, crc32c) or deduce support
via common predefined compiler macros (e.g., we assume SSE2 is supported if
the compiler is targeting 64-bit x86). But the former introduces a
performance penalty, and we probably want to inline most of this stuff,
anyway. And the latter limits us to stuff that has been around for a
decade or two.
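For readers following along, the function-pointer scheme described above (as used today for popcount and crc32c) can be sketched roughly like this; names are hypothetical, and the CPU check uses GCC's __builtin_cpu_supports rather than PostgreSQL's cpuid code for brevity:

```c
#include <stdint.h>

/* Portable fallback: shift-and-add. */
static uint64_t popcount64_slow(uint64_t word)
{
    uint64_t count = 0;
    while (word) {
        count += word & 1;
        word >>= 1;
    }
    return count;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Hardware POPCNT via compiler builtin. */
static uint64_t popcount64_fast(uint64_t word)
{
    return (uint64_t) __builtin_popcountll(word);
}
#endif

static uint64_t popcount64_choose(uint64_t word);

/* The first call lands in the chooser, which repoints the pointer;
 * every later call pays only the indirect-call overhead. */
static uint64_t (*pg_popcount64)(uint64_t) = popcount64_choose;

static uint64_t popcount64_choose(uint64_t word)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("popcnt"))
        pg_popcount64 = popcount64_fast;
    else
#endif
        pg_popcount64 = popcount64_slow;
    return pg_popcount64(word);
}
```

The indirect call is exactly the per-call overhead being debated in this thread.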

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

Yeah, the previous optimizations in this area appear to have used ANALYZE
as the benchmark, presumably because of visibilitymap_count(). I briefly
attempted to measure the difference with and without AVX512 support, but I
haven't noticed any difference thus far. One complication for
visibilitymap_count() is that the data passed to pg_popcount64() is masked,
which requires a couple more instructions when you're using the intrinsics.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#3)
Re: Popcount optimization using AVX512

Nathan Bossart <nathandbossart@gmail.com> writes:

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization. (One
of the key things you want to be able to say, when trying to sell
a maybe-it-helps-or-maybe-it-doesnt optimization to the PG community,
is "it doesn't hurt anyone who's not able to benefit".) And you
can't argue that that overhead is negligible either, because if it
is then we're all wasting our time even discussing this. So we need
a better technology, and I fear I have no good ideas about what.

Your comment about vectorization hints at one answer: if you can
amortize the overhead across multiple applications of the operation,
then it doesn't hurt so much. But I'm not sure how often we can
make that answer work.

regards, tom lane

#5 Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#4)
Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization.

The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
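As a sketch of what such an ifunc could look like (hypothetical names; GCC on glibc/ELF assumed, and x86-only because of __builtin_cpu_supports):

```c
#include <stdint.h>

/* Portable fallback: clear-lowest-set-bit loop. */
static uint64_t my_popcount_generic(uint64_t x)
{
    uint64_t n = 0;
    for (; x; x &= x - 1)
        n++;
    return n;
}

/* Hardware POPCNT via compiler builtin. */
static uint64_t my_popcount_popcnt(uint64_t x)
{
    return (uint64_t) __builtin_popcountll(x);
}

/* Resolver runs once at load time; the dynamic linker then binds
 * my_popcount directly to the chosen function, so callers pay no
 * function-pointer indirection. */
static uint64_t (*resolve_my_popcount(void))(uint64_t)
{
    __builtin_cpu_init();   /* required before CPU checks in a resolver */
    if (__builtin_cpu_supports("popcnt"))
        return my_popcount_popcnt;
    return my_popcount_generic;
}

uint64_t my_popcount(uint64_t) __attribute__((ifunc("resolve_my_popcount")));
```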

#6 Nathan Bossart
nathandbossart@gmail.com
In reply to: Noah Misch (#5)
Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:

On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization.

The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html

Thanks, that seems promising for the function pointer cases. I'll plan on
trying to convert one of the existing ones to use it. BTW it looks like
LLVM has something similar [0].

IIUC this unfortunately wouldn't help for cases where we wanted to keep
stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
unless we applied it to the calling functions, but that doesn't sound
particularly maintainable.

[0]: https://llvm.org/docs/LangRef.html#ifuncs

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#7 Noah Misch
noah@leadboat.com
In reply to: Nathan Bossart (#6)
Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:

On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:

On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization.

The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html

Thanks, that seems promising for the function pointer cases. I'll plan on
trying to convert one of the existing ones to use it. BTW it looks like
LLVM has something similar [0].

IIUC this unfortunately wouldn't help for cases where we wanted to keep
stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
unless we applied it to the calling functions, but that doesn't sound
particularly maintainable.

Agreed, it doesn't solve inline cases. If the gains are big enough, we should
move toward packages containing N CPU-specialized copies of the postgres
binary, with bin/postgres just exec'ing the right one.
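As a rough illustration of that launcher idea (purely hypothetical; choose_binary and the binary names are made up), the dispatch step might look like:

```c
#include <string.h>

/* Hypothetical sketch: a thin launcher picks a CPU-specialized binary
 * name at startup; the real wrapper would then execv() that path. */
const char *choose_binary(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        return "postgres.avx512";
#endif
    return "postgres.generic";
}
```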

#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Noah Misch (#7)
Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:

On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:

On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:

The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html

Thanks, that seems promising for the function pointer cases. I'll plan on
trying to convert one of the existing ones to use it. BTW it looks like
LLVM has something similar [0].

IIUC this unfortunately wouldn't help for cases where we wanted to keep
stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
unless we applied it to the calling functions, but that doesn't ѕound
particularly maintainable.

Agreed, it doesn't solve inline cases. If the gains are big enough, we should
move toward packages containing N CPU-specialized copies of the postgres
binary, with bin/postgres just exec'ing the right one.

I performed a quick test with ifunc on my x86 machine that ordinarily uses
the runtime checks for the CRC32C code, and I actually see a consistent
3.5% regression for pg_waldump -z on 100M 65-byte records. I've attached
the patch used for testing.

The multiple-copies-of-the-postgres-binary idea seems interesting. That's
probably not something that could be enabled by default, but perhaps we
could add support for a build option.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

ifunc_test.patch (text/x-diff) +7-7
#9 Shankaran, Akash
akash.shankaran@intel.com
In reply to: Nathan Bossart (#8)
RE: Popcount optimization using AVX512

Sorry for the late response. We spent some time researching and measuring the frequency impact of the AVX512 instructions used in this work.

How does this compare to older CPUs, and more mixed workloads? IIRC,

the use of AVX512 (which I believe this instruction to be included in)
has significant implications for core clock frequency when those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

AVX512 has light and heavy instructions. While the heavy AVX512 instructions have clock frequency implications, the light instructions not so much; see [0] for more details. We captured EMON data for the benchmark used in this work and see that the instructions run at a licensing level not meant for heavy AVX512 operations. This means the popcount instructions _mm512_popcnt_epi64() and _mm512_reduce_add_epi64() are not going to have any significant impact on CPU clock frequency.
Clock frequency impact aside, we measured the same benchmark on older Intel hardware and observe up to 18% better performance on Intel Icelake. On still older Intel hardware, the 512-bit VPOPCNTDQ instruction is not present, so the new path won't run there. If clock frequency is not affected, the rest of the workload should not be impacted in the case of mixed workloads.

Apart from the two type functions bytea_bit_count and bit_bit_count

(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

Testing on smaller block sizes (< 8 KiB) shows that AVX512 has slightly lower performance than the current 64-bit behavior, but with a large variance; we cannot conclude much from it. The testing with the ANALYZE benchmark by Nathan also points to no visible impact from using AVX512. The gains on larger datasets are easily evident, with less variance.
What are your thoughts if we introduce AVX512 popcount for smaller sizes as an optional feature initially, and then test it more thoroughly over time on this particular use case?

Regarding enablement: following the other responses on function inlining, ifunc, and enabling future intrinsic support, it seems a concrete solution requires further discussion. We are attaching a patch to enable AVX512, which can apply AVX512 flags during the build. For example:

make -E CFLAGS_AVX512="-mavx -mavx512dq -mavx512vpopcntdq -mavx512vl -march=icelake-server -DAVX512_POPCNT=1"

Thoughts or feedback on the approach in the patch? This solution should not impact anyone who doesn’t use the feature i.e. AVX512. Open to additional ideas if this doesn’t seem like the right approach here.

[0]: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Tuesday, November 7, 2023 12:15 PM
To: Noah Misch <noah@leadboat.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; Amonson, Paul D <paul.d.amonson@intel.com>; pgsql-hackers@lists.postgresql.org; Shankaran, Akash <akash.shankaran@intel.com>
Subject: Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:

On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:

On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:

The glibc/gcc "ifunc" mechanism was designed to solve this problem
of choosing a function implementation based on the runtime CPU,
without incurring function pointer overhead. I would not attempt
to use AVX512 on non-glibc systems, and I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html

Thanks, that seems promising for the function pointer cases. I'll
plan on trying to convert one of the existing ones to use it. BTW it
looks like LLVM has something similar [0].

IIUC this unfortunately wouldn't help for cases where we wanted to
keep stuff inlined, such as is_valid_ascii() and the functions in
pg_lfind.h, unless we applied it to the calling functions, but that
doesn't sound particularly maintainable.

Agreed, it doesn't solve inline cases. If the gains are big enough,
we should move toward packages containing N CPU-specialized copies of
the postgres binary, with bin/postgres just exec'ing the right one.

I performed a quick test with ifunc on my x86 machine that ordinarily uses the runtime checks for the CRC32C code, and I actually see a consistent 3.5% regression for pg_waldump -z on 100M 65-byte records. I've attached the patch used for testing.

The multiple-copies-of-the-postgres-binary idea seems interesting. That's probably not something that could be enabled by default, but perhaps we could add support for a build option.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

proposed_popcnt.patch (application/octet-stream) +95-28
#10 Nathan Bossart
nathandbossart@gmail.com
In reply to: Shankaran, Akash (#9)
Re: Popcount optimization using AVX512

On Wed, Nov 15, 2023 at 08:27:57PM +0000, Shankaran, Akash wrote:

AVX512 has light and heavy instructions. While the heavy AVX512
instructions have clock frequency implications, the light instructions
not so much. See [0] for more details. We captured EMON data for the
benchmark used in this work, and see that the instructions are using the
licensing level not meant for heavy AVX512 operations. This means the
instructions for popcount : _mm512_popcnt_epi64(),
_mm512_reduce_add_epi64() are not going to have any significant impact on
CPU clock frequency.

Clock frequency impact aside, we measured the same benchmark for gains on
older Intel hardware and observe up to 18% better performance on Intel
Icelake. On older intel hardware, the popcntdq 512 instruction is not
present so it won’t work. If clock frequency is not affected, rest of
workload should not be impacted in the case of mixed workloads.

Thanks for sharing your analysis.

Testing this on smaller block sizes < 8KiB shows that AVX512 compared to
the current 64bit behavior shows slightly lower performance, but with a
large variance. We cannot conclude much from it. The testing with ANALYZE
benchmark by Nathan also points to no visible impact as a result of using
AVX512. The gains on larger dataset is easily evident, with less
variance.

What are your thoughts if we introduce AVX512 popcount for smaller sizes
as an optional feature initially, and then test it more thoroughly over
time on this particular use case?

I don't see any need to rush this. At the very earliest, this feature
would go into v17, which doesn't enter feature freeze until April 2024.
That seems like enough time to complete any additional testing you'd like
to do. However, if you are seeing worse performance with this patch, then
it seems unlikely that we'd want to proceed.

Thoughts or feedback on the approach in the patch? This solution should
not impact anyone who doesn’t use the feature i.e. AVX512. Open to
additional ideas if this doesn’t seem like the right approach here.

It's true that it wouldn't impact anyone not using the feature, but there's
also a decent chance that this code goes virtually untested. As I've
stated elsewhere [0], I think we should ensure there's buildfarm coverage
for this kind of architecture-specific stuff.

[0]: /messages/by-id/20230726043707.GB3211130@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#11 Shankaran, Akash
akash.shankaran@intel.com
In reply to: Nathan Bossart (#10)
RE: Popcount optimization using AVX512

Sorry for the late response. We did some further testing and research on our end, and ended up modifying the AVX512-based popcount algorithm. We removed a scalar dependency and now accumulate the results of the popcnt instruction in a zmm register, only performing the reduce-add at the very end, similar to [0].

With the updated patch, we observed significant improvements that handily beat the previous popcount algorithm's performance. No regressions were observed in any scenario:
Platform: Intel Xeon Platinum 8360Y (Icelake), data sizes 1 KiB - 64 KiB.
Microbenchmark: 2x-3x gains now vs. 19% previously, on the same microbenchmark described initially in this thread.
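The accumulate-then-reduce idea described above can be sketched as follows (my illustration, not the attached patch; popcount_accum is a hypothetical name, and without AVX512 build flags the code falls back to a scalar loop):

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Accumulate per-lane popcounts in a zmm register and reduce once at
 * the end, removing the per-iteration scalar dependency. */
uint64_t popcount_accum(const uint64_t *words, size_t n)
{
    uint64_t result = 0;
    size_t i = 0;
#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512F__)
    __m512i accum = _mm512_setzero_si512();
    for (; i + 8 <= n; i += 8)
    {
        __m512i v = _mm512_loadu_si512((const void *) &words[i]);
        accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(v));
    }
    result = (uint64_t) _mm512_reduce_add_epi64(accum);
#endif
    for (; i < n; i++)      /* tail, or the full buffer without AVX512 */
        result += (uint64_t) __builtin_popcountll(words[i]);
    return result;
}
```

The earlier version reduced each vector to a scalar inside the loop, which serializes on the scalar add; keeping the running sum in a vector register lets the adds pipeline.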

PG testing:
SQL bit_count() calls popcount. Using a Postgres benchmark calling "select bit_count(bytea(col1)) from mytable" on a table with ~2M text rows, each row 1-12 KiB in size, we observe (comparing only with the 64-bit PG implementation, which is the fastest):

1. The entire benchmark runs 6-13% faster with the AVX512 implementation than with the PG 64-bit implementation.
2. Reduced time spent in the pg_popcount() method in the postgres server during the benchmark:
o 64-bit (current PG): 29.5%
o AVX512: 3.3%
3. Reduced number of profile samples attributed to popcount:
o 64-bit (current PG): 2.4B samples
o AVX512: 285M samples

Compile the above patch (on a machine supporting AVX512 VPOPCNTDQ) using: make all CFLAGS_AVX512="-DHAVE__HW_AVX512_POPCNT -mavx -mavx512vpopcntdq -mavx512f -march=native"
Attaching flamegraphs and the patch for the above observations.

[0]: /messages/by-id/20230726043707.GB3211130@nathanxps13

Thanks,
Akash Shankaran

-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Wednesday, November 15, 2023 1:49 PM
To: Shankaran, Akash <akash.shankaran@intel.com>
Cc: Noah Misch <noah@leadboat.com>; Amonson, Paul D <paul.d.amonson@intel.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

On Wed, Nov 15, 2023 at 08:27:57PM +0000, Shankaran, Akash wrote:

AVX512 has light and heavy instructions. While the heavy AVX512
instructions have clock frequency implications, the light instructions
not so much. See [0] for more details. We captured EMON data for the
benchmark used in this work, and see that the instructions are using
the licensing level not meant for heavy AVX512 operations. This means
the instructions for popcount : _mm512_popcnt_epi64(),
_mm512_reduce_add_epi64() are not going to have any significant impact
on CPU clock frequency.

Clock frequency impact aside, we measured the same benchmark for gains
on older Intel hardware and observe up to 18% better performance on
Intel Icelake. On older intel hardware, the popcntdq 512 instruction
is not present so it won’t work. If clock frequency is not affected,
rest of workload should not be impacted in the case of mixed workloads.

Thanks for sharing your analysis.

Testing this on smaller block sizes < 8KiB shows that AVX512 compared
to the current 64bit behavior shows slightly lower performance, but
with a large variance. We cannot conclude much from it. The testing
with ANALYZE benchmark by Nathan also points to no visible impact as a
result of using AVX512. The gains on larger dataset is easily evident,
with less variance.

What are your thoughts if we introduce AVX512 popcount for smaller
sizes as an optional feature initially, and then test it more
thoroughly over time on this particular use case?

I don't see any need to rush this. At the very earliest, this feature would go into v17, which doesn't enter feature freeze until April 2024.
That seems like enough time to complete any additional testing you'd like to do. However, if you are seeing worse performance with this patch, then it seems unlikely that we'd want to proceed.

Thoughts or feedback on the approach in the patch? This solution
should not impact anyone who doesn’t use the feature i.e. AVX512. Open
to additional ideas if this doesn’t seem like the right approach here.

It's true that it wouldn't impact anyone not using the feature, but there's also a decent chance that this code goes virtually untested. As I've stated elsewhere [0], I think we should ensure there's buildfarm coverage for this kind of architecture-specific stuff.

[0]: /messages/by-id/20230726043707.GB3211130@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

perf-avx512-1.8mrows.svg (application/octet-stream)
perf-with-64bit-1.8m.svg (application/octet-stream)
popcount_avx512.patch (application/octet-stream) +27-0
#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Shankaran, Akash (#11)
Re: Popcount optimization using AVX512

On 2024-Jan-25, Shankaran, Akash wrote:

With the updated patch, we observed significant improvements and
handily beat the previous popcount algorithm performance. No
regressions in any scenario are observed:
Platform: Intel Xeon Platinum 8360Y (Icelake) for data sizes 1kb - 64kb.
Microbenchmark: 2x - 3x gains presently vs 19% previously, on the same
microbenchmark described initially in this thread.

These are great results.

However, it would be much better if the improved code were available for
all relevant builds and activated if a CPUID test determines that the
relevant instructions are available, instead of requiring a compile-time
flag -- which most builds are not going to use, thus wasting the
opportunity for running the optimized code.

I suppose this would require patching pg_popcount64_choose() to be more
specific. Looking at the existing code, I would also consider renaming
the "_fast" variants to something like pg_popcount32_asml/
pg_popcount64_asmq so that you can name the new one pg_popcount64_asmdq
or such. (Or maybe leave the 32-bit version alone as "fast/slow", since
there's no third option for that one -- or do I misread?)

I also think this needs to move the CFLAGS-decision-making elsewhere;
asking the user to get it right is too much of a burden. Is it workable
to simply verify compiler support for the additional flags needed, and
if so add them to a new CFLAGS_BITUTILS variable or such? We already
have the CFLAGS_CRC model that should be easy to follow. Should be easy
enough to mostly copy what's in configure.ac and meson.build, right?

Finally, the matter of using ifunc as proposed by Noah seems to be still
in the air, with no patches offered for the popcount family. Given that
Nathan reports [1] a performance decrease, maybe we should set that
thought aside for now and continue to use function pointers. It's worth
keeping in mind that popcount is already using function pointers (at
least in the case where we try to use POPCNT directly), so patching to
select between three options instead of between two wouldn't be a
regression.

[1]: /messages/by-id/20231107201441.GA898662@nathanxps13

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre)

#13 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#12)
Re: Popcount optimization using AVX512

On 2024-Jan-25, Alvaro Herrera wrote:

Finally, the matter of using ifunc as proposed by Noah seems to be still
in the air, with no patches offered for the popcount family.

Oh, I just realized that the patch as currently proposed is placing the
optimized popcount code in the path that does not require going through
a function pointer. So the performance increase is probably coming from
both avoiding jumping through the pointer as well as from the improved
instruction.

This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
/messages/by-id/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net

#14 Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Alvaro Herrera (#12)
RE: Popcount optimization using AVX512

Hi All,

However, it would be much better if the improved code were available for
all relevant builds and activated if a CPUID test determines that the
relevant instructions are available, instead of requiring a compile-time
flag -- which most builds are not going to use, thus wasting the
opportunity for running the optimized code.

This makes sense. I addressed the feedback and am attaching an updated patch. The patch also addresses your autoconf feedback by adding CFLAGS support. I tested the runtime check for AVX512 on multiple processors, with and without AVX512, and it detected (or failed to detect) the feature as expected.

Looking at the existing code, I would also consider renaming
the "_fast" variants to something like pg_popcount32_asml/
pg_popcount64_asmq so that you can name the new one pg_popcount64_asmdq
or such.

I left out the renaming, as it made sense to keep the fast/slow naming for readability.

Finally, the matter of using ifunc as proposed by Noah seems to be still
in the air, with no patches offered for the popcount family. Given that
Nathan reports [1] a performance decrease, maybe we should set that
thought aside for now and continue to use function pointers.

Since there are improvements without it (results below), I agree with you to continue using function pointers.

I collected data on machines with and without AVX512 support, using a table with 1M rows and performing SQL bit_count() on a char column containing 84-byte, 4 KiB, 8 KiB, and 16 KiB values.
* On non-AVX512 hardware: no runtime regression or impact between the patched and unpatched servers, even with AVX512 support built into the binary.
* On AVX512 hardware: the maximum improvement I saw was 17%, averaging closer to 6.5% on a bare-metal machine. The benefit is lower on smaller cloud VMs on AWS (1-3%).

If the patch looks good, please suggest next steps on committing it.

Paul

-----Original Message-----
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Sent: Thursday, January 25, 2024 1:49 AM
To: Shankaran, Akash <akash.shankaran@intel.com>
Cc: Nathan Bossart <nathandbossart@gmail.com>; Noah Misch <noah@leadboat.com>; Amonson, Paul D <paul.d.amonson@intel.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

On 2024-Jan-25, Shankaran, Akash wrote:

With the updated patch, we observed significant improvements and
handily beat the previous popcount algorithm performance. No
regressions in any scenario are observed:
Platform: Intel Xeon Platinum 8360Y (Icelake) for data sizes 1kb - 64kb.
Microbenchmark: 2x - 3x gains presently vs 19% previously, on the same
microbenchmark described initially in this thread.

These are great results.

However, it would be much better if the improved code were available for all relevant builds and activated if a CPUID test determines that the relevant instructions are available, instead of requiring a compile-time flag -- which most builds are not going to use, thus wasting the opportunity for running the optimized code.

I suppose this would require patching pg_popcount64_choose() to be more specific. Looking at the existing code, I would also consider renaming the "_fast" variants to something like pg_popcount32_asml/ pg_popcount64_asmq so that you can name the new one pg_popcount64_asmdq or such. (Or maybe leave the 32-bit version alone as "fast/slow", since there's no third option for that one -- or do I misread?)

I also think this needs to move the CFLAGS-decision-making elsewhere; asking the user to get it right is too much of a burden. Is it workable to simply verify compiler support for the additional flags needed, and if so add them to a new CFLAGS_BITUTILS variable or such? We already have the CFLAGS_CRC model that should be easy to follow. Should be easy enough to mostly copy what's in configure.ac and meson.build, right?
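Following the CFLAGS_CRC model mentioned above, the configure.ac side might look roughly like this. This is a hedged sketch only; the variable name CFLAGS_AVX512_POPCNT is illustrative, and the probe body mirrors the intrinsics used earlier in the thread:

```
# Sketch modeled on the CFLAGS_CRC checks: probe whether the compiler
# accepts the AVX-512 flags and can link the intrinsics, and if so
# collect the flags in their own variable instead of the global CFLAGS.
CFLAGS_AVX512_POPCNT=""
pgac_save_CFLAGS=$CFLAGS
CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512f"
AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <immintrin.h>]],
  [[__m512i val = _mm512_setzero_si512();
    return (int) _mm512_reduce_add_epi64(_mm512_popcnt_epi64(val));]])],
  [CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq -mavx512f"])
CFLAGS=$pgac_save_CFLAGS
AC_SUBST(CFLAGS_AVX512_POPCNT)
```

The per-file CFLAGS would then be applied in src/port/Makefile the same way CFLAGS_CRC is.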

Finally, the matter of using ifunc as proposed by Noah seems to be still in the air, with no patches offered for the popcount family. Given that Nathan reports [1] a performance decrease, maybe we should set that thought aside for now and continue to use function pointers. It's worth keeping in mind that popcount is already using function pointers (at least in the case where we try to use POPCNT directly), so patching to select between three options instead of between two wouldn't be a regression.

[1]: /messages/by-id/20231107201441.GA898662@nathanxps13

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre)

Attachments:

Official-PostgreSQL-AVX-512-POPCNT.patch (application/octet-stream, +256 -21)
#15Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amonson, Paul D (#14)
Re: Popcount optimization using AVX512

Hello,

This looks quite reasonable. On my machine, I get the compiler test to
pass so I get a "yes" in configure; but of course my CPU doesn't support
the instructions so I get the slow variant. So here's the patch again
with some minor artifacts fixed.

I have the following review notes:

1. we use __get_cpuid_count and __cpuidex by relying on macros
HAVE__GET_CPUID and HAVE__CPUID respectively; but those macros are (in
the current Postgres source) only used and tested for __get_cpuid and
__cpuid respectively. So unless there's some reason to be certain that
__get_cpuid_count is always present when __get_cpuid is present, and
that __cpuidex is present when __cpuid is present, I think we need to
add new configure tests and new HAVE_ macros for these.

2. we rely on <immintrin.h> being present with no AC_CHECK_HEADER()
test. We currently don't use this header anywhere, so I suppose we need
a test for this one as well. (Also, I suppose if we don't have
immintrin.h we can skip the rest of it?)

3. We do the __get_cpuid_count/__cpuidex test and we also do a xgetbv
test. The comment there claims that this is to check the results for
consistency. But ... how would we know that the results are ever
inconsistent? As far as I understand, if they were, we would silently
become slower. Is this really what we want? I'm confused about this
coding. Maybe we do need both tests to succeed? In that case, just
reword the comment.

I think if both tests are each considered reliable on its own, then we
could either choose one of them and stick with it, ignoring the other;
or we could use one as primary and then in a USE_ASSERT_CHECKING block
verify that the other matches and throw a WARNING if not (but what would
that tell us?). Or something like that ... not sure.

4. It needs meson support, which I suppose consists of copying the
c-compiler.m4 test into meson.build, mimicking what the tests for CRC
instructions do.

I started a CI run with this patch applied,
https://cirrus-ci.com/build/4912499619790848
but because Meson support is missing, the compile failed
immediately:

[10:08:48.825] ccache cc -Isrc/port/libpgport_srv.a.p -Isrc/include -I../src/include -Isrc/include/utils -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -g -fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -Wdeclaration-after-statement -Wno-format-truncation -Wno-stringop-truncation -fPIC -pthread -DBUILDING_DLL -MD -MQ src/port/libpgport_srv.a.p/pg_bitutils.c.o -MF src/port/libpgport_srv.a.p/pg_bitutils.c.o.d -o src/port/libpgport_srv.a.p/pg_bitutils.c.o -c ../src/port/pg_bitutils.c
[10:08:48.825] ../src/port/pg_bitutils.c: In function ‘pg_popcount512_fast’:
[10:08:48.825] ../src/port/pg_bitutils.c:270:11: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~
[10:08:48.825] In file included from /usr/lib/gcc/x86_64-linux-gnu/10/include/immintrin.h:55,
[10:08:48.825] from ../src/port/pg_bitutils.c:22:
[10:08:48.825] /usr/lib/gcc/x86_64-linux-gnu/10/include/avx512fintrin.h:339:1: error: inlining failed in call to ‘always_inline’ ‘_mm512_setzero_si512’: target specific option mismatch
[10:08:48.825] 339 | _mm512_setzero_si512 (void)
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~
[10:08:48.825] ../src/port/pg_bitutils.c:270:25: note: called from here
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~~~

Thanks

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)

Attachments:

v3-0001-Add-support-for-AVX512-implemented-POPCNT.patch (text/x-diff, +247 -14)
#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#15)
Re: Popcount optimization using AVX512

I happened to notice by chance that John Naylor had posted an extension
to measure performance of popcount here:
/messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

This might be useful as a base for a new one to verify the results of
the proposed patch in machines with relevant instruction support.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"We're here to devour each other alive" (Hobbes)

#17Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Alvaro Herrera (#15)
RE: Popcount optimization using AVX512

Álvaro,

All feedback is now addressed. I added the additional checks for the new APIs and a separate check for the header to autoconf.

Regarding the double check for AVX512: I added a large comment explaining why both are needed. There are cases where the CPU's ZMM registers are not exposed by the OS or hypervisor even though the CPU supports AVX512.

The big change is adding all old and new build support to meson. I am new to meson/ninja so please review carefully.

Thanks,
Paul

-----Original Message-----
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Sent: Wednesday, February 7, 2024 2:13 AM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Shankaran, Akash <akash.shankaran@intel.com>; Nathan Bossart <nathandbossart@gmail.com>; Noah Misch <noah@leadboat.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

Hello,

This looks quite reasonable. On my machine, I get the compiler test to pass so I get a "yes" in configure; but of course my CPU doesn't support the instructions so I get the slow variant. So here's the patch again with some minor artifacts fixed.

I have the following review notes:

1. we use __get_cpuid_count and __cpuidex by relying on macros HAVE__GET_CPUID and HAVE__CPUID respectively; but those macros are (in the current Postgres source) only used and tested for __get_cpuid and __cpuid respectively. So unless there's some reason to be certain that __get_cpuid_count is always present when __get_cpuid is present, and that __cpuidex is present when __cpuid is present, I think we need to add new configure tests and new HAVE_ macros for these.

2. we rely on <immintrin.h> being present with no AC_CHECK_HEADER() test. We currently don't use this header anywhere, so I suppose we need a test for this one as well. (Also, I suppose if we don't have immintrin.h we can skip the rest of it?)

3. We do the __get_cpuid_count/__cpuidex test and we also do a xgetbv test. The comment there claims that this is to check the results for consistency. But ... how would we know that the results are ever inconsistent? As far as I understand, if they were, we would silently become slower. Is this really what we want? I'm confused about this coding. Maybe we do need both tests to succeed? In that case, just reword the comment.

I think if both tests are each considered reliable on its own, then we could either choose one of them and stick with it, ignoring the other; or we could use one as primary and then in a USE_ASSERT_CHECKING block verify that the other matches and throw a WARNING if not (but what would that tell us?). Or something like that ... not sure.

4. It needs meson support, which I suppose consists of copying the
c-compiler.m4 test into meson.build, mimicking what the tests for CRC instructions do.

I started a CI run with this patch applied,
https://cirrus-ci.com/build/4912499619790848
but because Meson support is missing, the compile failed
immediately:

[10:08:48.825] ccache cc -Isrc/port/libpgport_srv.a.p -Isrc/include -I../src/include -Isrc/include/utils -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -g -fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -Wdeclaration-after-statement -Wno-format-truncation -Wno-stringop-truncation -fPIC -pthread -DBUILDING_DLL -MD -MQ src/port/libpgport_srv.a.p/pg_bitutils.c.o -MF src/port/libpgport_srv.a.p/pg_bitutils.c.o.d -o src/port/libpgport_srv.a.p/pg_bitutils.c.o -c ../src/port/pg_bitutils.c
[10:08:48.825] ../src/port/pg_bitutils.c: In function ‘pg_popcount512_fast’:
[10:08:48.825] ../src/port/pg_bitutils.c:270:11: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~
[10:08:48.825] In file included from /usr/lib/gcc/x86_64-linux-gnu/10/include/immintrin.h:55,
[10:08:48.825] from ../src/port/pg_bitutils.c:22:
[10:08:48.825] /usr/lib/gcc/x86_64-linux-gnu/10/include/avx512fintrin.h:339:1: error: inlining failed in call to ‘always_inline’ ‘_mm512_setzero_si512’: target specific option mismatch
[10:08:48.825] 339 | _mm512_setzero_si512 (void)
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~
[10:08:48.825] ../src/port/pg_bitutils.c:270:25: note: called from here
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~~~

Thanks

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)

Attachments:

v4-0001-Add-support-for-AVX512-implemented-POPCNT.patch (application/octet-stream, +476 -18)
#18Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#13)
Re: Popcount optimization using AVX512

Hi,

On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:

This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.

Ifuncs are effectively implemented as a function call via a pointer, they're
not magic, unfortunately. The sole trick they provide is that you don't
manually have to use the function pointer.

Greetings,

Andres

#19Andres Freund
andres@anarazel.de
In reply to: Amonson, Paul D (#17)
Re: Popcount optimization using AVX512

Hi,

On 2024-02-09 17:39:46 +0000, Amonson, Paul D wrote:

diff --git a/meson.build b/meson.build
index 8ed51b6aae..1e7a4dc942 100644
--- a/meson.build
+++ b/meson.build
@@ -1773,6 +1773,45 @@ elif cc.links('''
endif
+# XXX: The configure.ac check for __cpuidex() is broken, we don't copy that
+# here. To prevent problems due to two detection methods working, stop
+# checking after one.

This seems like a bogus copy-paste.

+if cc.links('''
+    #include <cpuid.h>
+    int main(int arg, char **argv)
+    {
+        unsigned int exx[4] = {0, 0, 0, 0};
+        __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+    }
+    ''', name: '__get_cpuid_count',
+    args: test_c_args)
+  cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+    #include <intrin.h>
+    int main(int arg, char **argv)
+    {
+        unsigned int exx[4] = {0, 0, 0, 0};
+        __cpuidex(exx, 7, 0);
+    }
+    ''', name: '__cpuidex',
+    args: test_c_args)
+  cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.links('''
+    #include <immintrin.h>
+    int main(int arg, char **argv)
+    {
+      return 1701;
+    }
+    ''', name: '__immintrin',
+    args: test_c_args)
+  cdata.set('HAVE__IMMINTRIN', 1)
+endif

Do these all actually have to link? Invoking the linker is slow.

I think you might be able to just use cc.has_header_symbol().
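That suggestion might look like the following in meson.build (a sketch only, not the committed form; it reuses the HAVE_ names from the quoted patch):

```meson
# probe the header/symbol directly instead of invoking the linker
if cc.has_header_symbol('cpuid.h', '__get_cpuid_count', args: test_c_args)
  cdata.set('HAVE__GET_CPUID_COUNT', 1)
elif cc.has_header_symbol('intrin.h', '__cpuidex', args: test_c_args)
  cdata.set('HAVE__CPUIDEX', 1)
endif

if cc.has_header('immintrin.h', args: test_c_args)
  cdata.set('HAVE__IMMINTRIN', 1)
endif
```

cc.has_header_symbol() and cc.has_header() only run the preprocessor/compiler, so they are considerably faster than cc.links().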

+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+  prog = '''
+      #include <immintrin.h>
+      #include <stdint.h>
+      void main(void)
+      {
+        __m512i tmp __attribute__((aligned(64)));
+        __m512i input = _mm512_setzero_si512();
+        __m512i output = _mm512_popcnt_epi64(input);
+        uint64_t cnt = 999;
+        _mm512_store_si512(&tmp, output);
+        cnt = _mm512_reduce_add_epi64(tmp);
+        /* return computed value, to prevent the above being optimized away */
+        return cnt == 0;
+      }'''

Does this work with msvc?

+ if cc.links(prog, name: '_mm512_setzero_si512, _mm512_popcnt_epi64, _mm512_store_si512, and _mm512_reduce_add_epi64 with -mavx512vpopcntdq -mavx512f',

That's a very long line in the output, how about using the avx feature name or
something?

diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..6a01a7d89a 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -87,6 +87,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_srv.o:CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 69b30ab21b..1c48a3b07e 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -184,6 +184,7 @@ foreach name, opts : pgport_variants
link_with: cflag_libs,
c_pch: pch_c_h,
kwargs: opts + {
+        'c_args': opts.get('c_args', []) + cflags_avx512_popcnt,
'dependencies': opts['dependencies'] + [ssl],
}
)

This will build all of pgport with the avx flags, which wouldn't be correct, I
think? The compiler might inject automatic uses of avx512 in places, which
would cause problems, no?

While you don't do the same for make, isn't even just using the avx512 flags for all
of pg_bitutils.c broken for exactly that reason? That's why the existing code
builds the files for the various crc variants as their own files.

Greetings,

Andres Freund

#20Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#18)
Re: Popcount optimization using AVX512

On Fri, Feb 09, 2024 at 10:24:32AM -0800, Andres Freund wrote:

On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:

This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.

Ifuncs are effectively implemented as a function call via a pointer, they're
not magic, unfortunately. The sole trick they provide is that you don't
manually have to use the function pointer.

The IFUNC creators introduced it so glibc could use arch-specific memcpy with
the instruction sequence of a non-pointer, extern function call, not the
instruction sequence of a function pointer call. I don't know why the
upthread ifunc_test.patch benchmark found ifunc performing worse than function
pointers. However, it would be odd if toolchains have replaced the original
IFUNC with something equivalent to or slower than function pointers.

#21Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#20)
#22Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#21)
#23Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Andres Freund (#19)
#24Andres Freund
andres@anarazel.de
In reply to: Amonson, Paul D (#23)
#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Noah Misch (#22)
#26Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Andres Freund (#24)
#27Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#26)
#28Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#27)
#29Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#28)
#30Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#29)
#31Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#30)
#32Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#31)
#33Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#32)
#34Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#33)
#35Bruce Momjian
bruce@momjian.us
In reply to: Amonson, Paul D (#34)
#36Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#34)
#37Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amonson, Paul D (#30)
#38Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#37)
#39Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#36)
#40Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#39)
#41Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#40)
#42Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#41)
#43Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#42)
#44Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#43)
#45Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#41)
#46Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#45)
#47Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#46)
#48Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#47)
#49David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#46)
#50Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#49)
#51Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#50)
#52Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#51)
#53Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#52)
#54Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#52)
#55Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#53)
#56Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#54)
#57David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#54)
#58Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#57)
#59Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#58)
#60Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#59)
#61David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#58)
#62Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#61)
#63Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#62)
#64David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#63)
#65Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#64)
#66Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#65)
#67David Rowley
dgrowleyml@gmail.com
In reply to: Amonson, Paul D (#66)
#68Amonson, Paul D
paul.d.amonson@intel.com
In reply to: David Rowley (#67)
#69David Rowley
dgrowleyml@gmail.com
In reply to: Amonson, Paul D (#66)
#70Amonson, Paul D
paul.d.amonson@intel.com
In reply to: David Rowley (#69)
#71Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#70)
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amonson, Paul D (#71)
#73Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Tom Lane (#72)
#74Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#72)
#75Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#73)
#76Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#75)
#77Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#76)
#78Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#77)
#79Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#78)
#80Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#79)
#81Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#79)
#82Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amonson, Paul D (#81)
#83Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#81)
#84Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#81)
#85Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#82)
#86Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#83)
#87Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#84)
#88Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#87)
#89Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#86)
#90Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#88)
#91Shankaran, Akash
akash.shankaran@intel.com
In reply to: Nathan Bossart (#88)
#92Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#90)
#93Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#85)
#94Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#89)
#95Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#94)
#96Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#95)
#97Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#96)
#98Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#97)
#99Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#98)
#100Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#99)
#101Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nathan Bossart (#100)
#102Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#101)
#103Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#102)
#104Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#103)
#105Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#104)
#106Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#104)
#107Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#105)
#108Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#106)
#109Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nathan Bossart (#108)
#110Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#109)
#111Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#110)
#112Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#104)
#113Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#111)
#114Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#113)
#115Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#114)
#116Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#115)
#117Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#116)
#118Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#112)
#119David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#118)
#120Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#118)
#121Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#119)
#122Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#120)
#123Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#122)
#124Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#123)
#125Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#124)
#126Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#125)
#127David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#126)
#128Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#127)
#129David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#128)
#130Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#129)
#131Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#130)
#132Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#130)
#133Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#132)
#134Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#133)
#135Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#134)
#136Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#135)
#137Shankaran, Akash
akash.shankaran@intel.com
In reply to: Nathan Bossart (#136)
#138Nathan Bossart
nathandbossart@gmail.com
In reply to: Shankaran, Akash (#137)
#139Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#129)
#140Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#139)
#141Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#140)
#142Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#141)
#143Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#142)
#144Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#143)
#145Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#144)
#146Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#145)
#147Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#146)
#148Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#146)
#149Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#148)
#150Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#148)
#151Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#150)
#152Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#151)
#153Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#152)
#154Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#153)
#155Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#154)
#156Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#155)
#157Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#156)
#158Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#157)
#159Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#158)
#160Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#159)
#161Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Devulapalli, Raghuveer (#160)
#162Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#161)
#163Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#162)
#164Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#162)
#165Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#164)
#166Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#165)
#167Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#166)
#168Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#167)
#169Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#168)
#170Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#169)
#171Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#170)
#172Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#171)
#173Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#172)