Popcount optimization using AVX512

Started by Amonson, Paul D over 2 years ago · 173 messages
#1 Amonson, Paul D
paul.d.amonson@intel.com

This proposal showcases the speed-up provided to the popcount feature when using AVX512 registers. The intent is to share the preliminary results with the community and get feedback on adding AVX512 support for popcount.
 
Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on PostgreSQL's current pg_popcount() implementations for x86_64, using the newer AVX512 intrinsics. This implementation has improved performance by up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.
 
My setup:
 
Machine: AWS EC2 m7i - 16 vCPU, 64 GB RAM
OS: Ubuntu 22.04
GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".

1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
a. software-only version, and
b. 64-bit SSE version
2. I created an implementation using the following AVX512 intrinsics:
a. _mm512_popcnt_epi64()
b. _mm512_reduce_add_epi64()
3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])
4. I tested 5 seeds for each input buffer size and averaged 100 runs each (5*5*100=2500 pg_popcount() calls on a single thread)
5. Data: <See Attached picture.>
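For reference, steps 3-4 above could be realized with a harness along these lines (a sketch only: the original is a C++ project using std::mt19937_64; here splitmix64 stands in as the seeded RNG, and bench_popcount/count_bits are hypothetical names):

```c
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* splitmix64 PRNG: stand-in for the std::mt19937_64 used in the
 * original C++ harness; deterministic for a given seed. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Reference bit count over a word buffer (pg_popcount stand-in). */
static uint64_t count_bits(const uint64_t *buf, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t) __builtin_popcountll(buf[i]);
    return total;
}

/* Fill a seeded buffer, run `runs` timed repetitions, and return elapsed
 * seconds plus a checksum (so the calls cannot be optimized away). */
double bench_popcount(uint64_t seed, size_t nwords, int runs, uint64_t *checksum)
{
    static uint64_t buf[1u << 14];      /* 128 KiB scratch for the sketch */
    if (nwords > (1u << 14))
        nwords = 1u << 14;
    uint64_t state = seed;
    for (size_t i = 0; i < nwords; i++)
        buf[i] = splitmix64(&state);

    clock_t start = clock();
    uint64_t total = 0;
    for (int r = 0; r < runs; r++)
        total += count_bits(buf, nwords);
    *checksum = total / (uint64_t) runs;
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Averaging multiple seeds and runs, as in step 4, then just wraps this in two loops.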

The code I wrote falls back to the 64-bit or software implementation for memory that is not aligned to a 512-bit boundary:
 
///////////////////////////////////////////////////////////////////////
// 512-bit intrinsic implementation (AVX512VPOPCNTDQ + AVX512F)
uint64_t popcount_512_impl(const char *bytes, uint64_t byteCount) {
#ifdef __AVX__
    uint64_t result = 0;
    // Use the 64-bit path for leading bytes up to the next 64-byte
    // (512-bit) boundary.
    uint64_t remainder = (64 - ((uintptr_t) bytes) % 64) % 64;
    if (remainder > byteCount)
        remainder = byteCount;
    result += popcount_64_impl(bytes, remainder);
    byteCount -= remainder;
    bytes += remainder;
    uint64_t vectorCount = byteCount / 64;
    remainder = byteCount % 64;
    const __m512i *vectors = (const __m512i *) bytes;
    __m512i rv;
    while (vectorCount--) {
        rv = _mm512_popcnt_epi64(*(vectors++));
        result += _mm512_reduce_add_epi64(rv);
    }
    // Use the 64-bit path for any trailing bytes.
    bytes = (const char *) vectors;
    result += popcount_64_impl(bytes, remainder);
    return result;
#else
    return popcount_64_impl(bytes, byteCount);
#endif
}
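The 64-bit helper popcount_64_impl() referenced above is not included in the post; a minimal portable reconstruction (mine, not the author's code) could look like:

```c
#include <stdint.h>
#include <string.h>

/* Reconstruction for illustration only. Counts bits one 64-bit word at
 * a time, then falls back to byte-at-a-time for the tail. */
uint64_t popcount_64_impl(const char *bytes, uint64_t byteCount)
{
    uint64_t result = 0;
    uint64_t word;
    while (byteCount >= 8) {
        memcpy(&word, bytes, 8);   /* safe for any alignment */
        result += (uint64_t) __builtin_popcountll(word);
        bytes += 8;
        byteCount -= 8;
    }
    while (byteCount--)            /* trailing bytes */
        result += (uint64_t) __builtin_popcountll((unsigned char) *bytes++);
    return result;
}
```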
 
There are further optimizations that can be applied here, but for demonstration I guarded the code with the __AVX__ macro; when it is not defined, the code falls back to the original PostgreSQL implementations.
 
The 46% improvement in popcount seems worthy of discussion, considering the existing 64-bit SSE and software implementations.
 
 Thanks,
Paul Amonson

Attachments:

AVX512 Popcount Benefits.png (image/png)
#2 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Amonson, Paul D (#1)
Re: Popcount optimization using AVX512

On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:

This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share the preliminary results with the community and get feedback for adding avx512 support for popcount.

Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the pg_popcount() in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this implementation has improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.

How does this compare to older CPUs, and more mixed workloads? IIRC,
the use of AVX512 (which I believe this instruction to be included in)
has significant implications for core clock frequency when those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

My setup:

Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
OS : Ubuntu 22.04
GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".

1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
a. Software only and
b. SSE 64 bit version
2. I created an implementation using the following AVX512 intrinsics:
a. _mm512_popcnt_epi64()
b. _mm512_reduce_add_epi64()
3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])

Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#3 Nathan Bossart
nathandbossart@gmail.com
In reply to: Matthias van de Meent (#2)
Re: Popcount optimization using AVX512

On Fri, Nov 03, 2023 at 12:16:05PM +0100, Matthias van de Meent wrote:

On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:

This proposal showcases the speed-up provided to popcount feature when
using AVX512 registers. The intent is to share the preliminary results
with the community and get feedback for adding avx512 support for
popcount.

Revisiting the previous discussion/improvements around this feature, I
have created a micro-benchmark based on the pg_popcount() in
PostgreSQL's current implementations for x86_64 using the newer AVX512
intrinsics. Playing with this implementation has improved performance up
to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will
benefit scenarios relying on popcount.

Nice. I've been testing out AVX2 support in src/include/port/simd.h, and
the results look promising there, too. I intend to start a new thread for
that (hopefully soon), but one open question I don't have a great answer
for yet is how to detect support for newer intrinsics. So far, we've been
able to use function pointers (e.g., popcount, crc32c) or deduce support
via common predefined compiler macros (e.g., we assume SSE2 is supported if
the compiler is targeting 64-bit x86). But the former introduces a
performance penalty, and we probably want to inline most of this stuff,
anyway. And the latter limits us to stuff that has been around for a
decade or two.
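For readers following along, the function-pointer scheme described above (as used today for popcount and crc32c) can be sketched roughly like this; names are hypothetical, and the CPU check uses GCC's __builtin_cpu_supports rather than PostgreSQL's cpuid code for brevity:

```c
#include <stdint.h>

/* Portable fallback: shift-and-add. */
static uint64_t popcount64_slow(uint64_t word)
{
    uint64_t count = 0;
    while (word) {
        count += word & 1;
        word >>= 1;
    }
    return count;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Hardware POPCNT via compiler builtin. */
static uint64_t popcount64_fast(uint64_t word)
{
    return (uint64_t) __builtin_popcountll(word);
}
#endif

static uint64_t popcount64_choose(uint64_t word);

/* The first call lands in the chooser, which repoints the pointer;
 * every later call pays only the indirect-call overhead. */
static uint64_t (*pg_popcount64)(uint64_t) = popcount64_choose;

static uint64_t popcount64_choose(uint64_t word)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("popcnt"))
        pg_popcount64 = popcount64_fast;
    else
#endif
        pg_popcount64 = popcount64_slow;
    return pg_popcount64(word);
}
```

The indirect call is exactly the per-call overhead being debated in this thread.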

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

Yeah, the previous optimizations in this area appear to have used ANALYZE
as the benchmark, presumably because of visibilitymap_count(). I briefly
attempted to measure the difference with and without AVX512 support, but I
haven't noticed any difference thus far. One complication for
visibilitymap_count() is that the data passed to pg_popcount64() is masked,
which requires a couple more instructions when you're using the intrinsics.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#3)
Re: Popcount optimization using AVX512

Nathan Bossart <nathandbossart@gmail.com> writes:

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization. (One
of the key things you want to be able to say, when trying to sell
a maybe-it-helps-or-maybe-it-doesnt optimization to the PG community,
is "it doesn't hurt anyone who's not able to benefit".) And you
can't argue that that overhead is negligible either, because if it
is then we're all wasting our time even discussing this. So we need
a better technology, and I fear I have no good ideas about what.

Your comment about vectorization hints at one answer: if you can
amortize the overhead across multiple applications of the operation,
then it doesn't hurt so much. But I'm not sure how often we can
make that answer work.

regards, tom lane

#5 Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#4)
Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization.

The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
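As a sketch of what such an ifunc could look like (hypothetical names; GCC on glibc/ELF assumed, and x86-only because of __builtin_cpu_supports):

```c
#include <stdint.h>

/* Portable fallback: clear-lowest-set-bit loop. */
static uint64_t my_popcount_generic(uint64_t x)
{
    uint64_t n = 0;
    for (; x; x &= x - 1)
        n++;
    return n;
}

/* Hardware POPCNT via compiler builtin. */
static uint64_t my_popcount_popcnt(uint64_t x)
{
    return (uint64_t) __builtin_popcountll(x);
}

/* Resolver runs once at load time; the dynamic linker then binds
 * my_popcount directly to the chosen function, so callers pay no
 * function-pointer indirection. */
static uint64_t (*resolve_my_popcount(void))(uint64_t)
{
    __builtin_cpu_init();   /* required before CPU checks in a resolver */
    if (__builtin_cpu_supports("popcnt"))
        return my_popcount_popcnt;
    return my_popcount_generic;
}

uint64_t my_popcount(uint64_t) __attribute__((ifunc("resolve_my_popcount")));
```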

#6 Nathan Bossart
nathandbossart@gmail.com
In reply to: Noah Misch (#5)
Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:

On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization.

The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html

Thanks, that seems promising for the function pointer cases. I'll plan on
trying to convert one of the existing ones to use it. BTW it looks like
LLVM has something similar [0].

IIUC this unfortunately wouldn't help for cases where we wanted to keep
stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
unless we applied it to the calling functions, but that doesn't sound
particularly maintainable.

[0]: https://llvm.org/docs/LangRef.html#ifuncs

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#7 Noah Misch
noah@leadboat.com
In reply to: Nathan Bossart (#6)
Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:

On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:

On Mon, Nov 06, 2023 at 09:52:58PM -0500, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

Yeah. The function-pointer solution kind of sucks, because for the
sort of operation we're considering here, adding a call and return
is probably order-of-100% overhead. Worse, it adds similar overhead
for everyone who doesn't get the benefit of the optimization.

The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html

Thanks, that seems promising for the function pointer cases. I'll plan on
trying to convert one of the existing ones to use it. BTW it looks like
LLVM has something similar [0].

IIUC this unfortunately wouldn't help for cases where we wanted to keep
stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
unless we applied it to the calling functions, but that doesn't sound
particularly maintainable.

Agreed, it doesn't solve inline cases. If the gains are big enough, we should
move toward packages containing N CPU-specialized copies of the postgres
binary, with bin/postgres just exec'ing the right one.
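As a rough illustration of that launcher idea (purely hypothetical; choose_binary and the binary names are made up), the dispatch step might look like:

```c
#include <string.h>

/* Hypothetical sketch: a thin launcher picks a CPU-specialized binary
 * name at startup; the real wrapper would then execv() that path. */
const char *choose_binary(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        return "postgres.avx512";
#endif
    return "postgres.generic";
}
```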

#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Noah Misch (#7)
Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:

On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:

On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:

The glibc/gcc "ifunc" mechanism was designed to solve this problem of choosing
a function implementation based on the runtime CPU, without incurring function
pointer overhead. I would not attempt to use AVX512 on non-glibc systems, and
I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html

Thanks, that seems promising for the function pointer cases. I'll plan on
trying to convert one of the existing ones to use it. BTW it looks like
LLVM has something similar [0].

IIUC this unfortunately wouldn't help for cases where we wanted to keep
stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
unless we applied it to the calling functions, but that doesn't ѕound
particularly maintainable.

Agreed, it doesn't solve inline cases. If the gains are big enough, we should
move toward packages containing N CPU-specialized copies of the postgres
binary, with bin/postgres just exec'ing the right one.

I performed a quick test with ifunc on my x86 machine that ordinarily uses
the runtime checks for the CRC32C code, and I actually see a consistent
3.5% regression for pg_waldump -z on 100M 65-byte records. I've attached
the patch used for testing.

The multiple-copies-of-the-postgres-binary idea seems interesting. That's
probably not something that could be enabled by default, but perhaps we
could add support for a build option.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

ifunc_test.patch (text/x-diff) +7-7
#9 Shankaran, Akash
akash.shankaran@intel.com
In reply to: Nathan Bossart (#8)
RE: Popcount optimization using AVX512

Sorry for the late response. We spent some time researching and measuring the frequency impact of the AVX512 instructions used in this work.

How does this compare to older CPUs, and more mixed workloads? IIRC,

the use of AVX512 (which I believe this instruction to be included in)
has significant implications for core clock frequency when those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

AVX512 has light and heavy instructions. While the heavy AVX512 instructions have clock frequency implications, the light instructions not so much; see [0] for more details. We captured EMON data for the benchmark used in this work and see that the instructions run at a licensing level not meant for heavy AVX512 operations. This means the popcount instructions _mm512_popcnt_epi64() and _mm512_reduce_add_epi64() are not going to have any significant impact on CPU clock frequency.
Clock frequency impact aside, we measured the same benchmark on older Intel hardware and observe up to 18% better performance on Intel Icelake. On still older Intel hardware, the 512-bit VPOPCNTDQ instruction is not present, so the new path won't run there. If clock frequency is not affected, the rest of the workload should not be impacted in the case of mixed workloads.

Apart from the two type functions bytea_bit_count and bit_bit_count

(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

Testing on smaller block sizes (< 8 KiB) shows that AVX512 has slightly lower performance than the current 64-bit behavior, but with a large variance; we cannot conclude much from it. The testing with the ANALYZE benchmark by Nathan also points to no visible impact from using AVX512. The gains on larger datasets are easily evident, with less variance.
What are your thoughts if we introduce AVX512 popcount for smaller sizes as an optional feature initially, and then test it more thoroughly over time on this particular use case?

Regarding enablement: following the other responses on function inlining, ifunc, and enabling future intrinsic support, it seems a concrete solution requires further discussion. We are attaching a patch to enable AVX512, which can apply AVX512 flags during the build. For example:

make -E CFLAGS_AVX512="-mavx -mavx512dq -mavx512vpopcntdq -mavx512vl -march=icelake-server -DAVX512_POPCNT=1"

Thoughts or feedback on the approach in the patch? This solution should not impact anyone who doesn’t use the feature i.e. AVX512. Open to additional ideas if this doesn’t seem like the right approach here.

[0]: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Tuesday, November 7, 2023 12:15 PM
To: Noah Misch <noah@leadboat.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; Amonson, Paul D <paul.d.amonson@intel.com>; pgsql-hackers@lists.postgresql.org; Shankaran, Akash <akash.shankaran@intel.com>
Subject: Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:

On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:

On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:

The glibc/gcc "ifunc" mechanism was designed to solve this problem
of choosing a function implementation based on the runtime CPU,
without incurring function pointer overhead. I would not attempt
to use AVX512 on non-glibc systems, and I would use ifunc to select the desired popcount implementation on glibc:
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html

Thanks, that seems promising for the function pointer cases. I'll
plan on trying to convert one of the existing ones to use it. BTW it
looks like LLVM has something similar [0].

IIUC this unfortunately wouldn't help for cases where we wanted to
keep stuff inlined, such as is_valid_ascii() and the functions in
pg_lfind.h, unless we applied it to the calling functions, but that
doesn't sound particularly maintainable.

Agreed, it doesn't solve inline cases. If the gains are big enough,
we should move toward packages containing N CPU-specialized copies of
the postgres binary, with bin/postgres just exec'ing the right one.

I performed a quick test with ifunc on my x86 machine that ordinarily uses the runtime checks for the CRC32C code, and I actually see a consistent 3.5% regression for pg_waldump -z on 100M 65-byte records. I've attached the patch used for testing.

The multiple-copies-of-the-postgres-binary idea seems interesting. That's probably not something that could be enabled by default, but perhaps we could add support for a build option.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

proposed_popcnt.patch (application/octet-stream) +95-28
#10 Nathan Bossart
nathandbossart@gmail.com
In reply to: Shankaran, Akash (#9)
Re: Popcount optimization using AVX512

On Wed, Nov 15, 2023 at 08:27:57PM +0000, Shankaran, Akash wrote:

AVX512 has light and heavy instructions. While the heavy AVX512
instructions have clock frequency implications, the light instructions
not so much. See [0] for more details. We captured EMON data for the
benchmark used in this work, and see that the instructions are using the
licensing level not meant for heavy AVX512 operations. This means the
instructions for popcount : _mm512_popcnt_epi64(),
_mm512_reduce_add_epi64() are not going to have any significant impact on
CPU clock frequency.

Clock frequency impact aside, we measured the same benchmark for gains on
older Intel hardware and observe up to 18% better performance on Intel
Icelake. On older intel hardware, the popcntdq 512 instruction is not
present so it won’t work. If clock frequency is not affected, rest of
workload should not be impacted in the case of mixed workloads.

Thanks for sharing your analysis.

Testing this on smaller block sizes < 8KiB shows that AVX512 compared to
the current 64bit behavior shows slightly lower performance, but with a
large variance. We cannot conclude much from it. The testing with ANALYZE
benchmark by Nathan also points to no visible impact as a result of using
AVX512. The gains on larger dataset is easily evident, with less
variance.

What are your thoughts if we introduce AVX512 popcount for smaller sizes
as an optional feature initially, and then test it more thoroughly over
time on this particular use case?

I don't see any need to rush this. At the very earliest, this feature
would go into v17, which doesn't enter feature freeze until April 2024.
That seems like enough time to complete any additional testing you'd like
to do. However, if you are seeing worse performance with this patch, then
it seems unlikely that we'd want to proceed.

Thoughts or feedback on the approach in the patch? This solution should
not impact anyone who doesn’t use the feature i.e. AVX512. Open to
additional ideas if this doesn’t seem like the right approach here.

It's true that it wouldn't impact anyone not using the feature, but there's
also a decent chance that this code goes virtually untested. As I've
stated elsewhere [0], I think we should ensure there's buildfarm coverage
for this kind of architecture-specific stuff.

[0]: /messages/by-id/20230726043707.GB3211130@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#11 Shankaran, Akash
akash.shankaran@intel.com
In reply to: Nathan Bossart (#10)
RE: Popcount optimization using AVX512

Sorry for the late response. We did some further testing and research on our end, and ended up modifying the AVX512-based popcount algorithm. We removed a scalar dependency and now accumulate the results of the popcnt instruction in a zmm register, only performing the reduce-add at the very end, similar to [0].

With the updated patch, we observed significant improvements that handily beat the previous popcount algorithm's performance. No regressions were observed in any scenario:
Platform: Intel Xeon Platinum 8360Y (Icelake), data sizes 1 KiB - 64 KiB.
Microbenchmark: 2x-3x gains now vs. 19% previously, on the same microbenchmark described initially in this thread.
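The accumulate-then-reduce idea described above can be sketched as follows (my illustration, not the attached patch; popcount_accum is a hypothetical name, and without AVX512 build flags the code falls back to a scalar loop):

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Accumulate per-lane popcounts in a zmm register and reduce once at
 * the end, removing the per-iteration scalar dependency. */
uint64_t popcount_accum(const uint64_t *words, size_t n)
{
    uint64_t result = 0;
    size_t i = 0;
#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512F__)
    __m512i accum = _mm512_setzero_si512();
    for (; i + 8 <= n; i += 8)
    {
        __m512i v = _mm512_loadu_si512((const void *) &words[i]);
        accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(v));
    }
    result = (uint64_t) _mm512_reduce_add_epi64(accum);
#endif
    for (; i < n; i++)      /* tail, or the full buffer without AVX512 */
        result += (uint64_t) __builtin_popcountll(words[i]);
    return result;
}
```

The earlier version reduced each vector to a scalar inside the loop, which serializes on the scalar add; keeping the running sum in a vector register lets the adds pipeline.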

PG testing:
SQL bit_count() calls popcount. Using a Postgres benchmark calling "select bit_count(bytea(col1)) from mytable" on a table with ~2M text rows, each row 1-12 KiB in size, we observe (comparing only with the 64-bit PG implementation, which is the fastest):

1. The entire benchmark runs 6-13% faster with the AVX512 implementation than with the PG 64-bit implementation.
2. Reduced time spent in the pg_popcount() method in the postgres server during the benchmark:
o 64-bit (current PG): 29.5%
o AVX512: 3.3%
3. Reduced number of profile samples attributed to popcount:
o 64-bit (current PG): 2.4B samples
o AVX512: 285M samples

Compile the above patch (on a machine supporting AVX512 VPOPCNTDQ) using: make all CFLAGS_AVX512="-DHAVE__HW_AVX512_POPCNT -mavx -mavx512vpopcntdq -mavx512f -march=native"
Attaching flamegraphs and the patch for the above observations.

[0]: /messages/by-id/20230726043707.GB3211130@nathanxps13

Thanks,
Akash Shankaran

-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Wednesday, November 15, 2023 1:49 PM
To: Shankaran, Akash <akash.shankaran@intel.com>
Cc: Noah Misch <noah@leadboat.com>; Amonson, Paul D <paul.d.amonson@intel.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

On Wed, Nov 15, 2023 at 08:27:57PM +0000, Shankaran, Akash wrote:

AVX512 has light and heavy instructions. While the heavy AVX512
instructions have clock frequency implications, the light instructions
not so much. See [0] for more details. We captured EMON data for the
benchmark used in this work, and see that the instructions are using
the licensing level not meant for heavy AVX512 operations. This means
the instructions for popcount : _mm512_popcnt_epi64(),
_mm512_reduce_add_epi64() are not going to have any significant impact
on CPU clock frequency.

Clock frequency impact aside, we measured the same benchmark for gains
on older Intel hardware and observe up to 18% better performance on
Intel Icelake. On older intel hardware, the popcntdq 512 instruction
is not present so it won’t work. If clock frequency is not affected,
rest of workload should not be impacted in the case of mixed workloads.

Thanks for sharing your analysis.

Testing this on smaller block sizes < 8KiB shows that AVX512 compared
to the current 64bit behavior shows slightly lower performance, but
with a large variance. We cannot conclude much from it. The testing
with ANALYZE benchmark by Nathan also points to no visible impact as a
result of using AVX512. The gains on larger dataset is easily evident,
with less variance.

What are your thoughts if we introduce AVX512 popcount for smaller
sizes as an optional feature initially, and then test it more
thoroughly over time on this particular use case?

I don't see any need to rush this. At the very earliest, this feature would go into v17, which doesn't enter feature freeze until April 2024.
That seems like enough time to complete any additional testing you'd like to do. However, if you are seeing worse performance with this patch, then it seems unlikely that we'd want to proceed.

Thoughts or feedback on the approach in the patch? This solution
should not impact anyone who doesn’t use the feature i.e. AVX512. Open
to additional ideas if this doesn’t seem like the right approach here.

It's true that it wouldn't impact anyone not using the feature, but there's also a decent chance that this code goes virtually untested. As I've stated elsewhere [0], I think we should ensure there's buildfarm coverage for this kind of architecture-specific stuff.

[0]: /messages/by-id/20230726043707.GB3211130@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

perf-avx512-1.8mrows.svg (application/octet-stream)
perf-with-64bit-1.8m.svg (application/octet-stream)
popcount_avx512.patch (application/octet-stream) +27-0
#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Shankaran, Akash (#11)
Re: Popcount optimization using AVX512

On 2024-Jan-25, Shankaran, Akash wrote:

With the updated patch, we observed significant improvements and
handily beat the previous popcount algorithm performance. No
regressions in any scenario are observed:
Platform: Intel Xeon Platinum 8360Y (Icelake) for data sizes 1kb - 64kb.
Microbenchmark: 2x - 3x gains presently vs 19% previously, on the same
microbenchmark described initially in this thread.

These are great results.

However, it would be much better if the improved code were available for
all relevant builds and activated if a CPUID test determines that the
relevant instructions are available, instead of requiring a compile-time
flag -- which most builds are not going to use, thus wasting the
opportunity for running the optimized code.

I suppose this would require patching pg_popcount64_choose() to be more
specific. Looking at the existing code, I would also consider renaming
the "_fast" variants to something like pg_popcount32_asml/
pg_popcount64_asmq so that you can name the new one pg_popcount64_asmdq
or such. (Or maybe leave the 32-bit version alone as "fast/slow", since
there's no third option for that one -- or do I misread?)

I also think this needs to move the CFLAGS-decision-making elsewhere;
asking the user to get it right is too much of a burden. Is it workable
to simply verify compiler support for the additional flags needed, and
if so add them to a new CFLAGS_BITUTILS variable or such? We already
have the CFLAGS_CRC model that should be easy to follow. Should be easy
enough to mostly copy what's in configure.ac and meson.build, right?

Finally, the matter of using ifunc as proposed by Noah seems to be still
in the air, with no patches offered for the popcount family. Given that
Nathan reports [1] a performance decrease, maybe we should set that
thought aside for now and continue to use function pointers. It's worth
keeping in mind that popcount is already using function pointers (at
least in the case where we try to use POPCNT directly), so patching to
select between three options instead of between two wouldn't be a
regression.

[1]: /messages/by-id/20231107201441.GA898662@nathanxps13

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre)

#13 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#12)
Re: Popcount optimization using AVX512

On 2024-Jan-25, Alvaro Herrera wrote:

Finally, the matter of using ifunc as proposed by Noah seems to be still
in the air, with no patches offered for the popcount family.

Oh, I just realized that the patch as currently proposed is placing the
optimized popcount code in the path that does not require going through
a function pointer. So the performance increase is probably coming from
both avoiding jumping through the pointer as well as from the improved
instruction.

This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
/messages/by-id/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net

#14 Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Alvaro Herrera (#12)
RE: Popcount optimization using AVX512

Hi All,

However, it would be much better if the improved code were available for
all relevant builds and activated if a CPUID test determines that the
relevant instructions are available, instead of requiring a compile-time
flag -- which most builds are not going to use, thus wasting the
opportunity for running the optimized code.

This makes sense. I addressed the feedback and am attaching an updated patch. The patch also addresses your autoconf feedback by adding CFLAGS support. I tested the runtime check for AVX512 on multiple processors, with and without AVX512, and it detected (or failed to detect) the feature as expected.

Looking at the existing code, I would also consider renaming
the "_fast" variants to something like pg_popcount32_asml/
pg_popcount64_asmq so that you can name the new one pg_popcount64_asmdq
or such.

I left out the renaming, as it made sense to keep the fast/slow naming for readability.

Finally, the matter of using ifunc as proposed by Noah seems to be still
in the air, with no patches offered for the popcount family. Given that
Nathan reports [1] a performance decrease, maybe we should set that
thought aside for now and continue to use function pointers.

Since there are improvements without it (results below), I agree with you to continue using function pointers.

I collected data on machines with and without AVX512 support, using a table with 1M rows and performing SQL bit_count() on a char column containing 84-byte, 4 KiB, 8 KiB, and 16 KiB values.
* On non-AVX512 hardware: no runtime regression or impact between the patched and unpatched servers, even with AVX512 support built into the binary.
* On AVX512 hardware: the maximum improvement I saw was 17%, averaging closer to 6.5% on a bare-metal machine. The benefit is lower on smaller cloud VMs on AWS (1-3%).

If the patch looks good, please suggest next steps on committing it.

Paul

-----Original Message-----
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Sent: Thursday, January 25, 2024 1:49 AM
To: Shankaran, Akash <akash.shankaran@intel.com>
Cc: Nathan Bossart <nathandbossart@gmail.com>; Noah Misch <noah@leadboat.com>; Amonson, Paul D <paul.d.amonson@intel.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

On 2024-Jan-25, Shankaran, Akash wrote:

With the updated patch, we observed significant improvements and
handily beat the previous popcount algorithm performance. No
regressions in any scenario are observed:
Platform: Intel Xeon Platinum 8360Y (Icelake) for data sizes 1kb - 64kb.
Microbenchmark: 2x - 3x gains presently vs 19% previously, on the same
microbenchmark described initially in this thread.

These are great results.

However, it would be much better if the improved code were available for all relevant builds and activated if a CPUID test determines that the relevant instructions are available, instead of requiring a compile-time flag -- which most builds are not going to use, thus wasting the opportunity for running the optimized code.

I suppose this would require patching pg_popcount64_choose() to be more specific. Looking at the existing code, I would also consider renaming the "_fast" variants to something like pg_popcount32_asml/ pg_popcount64_asmq so that you can name the new one pg_popcount64_asmdq or such. (Or maybe leave the 32-bit version alone as "fast/slow", since there's no third option for that one -- or do I misread?)

I also think this needs to move the CFLAGS-decision-making elsewhere; asking the user to get it right is too much of a burden. Is it workable to simply verify compiler support for the additional flags needed, and if so add them to a new CFLAGS_BITUTILS variable or such? We already have the CFLAGS_CRC model that should be easy to follow. Should be easy enough to mostly copy what's in configure.ac and meson.build, right?
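Following the CFLAGS_CRC model mentioned above, the configure.ac side might look roughly like this. This is a hedged sketch only; the variable name CFLAGS_AVX512_POPCNT is illustrative, and the probe body mirrors the intrinsics used earlier in the thread:

```
# Sketch modeled on the CFLAGS_CRC checks: probe whether the compiler
# accepts the AVX-512 flags and can link the intrinsics, and if so
# collect the flags in their own variable instead of the global CFLAGS.
CFLAGS_AVX512_POPCNT=""
pgac_save_CFLAGS=$CFLAGS
CFLAGS="$pgac_save_CFLAGS -mavx512vpopcntdq -mavx512f"
AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <immintrin.h>]],
  [[__m512i val = _mm512_setzero_si512();
    return (int) _mm512_reduce_add_epi64(_mm512_popcnt_epi64(val));]])],
  [CFLAGS_AVX512_POPCNT="-mavx512vpopcntdq -mavx512f"])
CFLAGS=$pgac_save_CFLAGS
AC_SUBST(CFLAGS_AVX512_POPCNT)
```

The per-file CFLAGS would then be applied in src/port/Makefile the same way CFLAGS_CRC is.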

Finally, the matter of using ifunc as proposed by Noah seems to be still in the air, with no patches offered for the popcount family. Given that Nathan reports [1] a performance decrease, maybe we should set that thought aside for now and continue to use function pointers. It's worth keeping in mind that popcount is already using function pointers (at least in the case where we try to use POPCNT directly), so patching to select between three options instead of between two wouldn't be a regression.

[1]: /messages/by-id/20231107201441.GA898662@nathanxps13

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre)

Attachments:

Official-PostgreSQL-AVX-512-POPCNT.patch (application/octet-stream, +256 -21)
#15Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amonson, Paul D (#14)
Re: Popcount optimization using AVX512

Hello,

This looks quite reasonable. On my machine, I get the compiler test to
pass so I get a "yes" in configure; but of course my CPU doesn't support
the instructions so I get the slow variant. So here's the patch again
with some minor artifacts fixed.

I have the following review notes:

1. we use __get_cpuid_count and __cpuidex by relying on macros
HAVE__GET_CPUID and HAVE__CPUID respectively; but those macros are (in
the current Postgres source) only used and tested for __get_cpuid and
__cpuid respectively. So unless there's some reason to be certain that
__get_cpuid_count is always present when __get_cpuid is present, and
that __cpuidex is present when __cpuid is present, I think we need to
add new configure tests and new HAVE_ macros for these.

2. we rely on <immintrin.h> being present with no AC_CHECK_HEADER()
test. We currently don't use this header anywhere, so I suppose we need
a test for this one as well. (Also, I suppose if we don't have
immintrin.h we can skip the rest of it?)

3. We do the __get_cpuid_count/__cpuidex test and we also do a xgetbv
test. The comment there claims that this is to check the results for
consistency. But ... how would we know that the results are ever
inconsistent? As far as I understand, if they were, we would silently
become slower. Is this really what we want? I'm confused about this
coding. Maybe we do need both tests to succeed? In that case, just
reword the comment.

I think if both tests are each considered reliable on its own, then we
could either choose one of them and stick with it, ignoring the other;
or we could use one as primary and then in a USE_ASSERT_CHECKING block
verify that the other matches and throw a WARNING if not (but what would
that tell us?). Or something like that ... not sure.

4. It needs meson support, which I suppose consists of copying the
c-compiler.m4 test into meson.build, mimicking what the tests for CRC
instructions do.

I started a CI run with this patch applied,
https://cirrus-ci.com/build/4912499619790848
but because Meson support is missing, the compile failed
immediately:

[10:08:48.825] ccache cc -Isrc/port/libpgport_srv.a.p -Isrc/include -I../src/include -Isrc/include/utils -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -g -fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -Wdeclaration-after-statement -Wno-format-truncation -Wno-stringop-truncation -fPIC -pthread -DBUILDING_DLL -MD -MQ src/port/libpgport_srv.a.p/pg_bitutils.c.o -MF src/port/libpgport_srv.a.p/pg_bitutils.c.o.d -o src/port/libpgport_srv.a.p/pg_bitutils.c.o -c ../src/port/pg_bitutils.c
[10:08:48.825] ../src/port/pg_bitutils.c: In function ‘pg_popcount512_fast’:
[10:08:48.825] ../src/port/pg_bitutils.c:270:11: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~
[10:08:48.825] In file included from /usr/lib/gcc/x86_64-linux-gnu/10/include/immintrin.h:55,
[10:08:48.825] from ../src/port/pg_bitutils.c:22:
[10:08:48.825] /usr/lib/gcc/x86_64-linux-gnu/10/include/avx512fintrin.h:339:1: error: inlining failed in call to ‘always_inline’ ‘_mm512_setzero_si512’: target specific option mismatch
[10:08:48.825] 339 | _mm512_setzero_si512 (void)
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~
[10:08:48.825] ../src/port/pg_bitutils.c:270:25: note: called from here
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~~~

Thanks

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)

Attachments:

v3-0001-Add-support-for-AVX512-implemented-POPCNT.patch (text/x-diff, +247 -14)
#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#15)
Re: Popcount optimization using AVX512

I happened to notice by chance that John Naylor had posted an extension
to measure performance of popcount here:
/messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

This might be useful as a base for a new one to verify the results of
the proposed patch in machines with relevant instruction support.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"We're here to devour each other alive" (Hobbes)

#17Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Alvaro Herrera (#15)
RE: Popcount optimization using AVX512

Álvaro,

All feedback is now addressed. I added the additional checks for the new APIs and a separate check for the header to autoconf.

Regarding the double check for AVX512: I added a large comment explaining why both are needed. There are cases where the CPU's ZMM registers are not exposed by the OS or hypervisor even though the CPU supports AVX512.

The big change is adding all old and new build support to meson. I am new to meson/ninja so please review carefully.

Thanks,
Paul

-----Original Message-----
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Sent: Wednesday, February 7, 2024 2:13 AM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Shankaran, Akash <akash.shankaran@intel.com>; Nathan Bossart <nathandbossart@gmail.com>; Noah Misch <noah@leadboat.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

Hello,

This looks quite reasonable. On my machine, I get the compiler test to pass so I get a "yes" in configure; but of course my CPU doesn't support the instructions so I get the slow variant. So here's the patch again with some minor artifacts fixed.

I have the following review notes:

1. we use __get_cpuid_count and __cpuidex by relying on macros HAVE__GET_CPUID and HAVE__CPUID respectively; but those macros are (in the current Postgres source) only used and tested for __get_cpuid and __cpuid respectively. So unless there's some reason to be certain that __get_cpuid_count is always present when __get_cpuid is present, and that __cpuidex is present when __cpuid is present, I think we need to add new configure tests and new HAVE_ macros for these.

2. we rely on <immintrin.h> being present with no AC_CHECK_HEADER() test. We currently don't use this header anywhere, so I suppose we need a test for this one as well. (Also, I suppose if we don't have immintrin.h we can skip the rest of it?)

3. We do the __get_cpuid_count/__cpuidex test and we also do a xgetbv test. The comment there claims that this is to check the results for consistency. But ... how would we know that the results are ever inconsistent? As far as I understand, if they were, we would silently become slower. Is this really what we want? I'm confused about this coding. Maybe we do need both tests to succeed? In that case, just reword the comment.

I think if both tests are each considered reliable on its own, then we could either choose one of them and stick with it, ignoring the other; or we could use one as primary and then in a USE_ASSERT_CHECKING block verify that the other matches and throw a WARNING if not (but what would that tell us?). Or something like that ... not sure.

4. It needs meson support, which I suppose consists of copying the
c-compiler.m4 test into meson.build, mimicking what the tests for CRC instructions do.

I started a CI run with this patch applied,
https://cirrus-ci.com/build/4912499619790848
but because Meson support is missing, the compile failed
immediately:

[10:08:48.825] ccache cc -Isrc/port/libpgport_srv.a.p -Isrc/include -I../src/include -Isrc/include/utils -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -g -fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -Wdeclaration-after-statement -Wno-format-truncation -Wno-stringop-truncation -fPIC -pthread -DBUILDING_DLL -MD -MQ src/port/libpgport_srv.a.p/pg_bitutils.c.o -MF src/port/libpgport_srv.a.p/pg_bitutils.c.o.d -o src/port/libpgport_srv.a.p/pg_bitutils.c.o -c ../src/port/pg_bitutils.c
[10:08:48.825] ../src/port/pg_bitutils.c: In function ‘pg_popcount512_fast’:
[10:08:48.825] ../src/port/pg_bitutils.c:270:11: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~
[10:08:48.825] In file included from /usr/lib/gcc/x86_64-linux-gnu/10/include/immintrin.h:55,
[10:08:48.825] from ../src/port/pg_bitutils.c:22:
[10:08:48.825] /usr/lib/gcc/x86_64-linux-gnu/10/include/avx512fintrin.h:339:1: error: inlining failed in call to ‘always_inline’ ‘_mm512_setzero_si512’: target specific option mismatch
[10:08:48.825] 339 | _mm512_setzero_si512 (void)
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~
[10:08:48.825] ../src/port/pg_bitutils.c:270:25: note: called from here
[10:08:48.825] 270 | __m512i accumulator = _mm512_setzero_si512();
[10:08:48.825] | ^~~~~~~~~~~~~~~~~~~~~~

Thanks

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)

Attachments:

v4-0001-Add-support-for-AVX512-implemented-POPCNT.patch (application/octet-stream, +476 -18)
#18Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#13)
Re: Popcount optimization using AVX512

Hi,

On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:

This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.

Ifuncs are effectively implemented as a function call via a pointer, they're
not magic, unfortunately. The sole trick they provide is that you don't
manually have to use the function pointer.

Greetings,

Andres

#19Andres Freund
andres@anarazel.de
In reply to: Amonson, Paul D (#17)
Re: Popcount optimization using AVX512

Hi,

On 2024-02-09 17:39:46 +0000, Amonson, Paul D wrote:

diff --git a/meson.build b/meson.build
index 8ed51b6aae..1e7a4dc942 100644
--- a/meson.build
+++ b/meson.build
@@ -1773,6 +1773,45 @@ elif cc.links('''
endif
+# XXX: The configure.ac check for __cpuidex() is broken, we don't copy that
+# here. To prevent problems due to two detection methods working, stop
+# checking after one.

This seems like a bogus copy-paste.

+if cc.links('''
+    #include <cpuid.h>
+    int main(int arg, char **argv)
+    {
+        unsigned int exx[4] = {0, 0, 0, 0};
+        __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+    }
+    ''', name: '__get_cpuid_count',
+    args: test_c_args)
+  cdata.set('HAVE__GET_CPUID_COUNT', 1)
+elif cc.links('''
+    #include <intrin.h>
+    int main(int arg, char **argv)
+    {
+        unsigned int exx[4] = {0, 0, 0, 0};
+        __cpuidex(exx, 7, 0);
+    }
+    ''', name: '__cpuidex',
+    args: test_c_args)
+  cdata.set('HAVE__CPUIDEX', 1)
+endif
+
+
+# Check for header immintrin.h
+if cc.links('''
+    #include <immintrin.h>
+    int main(int arg, char **argv)
+    {
+      return 1701;
+    }
+    ''', name: '__immintrin',
+    args: test_c_args)
+  cdata.set('HAVE__IMMINTRIN', 1)
+endif

Do these all actually have to link? Invoking the linker is slow.

I think you might be able to just use cc.has_header_symbol().
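That suggestion might look like the following in meson.build (a sketch only, not the committed form; it reuses the HAVE_ names from the quoted patch):

```meson
# probe the header/symbol directly instead of invoking the linker
if cc.has_header_symbol('cpuid.h', '__get_cpuid_count', args: test_c_args)
  cdata.set('HAVE__GET_CPUID_COUNT', 1)
elif cc.has_header_symbol('intrin.h', '__cpuidex', args: test_c_args)
  cdata.set('HAVE__CPUIDEX', 1)
endif

if cc.has_header('immintrin.h', args: test_c_args)
  cdata.set('HAVE__IMMINTRIN', 1)
endif
```

cc.has_header_symbol() and cc.has_header() only run the preprocessor/compiler, so they are considerably faster than cc.links().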

+###############################################################
+# AVX 512 POPCNT Intrinsic check
+###############################################################
+have_avx512_popcnt = false
+cflags_avx512_popcnt = []
+if host_cpu == 'x86_64'
+  prog = '''
+      #include <immintrin.h>
+      #include <stdint.h>
+      void main(void)
+      {
+        __m512i tmp __attribute__((aligned(64)));
+        __m512i input = _mm512_setzero_si512();
+        __m512i output = _mm512_popcnt_epi64(input);
+        uint64_t cnt = 999;
+        _mm512_store_si512(&tmp, output);
+        cnt = _mm512_reduce_add_epi64(tmp);
+        /* return computed value, to prevent the above being optimized away */
+        return cnt == 0;
+      }'''

Does this work with msvc?

+ if cc.links(prog, name: '_mm512_setzero_si512, _mm512_popcnt_epi64, _mm512_store_si512, and _mm512_reduce_add_epi64 with -mavx512vpopcntdq -mavx512f',

That's a very long line in the output, how about using the avx feature name or
something?

diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..6a01a7d89a 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -87,6 +87,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)
+pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+pg_bitutils_srv.o:CFLAGS+=$(CFLAGS_AVX512_POPCNT)
+
# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
diff --git a/src/port/meson.build b/src/port/meson.build
index 69b30ab21b..1c48a3b07e 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -184,6 +184,7 @@ foreach name, opts : pgport_variants
link_with: cflag_libs,
c_pch: pch_c_h,
kwargs: opts + {
+        'c_args': opts.get('c_args', []) + cflags_avx512_popcnt,
'dependencies': opts['dependencies'] + [ssl],
}
)

This will build all of pgport with the avx flags, which wouldn't be correct, I
think? The compiler might inject automatic uses of avx512 in places, which
would cause problems, no?

While you don't do the same for make, isn't even just using the avx512 flags for all
of pg_bitutils.c broken for exactly that reason? That's why the existing code
builds the files for the various crc variants as their own files.

Greetings,

Andres Freund

#20Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#18)
Re: Popcount optimization using AVX512

On Fri, Feb 09, 2024 at 10:24:32AM -0800, Andres Freund wrote:

On 2024-01-26 07:42:33 +0100, Alvaro Herrera wrote:

This suggests that finding a way to make the ifunc stuff work (with good
performance) is critical to this work.

Ifuncs are effectively implemented as a function call via a pointer, they're
not magic, unfortunately. The sole trick they provide is that you don't
manually have to use the function pointer.

The IFUNC creators introduced it so glibc could use arch-specific memcpy with
the instruction sequence of a non-pointer, extern function call, not the
instruction sequence of a function pointer call. I don't know why the
upthread ifunc_test.patch benchmark found ifunc performing worse than function
pointers. However, it would be odd if toolchains have replaced the original
IFUNC with something equivalent to or slower than function pointers.

#21Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#20)
#22Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#21)
#23Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Andres Freund (#19)
#24Andres Freund
andres@anarazel.de
In reply to: Amonson, Paul D (#23)
#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Noah Misch (#22)
#26Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Andres Freund (#24)
#27Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#26)
#28Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#27)
#29Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#28)
#30Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#29)
#31Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#30)
#32Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#31)
#33Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#32)
#34Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#33)
#35Bruce Momjian
bruce@momjian.us
In reply to: Amonson, Paul D (#34)
#36Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#34)
#37Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amonson, Paul D (#30)
#38Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#37)
#39Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#36)
#40Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#39)
#41Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#40)
#42Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#41)
#43Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#42)
#44Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#43)
#45Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#41)
#46Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#45)
#47Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#46)
#48Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#47)
#49David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#46)
#50Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#49)
#51Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#50)
#52Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#51)
#53Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#52)
#54Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#52)
#55Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#53)
#56Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#54)
#57David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#54)
#58Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#57)
#59Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#58)
#60Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#59)
#61David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#58)
#62Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#61)
#63Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#62)
#64David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#63)
#65Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#64)
#66Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#65)
#67David Rowley
dgrowleyml@gmail.com
In reply to: Amonson, Paul D (#66)
#68Amonson, Paul D
paul.d.amonson@intel.com
In reply to: David Rowley (#67)
#69David Rowley
dgrowleyml@gmail.com
In reply to: Amonson, Paul D (#66)
#70Amonson, Paul D
paul.d.amonson@intel.com
In reply to: David Rowley (#69)
#71Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#70)
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amonson, Paul D (#71)
#73Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Tom Lane (#72)
#74Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#72)
#75Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#73)
#76Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#75)
#77Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#76)
#78Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#77)
#79Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#78)
#80Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#79)
#81Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#79)
#82Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amonson, Paul D (#81)
#83Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Amonson, Paul D (#81)
#84Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#81)
#85Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#82)
#86Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#83)
#87Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#84)
#88Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#87)
#89Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#86)
#90Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#88)
#91Shankaran, Akash
akash.shankaran@intel.com
In reply to: Nathan Bossart (#88)
#92Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#90)
#93Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#85)
#94Amonson, Paul D
paul.d.amonson@intel.com
In reply to: Nathan Bossart (#89)
#95Nathan Bossart
nathandbossart@gmail.com
In reply to: Amonson, Paul D (#94)
#96Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#95)
#97Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#96)
#98Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#97)
#99Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#98)
#100Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#99)
#101Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nathan Bossart (#100)
#102Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#101)
#103Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#102)
#104Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#103)
#105Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#104)
#106Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#104)
#107Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#105)
#108Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#106)
#109Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Nathan Bossart (#108)
#110Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#109)
#111Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#110)
#112Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#104)
#113Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#111)
#114Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#113)
#115Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#114)
#116Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#115)
#117Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#116)
#118Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#112)
#119David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#118)
#120Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#118)
#121Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#119)
#122Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#120)
#123Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#122)
#124Ants Aasma
ants.aasma@cybertec.at
In reply to: Nathan Bossart (#123)
#125Nathan Bossart
nathandbossart@gmail.com
In reply to: Ants Aasma (#124)
#126Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#125)
#127David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#126)
#128Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#127)
#129David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#128)
#130Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#129)
#131Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#130)
#132Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#130)
#133Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#132)
#134Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#133)
#135Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#134)
#136Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#135)
#137Shankaran, Akash
akash.shankaran@intel.com
In reply to: Nathan Bossart (#136)
#138Nathan Bossart
nathandbossart@gmail.com
In reply to: Shankaran, Akash (#137)
#139Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#129)
#140Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#139)
#141Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#140)
#142Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#141)
#143Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#142)
#144Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#143)
#145Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#144)
#146Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#145)
#147Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#146)
#148Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#146)
#149Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#148)
#150Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#148)
#151Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#150)
#152Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#151)
#153Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#152)
#154Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#153)
#155Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#154)
#156Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#155)
#157Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#156)
#158Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#157)
#159Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#158)
#160Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#159)
#161Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Devulapalli, Raghuveer (#160)
#162Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#161)
#163Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#162)
#164Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#162)
#165Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#164)
#166Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#165)
#167Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#166)
#168Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#167)
#169Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#168)
#170Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#169)
#171Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#170)
#172Devulapalli, Raghuveer
raghuveer.devulapalli@intel.com
In reply to: Nathan Bossart (#171)
#173Nathan Bossart
nathandbossart@gmail.com
In reply to: Devulapalli, Raghuveer (#172)