Add RISC-V Zbb popcount optimization

Started by Greg Burd, 2 months ago, 20 messages, hackers
#1Greg Burd
greg@burd.me

Hello.

Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' when appropriate.

best.

-greg

Attachments:

v1-0001-Add-RISC-V-Zbb-popcount-optimization.patch (text/x-patch, +91 -3)
#2Andres Freund
andres@anarazel.de
In reply to: Greg Burd (#1)
Re: Add RISC-V Zbb popcount optimization

Hi,

On 2026-03-21 12:54:10 -0400, Greg Burd wrote:

Attached is a small patch that enables hardware popcount on RISC-V when
available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.

Maybe I'm missing something: How is the latter approach safe without a runtime
check? Just because it compiled on the build machine with -march=rv64gc_zbb
added doesn't mean it runs on either the build machine or any other machine?

If this worked, the compiler could just always specify -march=rv64gc_zbb, no?

Greetings,

Andres Freund

#3Greg Burd
greg@burd.me
In reply to: Andres Freund (#2)
Re: Add RISC-V Zbb popcount optimization

On Sat, Mar 21, 2026, at 2:36 PM, Andres Freund wrote:

Hi,

On 2026-03-21 12:54:10 -0400, Greg Burd wrote:

Attached is a small patch that enables hardware popcount on RISC-V when
available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.

Maybe I'm missing something: How is the latter approach safe without a runtime
check? Just because it compiled on the build machine with -march=rv64gc_zbb
added doesn't mean it runs on either the build machine or any other machine?

If this worked, the compiler could just always specify -march=rv64gc_zbb, no?

Hey Andres, thanks for taking a look.

You are correct, mea culpa for not catching this before I sent it out. If the second test succeeds the patch will add `-march=rv64gc_zbb` to `CFLAGS` globally, which means without the runtime check the binary will crash with SIGILL on systems without Zbb.
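
A minimal sketch of the runtime-check pattern under discussion, not the code from the patch: a function pointer starts at a chooser that probes the CPU once and then routes every later call to either a Zbb or a portable implementation. The names `popcount64`, `popcount64_portable`, `popcount64_zbb` and `cpu_has_zbb` are hypothetical, and `cpu_has_zbb` is a stub that always returns false. A real build would ask the OS (on Linux, for example, through the `riscv_hwprobe` interface) and would enable Zbb only for the one function that needs it rather than adding `-march=rv64gc_zbb` globally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable fallback: clears the lowest set bit until none remain. */
static int
popcount64_portable(uint64_t x)
{
	int			n = 0;

	while (x != 0)
	{
		x &= x - 1;
		n++;
	}
	return n;
}

/*
 * Stand-in for the Zbb variant.  In a real build only this function would be
 * compiled with Zbb enabled; here it just uses the GCC/Clang builtin.
 */
static int
popcount64_zbb(uint64_t x)
{
	return __builtin_popcountll(x);
}

/* Hypothetical CPU probe; always reports "no Zbb" in this sketch. */
static bool
cpu_has_zbb(void)
{
	return false;
}

static int	popcount64_choose(uint64_t x);

/* All callers go through this pointer. */
static int	(*popcount64) (uint64_t x) = popcount64_choose;

/* First call only: pick an implementation, then forward to it. */
static int
popcount64_choose(uint64_t x)
{
	popcount64 = cpu_has_zbb() ? popcount64_zbb : popcount64_portable;
	assert(popcount64 != popcount64_choose);
	return popcount64(x);
}
```

Callers simply write `popcount64(value)`. The cost is one indirect call per use, which is the dispatch overhead that matters for short inputs.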

I'll rework... :)

Greetings,

Andres Freund

best.

-greg

#4John Naylor
john.naylor@enterprisedb.com
In reply to: Greg Burd (#1)
Re: Add RISC-V Zbb popcount optimization

On Sat, Mar 21, 2026 at 11:56 PM Greg Burd <greg@burd.me> wrote:

Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.

I have to ask what the point is -- isn't that like putting a 4-inch
exhaust tip on a go-kart?

--
John Naylor
Amazon Web Services

#5Greg Burd
greg@burd.me
In reply to: John Naylor (#4)
Re: Add RISC-V Zbb popcount optimization

On Sat, Mar 21, 2026, at 10:14 PM, John Naylor wrote:

On Sat, Mar 21, 2026 at 11:56 PM Greg Burd <greg@burd.me> wrote:

Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.

I have to ask what the point is -- isn't that like putting a 4-inch
exhaust tip on a go-kart?

Hey John,

The point is to go fast, right? And to look cool (with awesome 4-inch exhaust tips) if possible! ;-P

gburd@rv:~/ws/postgres$ gcc -O2 -o popcnt-wo-zbb riscv-popcnt.c
gburd@rv:~/ws/postgres$ gcc -O2 -march=rv64gc_zbb -o popcnt-zbb riscv-popcnt.c
gburd@rv:~/ws/postgres$ ./popcnt-wo-zbb && ./popcnt-zbb
sw popcount: 0.196 sec ( 510.08 MB/s)
hw popcount: 0.293 sec ( 341.48 MB/s)

diff: 0.67x
match: 406261900 bits counted
sw popcount: 0.182 sec ( 548.86 MB/s)
hw popcount: 0.044 sec ( 2279.89 MB/s)

diff: 4.15x
match: 406261900 bits counted

But my first email/patch was incomplete/rushed, I should have followed the pattern used for similar ARM-specific logic. v2 attached along with a test program.
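
For readers without the attachment, the two code paths being compared can be approximated as below. This is a sketch, not the attached riscv-popcnt.c: `popcount64_swar` and `popcount_buffer` are hypothetical names, and the software path is the classic SWAR bit-counting method, which may differ from what the test program measures. `__builtin_popcountll` is the GCC/Clang builtin; whether it compiles to a single instruction depends on the `-march` flags in effect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SWAR popcount: sums bits in 2-, 4-, then 8-bit groups, then adds the bytes. */
static inline int
popcount64_swar(uint64_t x)
{
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
	return (int) ((x * 0x0101010101010101ULL) >> 56);
}

/* Counts set bits in a buffer, 8 bytes at a time, then the remaining tail. */
static uint64_t
popcount_buffer(const unsigned char *buf, size_t len, int use_builtin)
{
	uint64_t	total = 0;

	assert(buf != NULL || len == 0);

	while (len >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* avoids unaligned loads */
		if (use_builtin)
			total += (uint64_t) __builtin_popcountll(word);
		else
			total += (uint64_t) popcount64_swar(word);
		buf += 8;
		len -= 8;
	}

	while (len-- > 0)
		total += (uint64_t) popcount64_swar(*buf++);

	return total;
}
```

Both settings of `use_builtin` must return the same count for the same buffer, which is what the "match" line in the output above checks.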

--
John Naylor
Amazon Web Services

best.

-greg

Attachments:

riscv-popcnt.c (text/x-csrc)
v2-0001-Add-RISC-V-Zbb-popcount-optimization.patch (text/x-patch, +229 -9)
#6Andres Freund
andres@anarazel.de
In reply to: Greg Burd (#5)
Re: Add RISC-V Zbb popcount optimization

Hi,

On 2026-03-22 13:43:43 -0400, Greg Burd wrote:

On Sat, Mar 21, 2026, at 10:14 PM, John Naylor wrote:

On Sat, Mar 21, 2026 at 11:56 PM Greg Burd <greg@burd.me> wrote:

Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.

I have to ask what the point is -- isn't that like putting a 4-inch
exhaust tip on a go-kart?

The point is to go fast, right? And to look cool (with awesome 4-inch exhaust tips) if possible! ;-P

gburd@rv:~/ws/postgres$ gcc -O2 -o popcnt-wo-zbb riscv-popcnt.c
gburd@rv:~/ws/postgres$ gcc -O2 -march=rv64gc_zbb -o popcnt-zbb riscv-popcnt.c
gburd@rv:~/ws/postgres$ ./popcnt-wo-zbb && ./popcnt-zbb
sw popcount: 0.196 sec ( 510.08 MB/s)
hw popcount: 0.293 sec ( 341.48 MB/s)

diff: 0.67x
match: 406261900 bits counted
sw popcount: 0.182 sec ( 548.86 MB/s)
hw popcount: 0.044 sec ( 2279.89 MB/s)

diff: 4.15x
match: 406261900 bits counted

But my first email/patch was incomplete/rushed, I should have followed the pattern used for similar ARM-specific logic. v2 attached along with a test program.

Sure, but what PG workloads are actually affected to a meaningful degree by
this? And are those, on riscv, actually most bottlenecked by popcount
performance?

I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
all that effectively - hard to believe there's any real world workloads where
that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
world use of those platforms, making niche-y perf improvements somewhat
worthwhile. Whereas there's afaict not yet a whole lot of riscv production
adoption.

Once you add CPU dispatch to the cost it gets a heck of a lot less clearly
worthwhile. You need heuristics to decide when the dispatch cost is worth it
and even then it's going to slow down your non-worthwhile case somewhat.
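
One way to picture such a heuristic: handle short inputs inline and only pay for the indirect call once the buffer is long enough to amortize it. This is an illustrative sketch only. `DISPATCH_THRESHOLD_BYTES`, `popcount_bytes` and `popcount_bytes_dispatched` are hypothetical, the threshold is arbitrary rather than measured, and the dispatched routine is a plain stand-in for a CPU-specific one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DISPATCH_THRESHOLD_BYTES 64	/* arbitrary value, not measured */

/* Cheap path with no indirection, used for short inputs. */
static uint64_t
popcount_bytes_inline(const unsigned char *buf, size_t len)
{
	uint64_t	total = 0;

	while (len-- > 0)
		total += (uint64_t) __builtin_popcount(*buf++);
	return total;
}

/* Stand-in for whatever implementation runtime detection selected. */
static uint64_t
popcount_bytes_generic(const unsigned char *buf, size_t len)
{
	return popcount_bytes_inline(buf, len);
}

static uint64_t (*popcount_bytes_dispatched) (const unsigned char *, size_t) =
	popcount_bytes_generic;

/* Entry point: skips the function pointer entirely for short buffers. */
static uint64_t
popcount_bytes(const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);

	if (len < DISPATCH_THRESHOLD_BYTES)
		return popcount_bytes_inline(buf, len);
	return popcount_bytes_dispatched(buf, len);
}
```

Even in this form the short-input case pays for an extra length comparison, which is the slowdown of the non-worthwhile case mentioned above.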

That's one of the things that makes riscv's decision to put so many crucial
features into optional extensions so annoying for people that write
non-embedded software.

- Andres

#7Greg Burd
greg@burd.me
In reply to: Andres Freund (#6)
Re: Add RISC-V Zbb popcount optimization

On Sun, Mar 22, 2026, at 2:01 PM, Andres Freund wrote:

Hi,

On 2026-03-22 13:43:43 -0400, Greg Burd wrote:

On Sat, Mar 21, 2026, at 10:14 PM, John Naylor wrote:

On Sat, Mar 21, 2026 at 11:56 PM Greg Burd <greg@burd.me> wrote:

Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.

I have to ask what the point is -- isn't that like putting a 4-inch
exhaust tip on a go-kart?

The point is to go fast, right? And to look cool (with awesome 4-inch exhaust tips) if possible! ;-P

gburd@rv:~/ws/postgres$ gcc -O2 -o popcnt-wo-zbb riscv-popcnt.c
gburd@rv:~/ws/postgres$ gcc -O2 -march=rv64gc_zbb -o popcnt-zbb riscv-popcnt.c
gburd@rv:~/ws/postgres$ ./popcnt-wo-zbb && ./popcnt-zbb
sw popcount: 0.196 sec ( 510.08 MB/s)
hw popcount: 0.293 sec ( 341.48 MB/s)

diff: 0.67x
match: 406261900 bits counted
sw popcount: 0.182 sec ( 548.86 MB/s)
hw popcount: 0.044 sec ( 2279.89 MB/s)

diff: 4.15x
match: 406261900 bits counted

But my first email/patch was incomplete/rushed, I should have followed the pattern used for similar ARM-specific logic. v2 attached along with a test program.

Sure, but what PG workloads are actually affected to a meaningful degree by
this? And are those, on riscv, actually most bottlenecked by popcount
performance?

I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
all that effectively - hard to believe there's any real world workloads where
that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
world use of those platforms, making niche-y perf improvements somewhat
worthwhile. Whereas there's afaict not yet a whole lot of riscv production
adoption.

Once you add CPU dispatch to the cost it gets a heck of a lot less clearly
worthwhile. You need heuristics to decide when the dispatch cost is worth it
and even then it's going to slow down your non-worthwhile case somewhat.

That's one of the things that makes riscv's decision to put so many crucial
features into optional extensions so annoying for people that write
non-embedded software.

Hey Andres,

All fair points. RISC-V is annoying, the idea of CPU extensions is just one reason. To be honest, I'm not sure it is worth it either! That said, this patch isn't a huge "squeeze" (or unprecedented) and it does provide some "juice" (4x faster). It has the shape of the ARM equivalent, so to me it fell into that category of things we'd commit.

But I get it, as I said to start - all fair points.

- Andres

best.

-greg

#8Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#6)
Re: Add RISC-V Zbb popcount optimization

On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:

I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
all that effectively - hard to believe there's any real world workloads where
that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
world use of those platforms, making niche-y perf improvements somewhat
worthwhile. Whereas there's afaict not yet a whole lot of riscv production
adoption.

That work was partially motivated by vector stuff that used popcount
functions pretty heavily, but yeah, the complexity compared to the gains is
the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
and Neon). I'd still consider using AVX-512, etc. for things if the impact
on real-world workloads was huge, though.

--
nathan

#9Greg Burd
greg@burd.me
In reply to: Nathan Bossart (#8)
Re: Add RISC-V Zbb popcount optimization

On Mon, Mar 23, 2026, at 11:09 AM, Nathan Bossart wrote:

On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:

I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
all that effectively - hard to believe there's any real world workloads where
that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
world use of those platforms, making niche-y perf improvements somewhat
worthwhile. Whereas there's afaict not yet a whole lot of riscv production
adoption.

Hey Nathan,

That work was partially motivated by vector stuff that used popcount
functions pretty heavily, but yeah, the complexity compared to the gains is
the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
and Neon). I'd still consider using AVX-512, etc. for things if the impact
on real-world workloads was huge, though.

Yes, that and by research done while trying to understand why my RISC-V build farm animal "greenfly" (OrangePi RV2 with a VisionFive 2 CPU: RISC-V RV64GC + Zba/Zbb/Zbc/Zbs) is failing consistently.

--
nathan

Forgive me, while $subject only mentions popcount I couldn't help myself so I added a few more RISC-V patches including a bug fix that I hope makes greenfly happy again.

0001 - This is a bug fix for DES/RISC-V/Clang DES initialization.

------> Join me in "the rabbit hole" on this issue if you care to...

The existing software DES (as shown by the build-farm animal "greenfly" [1]) fails because Clang 20 has an auto-vectorization bug that we trigger in the DES initialization code (des_init() function), not the DES encryption algorithm itself.

I searched the LLVM issue tracker, here are the issues that caught my eye:
1. Issue #176001 - "RISC-V Wrong code at -O1"
- Vector peephole optimization with vmerge folding
- Fixed by PR #176077 (merged Jan 2024)
- Link: https://github.com/llvm/llvm-project/issues/176001
2. Issue #187458 - "Wrong code for vector.extract.last.active"
- Large index issues with zvl1024b
- Partially fixed, still work ongoing
- Link: https://github.com/llvm/llvm-project/issues/187458
3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
- Illegal instruction from mismatched EEW
- Under investigation
- Link: https://github.com/llvm/llvm-project/issues/171978
4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
- Cost model fixes for scatter/gather (merged Jan 2026)
- Link: https://github.com/llvm/llvm-project/pull/176105

My fix in 0001 is simply adding this in a few places in crypt-des.c:

#if defined(__riscv) && defined(__clang__)
pg_memory_barrier();
#endif
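
To illustrate where such a barrier sits, here is a standalone sketch, not the patch itself. It builds an inverse permutation table the way a DES setup routine might, with a barrier after each store. `COMPILER_BARRIER`, `build_inverse` and the eight-entry table are hypothetical. The barrier is the GCC/Clang empty asm statement with a memory clobber, which is weaker than PostgreSQL's pg_memory_barrier() and is used only to keep the example self-contained. Whether this exact form avoids the miscompilation is untested.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler-only barrier: emits no instruction, but clobbers "memory". */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

#define TABLE_SIZE 8

/* Hypothetical permutation; each value 0..7 appears exactly once. */
static const uint8_t perm[TABLE_SIZE] = {3, 0, 6, 1, 7, 4, 2, 5};

/* Fills inverse[] so that inverse[perm[i]] == i for every i. */
static void
build_inverse(uint8_t inverse[TABLE_SIZE])
{
	for (int i = 0; i < TABLE_SIZE; i++)
	{
		assert(perm[i] < TABLE_SIZE);
		inverse[perm[i]] = (uint8_t) i;

		/*
		 * The compiler must assume memory is read or written here, so it
		 * cannot merge or reorder the stores across iterations.
		 */
		COMPILER_BARRIER();
	}
}
```

The scattered store `inverse[perm[i]] = i` is the kind of loop an auto-vectorizer may turn into vector scatter instructions, which is why the barrier goes inside the loop in this sketch.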

While searching I ran across a different solution: adding `-mllvm -riscv-v-vector-bits-min=0` sets the minimum vector bit width for the RISC-V vector extension in LLVM to 0, which disables all vectorization and forces scalar code generation, so no RVV instructions are emitted. This would prevent the DES bug at the cost of any vectorization anywhere in the binary.

While that might also fix the other intermittent bug we'd been seeing on greenfly (not tested), disabling all RVV optimizations seems too heavy-handed to me.

------> Moving on.

0002 - (was "0001" in v2) this is unchanged, it implements popcount using Zbb extension on RISC-V

0003 - is a small patch adapted from the Google Abseil project's RISC-V CRC32C implementation [1]. It is *a lot faster* than the software crc32c we fall back to now (see: riscv-crc32c.c). This algorithm requires the Zbc (or Zbkc) extension (for clmul), so the patch tests for that at build time and adds the '-march' flag when it is available. However, as is the case for Zbb and popcnt, the presence of Zbc (or Zbkc) must be detected at runtime. That's done following the pre-existing pattern used for ARM features. This does introduce some runtime overhead and complexity, not more than required I hope.
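
As a reference point for what any accelerated version has to reproduce, here is a bitwise CRC-32C. It is a sketch of the slow, portable method, not the Abseil-derived clmul code in the patch and not PostgreSQL's own software fallback; `crc32c_bitwise` is a hypothetical name. The check value for the ASCII string "123456789" is 0xE3069283, the same value the test program prints in its validation line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Castagnoli polynomial 0x1EDC6F41 in bit-reflected form. */
#define CRC32C_POLY_REFLECTED 0x82F63B78U

/* Bitwise CRC-32C: one conditional XOR with the polynomial per input bit. */
static uint32_t
crc32c_bitwise(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFFU;	/* standard initial value */

	assert(p != NULL || len == 0);

	while (len-- > 0)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
		{
			/* mask is all ones when the low bit is set, else zero */
			uint32_t	mask = 0U - (crc & 1U);

			crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & mask);
		}
	}
	return crc ^ 0xFFFFFFFFU;		/* standard final inversion */
}
```

A clmul-based version replaces the inner bit loop with carry-less multiplications over larger chunks of input, but it has to return the same value as this loop for every input.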

I attached test code, and results at the end of this email:
* riscv-popcnt.c - unchanged
* riscv-crc32c.c - new, based on work in the Google Abseil project
* riscv-des.c - highlights the fix for DES using Clang on RISC-V

I guess the question for 0002 and/or 0003 is whether the "juice" is worth the "squeeze" or not. There is a lot of performance juice to be had IMO. But some might argue that RISC-V isn't widely adopted yet, and they'd be right. Others might point out that RISC-V is currently showing up in embedded systems more than server/desktop/laptop/cloud, also true. However, there is some evidence that this is changing, as there are RISC-V servers [2][3], and there is a hosted (cloud) solution from Scaleway [4]. There exists a 64-core RISC-V desktop [5] and a Framework laptop mainboard [6] sporting a RISC-V CPU. And there is the OrangePi RV2 [7] I have that is "greenfly".

Is it early days? Certainly! But too early? That's up for debate. :)

If nothing else, these patches can be a durable record and used later when RISC-V is a critical platform for Postgres or informational to other projects.

best.

-greg

[1]: https://github.com/abseil/abseil-cpp/pull/1986 absl/crc/internal/crc_riscv.cc
[2]: https://www.firefly.store/products/rs-sra120-risc-v-server-2u-computing-server-cloud-storage-large-model-sg2042
[3]: https://edgeaicomputer.com/our-products/servers/risc-v-compute-server-sra1-20/
[4]: https://www.scaleway.com/en/news/scaleway-launches-its-risc-v-servers-in-the-cloud-a-world-first-and-a-firm-commitment-to-technological-independence/
[5]: https://milkv.io/pioneer and https://www.crowdsupply.com/milk-v/milk-v-pioneer/updates/current-status-of-production
[6]: https://deepcomputing.io/product/dc-roma-risc-v-mainboard/
[7]: http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html

---- TEST PROGRAM OUTPUT:

gburd@rv:~/ws/postgres$ make -f Makefile.RISCV
gcc -O2 riscv-des.c -o des-gcc-sw
gcc -O2 riscv-des.c -march=rv64gcv -o des-gcc-hw
clang-20 -O1 riscv-des.c -o des-clang-o1-sw
clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-hw
clang-20 -O2 riscv-des.c -o des-clang-o2-sw
clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-o2-hw
gcc -O2 -o popcnt-gcc-o2-sw riscv-popcnt.c
gcc -O2 -march=rv64gc_zbb -o popcnt-gcc-o2-hw riscv-popcnt.c
clang-20 -O2 -o popcnt-clang-o2-sw riscv-popcnt.c
clang-20 -O2 -march=rv64gc_zbb -o popcnt-clang-o2-hw riscv-popcnt.c
gcc -O2 -o crc32c-gcc-o2-sw riscv-crc32c.c
gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-o2-hw riscv-crc32c.c
clang-20 -O2 -o crc32c-clang-o2-sw riscv-crc32c.c
clang-20 -O2 -march=rv64gc_zbc -o crc32c-clang-o2-hw riscv-crc32c.c
gburd@rv:~/ws/postgres$ make -f Makefile.RISCV test
./des-gcc-sw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Not enabled

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.409 seconds (409 ns/iter)
With barriers: 0.416 seconds (416 ns/iter)
Overhead: 1.6%
./des-gcc-hw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.410 seconds (410 ns/iter)
With barriers: 0.410 seconds (410 ns/iter)
Overhead: Negligible
./des-clang-o1-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.516 seconds (516 ns/iter)
Overhead: Negligible
./des-clang-o1-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.405 seconds (405 ns/iter)
With barriers: 0.405 seconds (405 ns/iter)
Overhead: Negligible
./des-clang-o2-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.518 seconds (518 ns/iter)
Overhead: Negligible
./des-clang-o2-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)

Testing WITHOUT compiler barriers:
ERROR: un_pbox mismatch:
un_pbox[0] = 15, expected 8
un_pbox[1] = 6, expected 16
un_pbox[2] = 19, expected 22
un_pbox[3] = 20, expected 30
un_pbox[4] = 28, expected 12
... and 27 more errors
FAIL: Permutation tables are incorrect

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.093 seconds (93 ns/iter)
With barriers: 0.407 seconds (407 ns/iter)
Overhead: 335.5%
./popcnt-gcc-o2-sw
sw popcount: 0.183 sec ( 547.89 MB/s)
hw popcount: 0.274 sec ( 365.40 MB/s)

diff: 0.67x
match: 406261900 bits counted
./popcnt-gcc-o2-hw
sw popcount: 0.182 sec ( 548.17 MB/s)
hw popcount: 0.044 sec ( 2287.82 MB/s)

diff: 4.17x
match: 406261900 bits counted
./popcnt-clang-o2-sw
sw popcount: 0.188 sec ( 531.96 MB/s)
hw popcount: 0.207 sec ( 482.84 MB/s)

diff: 0.91x
match: 406261900 bits counted
./popcnt-clang-o2-hw
sw popcount: 0.224 sec ( 446.46 MB/s)
hw popcount: 0.056 sec ( 1794.83 MB/s)

diff: 4.02x
match: 406261900 bits counted
./crc32c-gcc-o2-sw
sw crc32c: 0.651 sec ( 153.68 MB/s)
hw crc32c: 0.651 sec ( 153.72 MB/s)

diff: 1.00x
match: 0x0B141F2D

validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-gcc-o2-hw
sw crc32c: 0.651 sec ( 153.70 MB/s)
hw crc32c: 0.000 sec ( 308052.33 MB/s)

diff: 2004.21x
match: 0x0B141F2D

validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-sw
sw crc32c: 0.584 sec ( 171.10 MB/s)
hw crc32c: 0.584 sec ( 171.17 MB/s)

diff: 1.00x
match: 0x0B141F2D

validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-hw
sw crc32c: 0.584 sec ( 171.15 MB/s)
hw crc32c: 0.000 sec ( 309282.38 MB/s)

diff: 1807.08x
match: 0x0B141F2D

validation: CRC32C("123456789") = 0xE3069283 (correct)

Attachments:

Makefile.RISCV (application/octet-stream)
riscv-crc32c.c (text/x-csrc)
riscv-des.c (text/x-csrc)
riscv-popcnt.c (text/x-csrc)
v3-0001-Avoid-Clang-RISC-V-auto-vectorization-bug-in-DES.patch (text/x-patch, +22 -2)
v3-0002-Add-RISC-V-popcount-using-Zbb-extension.patch (text/x-patch, +226 -9)
v3-0003-Add-RISC-V-CRC32C-using-the-Zbc-extension.patch (text/x-patch, +482 -8)
#10John Naylor
john.naylor@enterprisedb.com
In reply to: Greg Burd (#9)
clang bug affecting greenfly

[new subject]

On Sat, Mar 28, 2026 at 3:22 AM Greg Burd <greg@burd.me> wrote:

0001 - This is a bug fix for DES/RISC-V/Clang DES initialization.

------> Join me in "the rabbit hole" on this issue if you care to...

The existing software DES (as shown by the build-farm animal "greenfly" [1]) fails because Clang 20 has an auto-vectorization bug that we trigger in the DES initialization code (des_init() function), not the DES encryption algorithm itself.

[disable vectorization entirely]
While that might also fix the other intermittent bug we'd been seeing on greenfly (not tested), disabling all RVV optimizations seems too heavy-handed to me.

The first thing I notice is that not very long ago the buildfarm had 3
gcc RISC-V members, but not anymore. If you care about having coverage
for this hardware, I'd suggest picking up gcc again if that's still
working, and wait and see about clang. Clang has shipped broken code
generation for obscure platforms in the past, and it seems here we're
not even sure of the extent of the breakage.

--
John Naylor
Amazon Web Services

#11Greg Burd
greg@burd.me
In reply to: John Naylor (#10)
Re: clang bug affecting greenfly

On Mon, Mar 30, 2026, at 2:39 AM, John Naylor wrote:

[new subject]

On Sat, Mar 28, 2026 at 3:22 AM Greg Burd <greg@burd.me> wrote:

0001 - This is a bug fix for DES/RISC-V/Clang DES initialization.

------> Join me in "the rabbit hole" on this issue if you care to...

The existing software DES (as shown by the build-farm animal "greenfly" [1]) fails because Clang 20 has an auto-vectorization bug that we trigger in the DES initialization code (des_init() function), not the DES encryption algorithm itself.

[disable vectorization entirely]
While that might also fix the other intermittent bug we'd been seeing on greenfly (not tested), disabling all RVV optimizations seems too heavy-handed to me.

The first thing I notice is that not very long ago the buildfarm had 3
gcc RISC-V members, but not anymore. If you care about having coverage
for this hardware, I'd suggest picking up gcc again if that's still
working, and wait and see about clang. Clang has shipped broken code
generation for obscure platforms in the past, and it seems here we're
not even sure of the extent of the breakage.

Hey John,

All fair points. I've changed greenfly to use GCC 13.3.0, thanks for the suggestion.

--
John Naylor
Amazon Web Services

best.

-greg

#12Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Greg Burd (#11)
Re: clang bug affecting greenfly

Hi Greg,

On 2026-Mar-30, Greg Burd wrote:

All fair points. I've changed greenfly to use GCC 13.3.0, thanks for
the suggestion.

Hmm, the 'update_personality.pl' script supplied with the buildfarm
client script allows you to change the compiler version, but not the
compiler itself -- the rationale being that a machine with a different
compiler should be a different animal. So I suggest to put greenfly to
rest until the clang situation is resolved (at which time you're welcome
to turn it back on), and request a new animal to use in the same machine
running gcc. Right now, greenfly is reporting that it is running clang
13.3.0, which AFAIK makes it inconsistent.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Estoy de acuerdo contigo en que la verdad absoluta no existe...
El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)

#13Greg Burd
greg@burd.me
In reply to: Alvaro Herrera (#12)
Re: clang bug affecting greenfly

On Mon, Mar 30, 2026, at 12:14 PM, Álvaro Herrera wrote:

Hi Greg,

On 2026-Mar-30, Greg Burd wrote:

All fair points. I've changed greenfly to use GCC 13.3.0, thanks for
the suggestion.

Hmm, the 'update_personality.pl' script supplied with the buildfarm
client script allows you to change the compiler version, but not the
compiler itself -- the rationale being that a machine with a different
compiler should be a different animal. So I suggest to put greenfly to
rest until the clang situation is resolved (at which time you're welcome
to turn it back on), and request a new animal to use in the same machine
running gcc. Right now, greenfly is reporting that it is running clang
13.3.0, which AFAIK makes it inconsistent.

Interesting, I was just looking for a way to change that after discovering the update_personality.pl limitation.

Sure, I'll apply for a new animal and change greenfly back.

-greg


--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Estoy de acuerdo contigo en que la verdad absoluta no existe...
El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)

#14John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#10)
Re: clang bug affecting greenfly

On Mon, Mar 30, 2026 at 1:39 PM John Naylor <johncnaylorls@gmail.com> wrote:

The first thing I notice is that not very long ago the buildfarm had 3
gcc RISC-V members, but not anymore.

Of course, as soon as I said that, two of them reappeared fairly
quickly after a month's absence...

--
John Naylor
Amazon Web Services

#15Greg Burd
greg@burd.me
In reply to: John Naylor (#14)
Re: clang bug affecting greenfly

On Tue, Mar 31, 2026, at 3:32 AM, John Naylor wrote:

On Mon, Mar 30, 2026 at 1:39 PM John Naylor <johncnaylorls@gmail.com> wrote:

The first thing I notice is that not very long ago the buildfarm had 3
gcc RISC-V members, but not anymore.

Of course, as soon as I said that, two of them reappeared fairly
quickly after a month's absence...

Well, add one more to that list. On the same box as "greenfly (clang)" is "mollusk (gcc)". Their configurations differ only in compiler. Right now greenfly is not reporting results (I pass the "--test" flag) given the known issues with clang. I'm happy to re-enable it if that's worthwhile.

best.

-greg


--
John Naylor
Amazon Web Services

#16Greg Burd
greg@burd.me
In reply to: Greg Burd (#9)
Re: Add RISC-V Zbb popcount optimization

On Fri, Mar 27, 2026, at 4:22 PM, Greg Burd wrote:

On Mon, Mar 23, 2026, at 11:09 AM, Nathan Bossart wrote:

On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:

I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
all that effectively - hard to believe there's any real world workloads where
that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
world use of those platforms, making niche-y perf improvements somewhat
worthwhile. Whereas there's afaict not yet a whole lot of riscv production
adoption.

Hey Nathan,

That work was partially motivated by vector stuff that used popcount
functions pretty heavily, but yeah, the complexity compared to the gains is
the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
and Neon). I'd still consider using AVX-512, etc. for things if the impact
on real-world workloads was huge, though.

Yes, that and by research done while trying to understand why my RISC-V
build farm animal "greenfly" (OrangePi RV2 with a VisionFive 2 CPU:
RISC-V RV64GC + Zba/Zbb/Zbc/Zbs) is failing consistently.

--
nathan

Forgive me, while $subject only mentions popcount I couldn't help
myself so I added a few more RISC-V patches including a bug fix that I
hope makes greenfly happy again.

0001 - This is a bug fix for DES/RISC-V/Clang DES initialization.

------> Join me in "the rabbit hole" on this issue if you care to...

The existing software DES (as shown by the build-farm animal "greenfly"
[1]) fails because Clang 20 has an auto-vectorization bug that we
trigger in the DES initialization code (des_init() function), not the
DES encryption algorithm itself.

I searched the LLVM issue tracker, here are the issues that caught my eye:
1. Issue #176001 - "RISC-V Wrong code at -O1"
- Vector peephole optimization with vmerge folding
- Fixed by PR #176077 (merged Jan 2024)
- Link: https://github.com/llvm/llvm-project/issues/176001
2. Issue #187458 - "Wrong code for vector.extract.last.active"
- Large index issues with zvl1024b
- Partially fixed, still work ongoing
- Link: https://github.com/llvm/llvm-project/issues/187458
3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
- Illegal instruction from mismatched EEW
- Under investigation
- Link: https://github.com/llvm/llvm-project/issues/171978
4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
- Cost model fixes for scatter/gather (merged Jan 2026)
- Link: https://github.com/llvm/llvm-project/pull/176105

My fix in 0001 is simply adding this in a few places in crypt-des.c:

#if defined(__riscv) && defined(__clang__)
pg_memory_barrier();
#endif

While searching I ran across a different solution: adding `-mllvm
-riscv-v-vector-bits-min=0` sets the minimum vector bit width for the
RISC-V vector extension in LLVM to 0, which disables all vectorization
and forces scalar code generation, so no RVV instructions are emitted.
This would prevent the DES bug at the cost of any vectorization anywhere
in the binary.

While that might also fix the other intermittent bug we'd been seeing
on greenfly (not tested), disabling all RVV optimizations seems too
heavy-handed to me.

------> Moving on.

0002 - (was "0001" in v2) this is unchanged, it implements popcount
using Zbb extension on RISC-V

0003 - is a small patch adapted from the Google Abseil project's
RISC-V CRC32C implementation [1]. It is *a lot faster* than the
software crc32c we fall back to now (see: riscv-crc32c.c). This
algorithm requires the Zbc (or Zbkc) extension (for clmul), so the patch
tests for that at build time and adds the '-march' flag when it is
available. However, as is the case for Zbb and popcnt, the presence of
Zbc (or Zbkc) must be detected at runtime. That's done following the
pre-existing pattern used for ARM features. This does introduce some
runtime overhead and complexity, not more than required I hope.

I attached test code, and results at the end of this email:
* riscv-popcnt.c - unchanged
* riscv-crc32c.c - new, based on work in the Google Abseil project
* riscv-des.c - highlights the fix for DES using Clang on RISC-V
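
The test output later in this message validates each CRC32C build against the check value CRC32C("123456789") = 0xE3069283. As a reference point, a bit-at-a-time CRC-32C (Castagnoli polynomial, reflected form 0x82F63B78) can be sketched as below. This is an illustrative baseline with a hypothetical function name, not the code in riscv-crc32c.c and not the clmul-based algorithm from 0003.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C: slow, but easy to audit against faster variants. */
static uint32_t
sketch_crc32c_bitwise(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFFu;	/* standard initial value */

	assert(data != NULL || len == 0);

	while (len-- > 0)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
		{
			/* mask is all ones when the low bit is set, else zero */
			uint32_t	mask = 0u - (crc & 1u);

			crc = (crc >> 1) ^ (0x82F63B78u & mask);
		}
	}
	return crc ^ 0xFFFFFFFFu;	/* standard final inversion */
}
```

Any accelerated implementation should produce the same value for the same input, which is what the "match" and "validation" lines in the test output check.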

I guess the question for 0002 and/or 0003 is whether the "juice" is
worth the "squeeze". There is a lot of performance juice to be had IMO.
But some might argue that RISC-V isn't widely adopted yet, and they'd
be right. Others might point out that RISC-V is currently showing up
in embedded systems more than server/desktop/laptop/cloud, also true.
However, there is some evidence that this is changing: there are RISC-V
servers [2][3], and there is a hosted (cloud) solution from Scaleway
[4]. There exists a 64-core RISC-V desktop [5] and a Framework laptop
mainboard [6] sporting a RISC-V CPU. And there is the OrangePi RV2
[7] I have that is "greenfly".

Is it early days? Certainly! But too early? That's up for debate. :)

If nothing else, these patches can serve as a durable record, useful
later when RISC-V is a critical platform for Postgres, or informative
to other projects.

Rebased and tested (v4), adding better support for RISC-V with a fix for DES and faster popcount and CRC32C when the CPU supports it.

best.

-greg

[1] https://github.com/abseil/abseil-cpp/pull/1986
absl/crc/internal/crc_riscv.cc
[2]
https://www.firefly.store/products/rs-sra120-risc-v-server-2u-computing-server-cloud-storage-large-model-sg2042
[3]
https://edgeaicomputer.com/our-products/servers/risc-v-compute-server-sra1-20/
[4]
https://www.scaleway.com/en/news/scaleway-launches-its-risc-v-servers-in-the-cloud-a-world-first-and-a-firm-commitment-to-technological-independence/
[5] https://milkv.io/pioneer and
https://www.crowdsupply.com/milk-v/milk-v-pioneer/updates/current-status-of-production
[6] https://deepcomputing.io/product/dc-roma-risc-v-mainboard/
[7]
http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html

---- TEST PROGRAM OUTPUT:

gburd@rv:~/ws/postgres$ make -f Makefile.RISCV
gcc -O2 riscv-des.c -o des-gcc-sw
gcc -O2 riscv-des.c -march=rv64gcv -o des-gcc-hw
clang-20 -O1 riscv-des.c -o des-clang-o1-sw
clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-hw
clang-20 -O2 riscv-des.c -o des-clang-o2-sw
clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-o2-hw
gcc -O2 -o popcnt-gcc-o2-sw riscv-popcnt.c
gcc -O2 -march=rv64gc_zbb -o popcnt-gcc-o2-hw riscv-popcnt.c
clang-20 -O2 -o popcnt-clang-o2-sw riscv-popcnt.c
clang-20 -O2 -march=rv64gc_zbb -o popcnt-clang-o2-hw riscv-popcnt.c
gcc -O2 -o crc32c-gcc-o2-sw riscv-crc32c.c
gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-o2-hw riscv-crc32c.c
clang-20 -O2 -o crc32c-clang-o2-sw riscv-crc32c.c
clang-20 -O2 -march=rv64gc_zbc -o crc32c-clang-o2-hw riscv-crc32c.c
gburd@rv:~/ws/postgres$ make -f Makefile.RISCV test
./des-gcc-sw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Not enabled

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.409 seconds (409 ns/iter)
With barriers: 0.416 seconds (416 ns/iter)
Overhead: 1.6%
./des-gcc-hw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.410 seconds (410 ns/iter)
With barriers: 0.410 seconds (410 ns/iter)
Overhead: Negligible
./des-clang-o1-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.516 seconds (516 ns/iter)
Overhead: Negligible
./des-clang-o1-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.405 seconds (405 ns/iter)
With barriers: 0.405 seconds (405 ns/iter)
Overhead: Negligible
./des-clang-o2-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled

Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.518 seconds (518 ns/iter)
Overhead: Negligible
./des-clang-o2-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)

Testing WITHOUT compiler barriers:
ERROR: un_pbox mismatch:
un_pbox[0] = 15, expected 8
un_pbox[1] = 6, expected 16
un_pbox[2] = 19, expected 22
un_pbox[3] = 20, expected 30
un_pbox[4] = 28, expected 12
... and 27 more errors
FAIL: Permutation tables are incorrect

Testing WITH compiler barriers:
PASS: Permutation tables are correct

Performance Comparison (1000000 iterations):
Without barriers: 0.093 seconds (93 ns/iter)
With barriers: 0.407 seconds (407 ns/iter)
Overhead: 335.5%
./popcnt-gcc-o2-sw
sw popcount: 0.183 sec ( 547.89 MB/s)
hw popcount: 0.274 sec ( 365.40 MB/s)

diff: 0.67x
match: 406261900 bits counted
./popcnt-gcc-o2-hw
sw popcount: 0.182 sec ( 548.17 MB/s)
hw popcount: 0.044 sec ( 2287.82 MB/s)

diff: 4.17x
match: 406261900 bits counted
./popcnt-clang-o2-sw
sw popcount: 0.188 sec ( 531.96 MB/s)
hw popcount: 0.207 sec ( 482.84 MB/s)

diff: 0.91x
match: 406261900 bits counted
./popcnt-clang-o2-hw
sw popcount: 0.224 sec ( 446.46 MB/s)
hw popcount: 0.056 sec ( 1794.83 MB/s)

diff: 4.02x
match: 406261900 bits counted
./crc32c-gcc-o2-sw
sw crc32c: 0.651 sec ( 153.68 MB/s)
hw crc32c: 0.651 sec ( 153.72 MB/s)

diff: 1.00x
match: 0x0B141F2D

validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-gcc-o2-hw
sw crc32c: 0.651 sec ( 153.70 MB/s)
hw crc32c: 0.000 sec ( 308052.33 MB/s)

diff: 2004.21x
match: 0x0B141F2D

validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-sw
sw crc32c: 0.584 sec ( 171.10 MB/s)
hw crc32c: 0.584 sec ( 171.17 MB/s)

diff: 1.00x
match: 0x0B141F2D

validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-hw
sw crc32c: 0.584 sec ( 171.15 MB/s)
hw crc32c: 0.000 sec ( 309282.38 MB/s)

diff: 1807.08x
match: 0x0B141F2D

validation: CRC32C("123456789") = 0xE3069283 (correct)
Attachments:
* Makefile.RISCV
* riscv-crc32c.c
* riscv-des.c
* riscv-popcnt.c
* v3-0001-Avoid-Clang-RISC-V-auto-vectorization-bug-in-DES.patch
* v3-0002-Add-RISC-V-popcount-using-Zbb-extension.patch
* v3-0003-Add-RISC-V-CRC32C-using-the-Zbc-extension.patch

Attachments:

v4-0001-Avoid-Clang-RISC-V-auto-vectorization-bug-in-DES.patch (text/x-patch, +22 -2)
v4-0002-Add-RISC-V-popcount-using-Zbb-extension.patch (text/x-patch, +253 -9)
v4-0003-Add-RISC-V-CRC32C-using-the-Zbc-extension.patch (text/x-patch, +482 -8)
#17Andres Freund
andres@anarazel.de
In reply to: Greg Burd (#16)
Re: Add RISC-V Zbb popcount optimization

Hi,

On 2026-05-27 13:04:46 -0400, Greg Burd wrote:

Forgive me, while $subject only mentions popcount I couldn't help
myself so I added a few more RISC-V patches including a bug fix that I
hope makes greenfly happy again.

0001 - This is a bug fix for DES/RISC-V/Clang DES initialization.

How confident are we that this bug just affects DES?

------> Join me in "the rabbit hole" on this issue if you care to...

The existing software DES (as shown by the build-farm animal "greenfly"
[1]) fails because Clang 20 has an auto-vectorization bug that we
trigger in the DES initialization code (des_init() function), not the
DES encryption algorithm itself.

I searched the LLVM issue tracker, here are the issues that caught my eye:
1. Issue #176001 - "RISC-V Wrong code at -O1"
- Vector peephole optimization with vmerge folding
- Fixed by PR #176077 (merged Jan 2024)
- Link: https://github.com/llvm/llvm-project/issues/176001
2. Issue #187458 - "Wrong code for vector.extract.last.active"
- Large index issues with zvl1024b
- Partially fixed, still work ongoing
- Link: https://github.com/llvm/llvm-project/issues/187458
3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
- Illegal instruction from mismatched EEW
- Under investigation
- Link: https://github.com/llvm/llvm-project/issues/171978
4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
- Cost model fixes for scatter/gather (merged Jan 2026)
- Link: https://github.com/llvm/llvm-project/pull/176105

Have you confirmed that, by using a newer clang, the merging of the fixes
actually fixes the problem?

ISTM a perfectly viable patch would be to just reject building with a
non-very-recent clang on riscv.

My fix in 0001 is simply adding this in a few places in crypt-des.c:

#if defined(__riscv) && defined(__clang__)
pg_memory_barrier();
#endif

That seems like a pretty odd fix for the problem. If the problem is
auto-vectorization, we should stop auto-vectorization, not sprinkle memory
barriers around. Either by pushing disabling of auto-vectorization on the
file scope (using a #pragma clang push), loop scope (using vectorize(disable))
or just globally (by disabling it on riscv + clang, if the compiler is too
old).

Greetings,

Andres Freund

#18Greg Burd
greg@burd.me
In reply to: Andres Freund (#17)
Re: Add RISC-V Zbb popcount optimization

Hi,

Thanks for taking a look at this.

On 2026-05-28 14:17:16 +0000, Andres Freund wrote:

How confident are we that this bug just affects DES?

It doesn't. I scanned master for the scatter-write idiom that des_init() uses
(`dst[idx[i]] = expr`) and then compiled each candidate on greenfly with both
gcc-13 and clang-20 at -O2 with -march=rv64gcv, diffing the disassembly for
indexed-store instructions. The static scan turned up ~30 sites; clang-20
actually emits RVV scatter/strided stores in three of them, where gcc-13 emits
scalar code:

  contrib/pgcrypto/crypt-des.c   des_init             3x vsoxei8   (the known case)
  src/timezone/zic.c             ~L2330 omittype[]    2x vsoxei8   (byte dest, same shape)
  contrib/pg_trgm/trgm_op.c      compactTrgm          3x vsse8     (strided, not indexed)
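
To make the `dst[idx[i]] = expr` idiom concrete, here is a minimal sketch of inverting the standard DES P permutation with a scatter store, plus a round-trip check. It is modeled on what des_init() is described as doing, not copied from crypt-des.c; the sketch_* names are hypothetical. The values it produces for the first entries (8, 16, 22, 30, 12) are the "expected" values shown in the failing un_pbox output earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The standard DES P permutation, 1-based as in the DES specification. */
static const uint8_t sketch_pbox[32] = {
	16, 7, 20, 21, 29, 12, 28, 17, 1, 15, 23, 26, 5, 18, 31, 10,
	2, 8, 24, 14, 32, 27, 3, 9, 19, 13, 30, 6, 22, 11, 4, 25
};

/* Scatter store: the destination index is read from the table itself. */
static void
sketch_invert_pbox(uint8_t un_pbox[32])
{
	for (size_t i = 0; i < 32; i++)
		un_pbox[sketch_pbox[i] - 1] = (uint8_t) i;
}

/* Round-trip check: every entry must map back to its source index. */
static bool
sketch_pbox_inverse_ok(const uint8_t un_pbox[32])
{
	for (size_t i = 0; i < 32; i++)
	{
		if (un_pbox[sketch_pbox[i] - 1] != i)
			return false;
	}
	return true;
}

/* Startup self-check in the spirit of the riscv-des.c test program. */
static void
sketch_pbox_selfcheck(void)
{
	uint8_t		un_pbox[32];

	sketch_invert_pbox(un_pbox);
	assert(sketch_pbox_inverse_ok(un_pbox));
}
```

A check like this only helps if the checking loop is itself compiled correctly, so comparing against constants computed elsewhere is the more robust test.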

The zic.c hit is the one that bothers me. It's the same shape as des_init() --
byte-sized destination, scatter store via an indexing array -- and it's not
exercised by `make check`, since zic runs at `make install` to compile the IANA
tz data. If clang-20 miscompiles zic the same way it miscompiles des_init(),
the buildfarm wouldn't notice and the resulting tzdata would silently be wrong.
I haven't yet built and run that path on greenfly with clang-20 to confirm the
miscompile actually triggers there; I'll do that next.

The trgm_op.c hit is structurally different (strided store, not indexed
scatter), so it isn't the #176001 pattern. I'm flagging it because it's in the
same family of clang-20 RVV transformations on byte destinations, not because I
have a known miscompile for it.

Several other sites that match the source pattern (ginlogic.c MAYBE-twiddle,
spi.c attribute scatter, spgtextproc.c, ...) survive clang-20 codegen
unvectorized -- the autovec heuristics don't fire on the variable-length,
control-flow-dependent loops where this idiom tends to live in our tree. So the
*visible* exposure today appears bounded to crypt-des.c and zic.c, but I would
not bet money that the next clang release won't extend the autovec to those.

One more data point: the scatter stores only appear at -O2 and -O3, not -O1.
Our default is -O2.

Have you confirmed that, by using a newer clang, the merging of the
fixes actually fixes the problem?

Not yet. apt.llvm.org doesn't ship riscv64 packages, so I'm building clang from
llvm-project main on greenfly now (currently mid-build, ~24h on this CPU under
nice). Once it's installed I'll rerun the asm audit and the test suite under
it; if HEAD generates correct code for crypt-des.c and zic.c that's a real data
point for "reject affected clang" and gives us a version range for the cutoff.

ISTM a perfectly viable patch would be to just reject building with a
non-very-recent clang on riscv.

It's the cleanest answer if HEAD is in fact fixed, and I'd lean that way. The
cost is excluding the clang that ships in current Debian/Ubuntu stable on
riscv64 (clang 20 in noble / trixie), which is what most riscv64 users actually
have today. I'd rather have the clang-HEAD bisect data before picking the
cutoff -- "reject < clang-N" is much more defensible with N pinned to a specific
fix.

That seems like a pretty odd fix for the problem. If the problem is
auto-vectorization, we should stop auto-vectorization, not sprinkle
memory barriers around.

Agreed. pg_memory_barrier() was the first thing that made the test pass and I
shipped it without going back to the right primitive. The audit makes per-loop
pragmas look worse, not better: the affected sites aren't all in one file, and
one of them (zic.c) is third-party-ish code we mostly don't touch. In rough
order of locality:

1. #pragma clang loop vectorize(disable) on the four affected loops
   in des_init() (and the equivalent in zic.c, if confirmed). Doesn't
   scale if the next clang release vectorizes a fifth site.
2. A pg_attribute_no_vectorize on des_init() / the affected zic
   function. Coarser, defensive against new scatter writes inside
   the same function.
3. Configure-time -fno-tree-vectorize on crypt-des.c (and zic.c)
   for clang on riscv. File scope. Simple and static.
4. Configure-time -fno-tree-vectorize globally for clang+riscv64
   until clang-N, where N is the bisected fix. Or refuse < N.
5. Hard error in configure for clang versions in the affected range.
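
As an illustration of option (1), a loop-scope opt-out might look roughly like the sketch below. The macro and function names are hypothetical; the only compiler feature relied on is Clang's `#pragma clang loop vectorize(disable)`, spelled through `_Pragma` so it can be wrapped in a macro that expands to nothing on other compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expands to a Clang loop hint; a no-op for every other compiler. */
#if defined(__clang__)
#define SKETCH_NO_VECTORIZE_LOOP _Pragma("clang loop vectorize(disable)")
#else
#define SKETCH_NO_VECTORIZE_LOOP
#endif

/* Scatter store with auto-vectorization turned off for this loop only. */
static void
sketch_scatter_bytes(uint8_t *dst, const uint8_t *idx, size_t n)
{
	assert(dst != NULL && idx != NULL);

	/* the hint must sit directly in front of the loop it applies to */
	SKETCH_NO_VECTORIZE_LOOP
	for (size_t i = 0; i < n; i++)
		dst[idx[i]] = (uint8_t) i;
}
```

The hint has to be repeated on every affected loop, which is the scaling problem noted for option (1); the file-scope and global options trade that precision for coverage.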

After the audit my preference shifted from (1) to (3) or (4): the bug isn't
DES-specific, the patch shouldn't be either, and decorating loops one at a time
as we trip over them is exactly the "sprinkle barriers around" complaint applied
a level up. (4) would also fix the secondary intermittent issue we've seen on
greenfly that I haven't been able to attribute to any specific loop.

I'll send v5 once I have:
- clang-HEAD on greenfly and the rerun audit/test results
- confirmation (or refutation) that the zic.c miscompile actually
  triggers, not just that the scatter store is emitted

If both confirm what the asm audit suggests, v5 will drop the
pg_memory_barrier() approach in 0001 and replace it with either (3)
or (4)+(5), depending on whether clang-HEAD passes.

The other two patches do improve CRC32C and popcount on RISC-V, so they are
still worth considering.

best.

-greg

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Burd (#18)
Re: Add RISC-V Zbb popcount optimization

"Greg Burd" <greg@burd.me> writes:

The zic.c hit is the one that bothers me. It's the same shape as des_init() --
byte-sized destination, scatter store via an indexing array -- and it's not
exercised by `make check`, since zic runs at `make install` to compile the IANA
tz data.

"make check" certainly *will* run zic.c, while creating the temporary
installation. (Unless you use --with-system-tzdata, which I'm sure
most production builds do.) But our test suite touches only a very
tiny fraction of the tzdata files, so even if there's some
corruption in them we wouldn't necessarily notice it.

regards, tom lane

#20Greg Burd
greg@burd.me
In reply to: Tom Lane (#19)
Re: Add RISC-V Zbb popcount optimization

On Fri, May 29, 2026, at 12:23 PM, Tom Lane wrote:

"Greg Burd" <greg@burd.me> writes:

The zic.c hit is the one that bothers me. It's the same shape as des_init() --
byte-sized destination, scatter store via an indexing array -- and it's not
exercised by `make check`, since zic runs at `make install` to compile the IANA
tz data.

"make check" certainly *will* run zic.c, while creating the temporary
installation. (Unless you use --with-system-tzdata, which I'm sure
most production builds do.) But our test suite touches only a very
tiny fraction of the tzdata files, so even if there's some
corruption in them we wouldn't necessarily notice it.

Apologies, you are correct about "make check". I think it's still bothersome that a bug can hide in plain sight, but I've not thought about how to supplement tests to fix that.

Still waiting on LLVM/Clang to compile...

regards, tom lane

best.

-greg