[PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

Started by Greg Burd22 days ago26 messages
#1Greg Burd
Greg Burd
greg@burd.me
2 attachment(s)

Hi all,

Dave and I have been working together to get ARM64 with MSVC functional.
The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.

PostgreSQL when compiled with MSVC on ARM64 architecture in particular
when optimizations are enabled (e.g., /O2), fails 027_stream_regress.
After some investigation and analysis of generated assembly code, Dave
Cramer and I have identified that the root cause is insufficient memory
barrier semantics in both atomic operations and spinlocks on ARM64 when
compiled with MSVC with /O2.

Dave knew I was in the process of setting up a Win11/ARM64/MSVC build
animal and pinged me with this issue. Dave got me started on the path
to finding the issue by sending me his work around:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ static void
WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * before the data page can be written out.  This implements the basic
  * WAL rule "write the log before the data".)
  */
+#pragma optimize("",off)
 XLogRecPtr
 XLogInsertRecord(XLogRecData *rdata,
                                 XLogRecPtr fpw_lsn,
@@ -1088,7 +1089,7 @@ XLogInsertRecord(XLogRecData *rdata,

return EndPos;
}
-
+#pragma optimize("",on)
/*

This pointed a finger at the atomics, so I started there. We used a few
tools, but worth noting is https://godbolt.org/ where we were able to
quickly see that the MSVC assembly was missing the "dmb" barriers on
this platform. I'm not sure how long this link will be valid, but in
the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn

PROBLEM DESCRIPTION

PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
manifesting as timing-dependent failures in critical sections
protected by spinlocks and atomic variables. The failures are
reproducible when the test suite is compiled with optimization flags
(/O2), particularly in the recovery/027_stream_regress test which
involves WAL replication and standby recovery.

The root cause has two components:

1. Atomic operations lack memory barriers on ARM64
2. MSVC spinlock implementation lacks memory barriers on ARM64

TECHNICAL ANALYSIS

PART 1: ATOMIC OPERATIONS MEMORY BARRIERS

GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
generates a call to __aarch64_cas4_acq_rel(), which is a library
function that provides explicit acquire-release memory ordering
semantics through either:

* LSE path (modern ARM64): Using CASAL instruction with built-in
memory ordering [1]ARM Developer: CAS Instructions https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-[2]ARM Developer: Load-Acquire and Store-Release Instructions https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

* Legacy path (older ARM64): Using LDAXR/STLXR instructions with
explicit dmb sy instruction [3]ARM Developer: Data Memory Barrier (DMB) https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en

MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4]Microsoft Learn: _InterlockedCompareExchange Intrinsic Functions https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170[5]Microsoft Learn: ARM Intrinsics - Memory Barriers https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170.

PART 2: SPINLOCK IMPLEMENTATION LACKS BARRIERS

The MSVC spinlock implementation in src/include/storage/s_lock.h had
two issues on ARM64/MSVC:

#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
#define S_UNLOCK(lock) do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)

Issue 1: TAS() uses InterlockedCompareExchange without hardware barriers

The InterlockedCompareExchange intrinsic lacks full memory barrier
semantics on ARM64, identical to the atomic operations issue.

Issue 2: S_UNLOCK() uses only a compiler barrier

_ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
barrier [6]Microsoft Learn: _ReadWriteBarrier is a Compiler Barrier https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170. It prevents the compiler from reordering operations, but
the CPU can still reorder memory operations. This is fundamentally
insufficient for ARM64's weaker memory model.

For comparison, GCC's __sync_lock_release() emits actual hardware
barriers.

IMPACT ON 027_STREAM_REGRESS

The 027_stream_regress test involves WAL replication and standby
recovery — heavily dependent on synchronized access to shared memory
protected by spinlocks [7]PostgreSQL: 027_stream_regress WAL replication testing /messages/by-id/193115.1763243897@sss.pgh.pa.us. Without proper barriers on ARM64:

1. Thread A acquires spinlock (no full barrier emitted)
2. Thread A modifies shared WAL buffer
3. Thread B acquires spinlock before Thread A's writes become visible
4. Thread B reads stale WAL data
5. WAL replication gets corrupted or hangs indefinitely
6. Test times out waiting for standby to catch up

WHY ARM32 AND X86/X64 ARE UNAFFECTED

MSVC's _InterlockedCompareExchange does provide full memory barriers on:
* x86/x64: Memory barriers are implicit in the x86 memory model [8]Intel Volume 3A: Memory Ordering https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
* ARM32: MSVC explicitly generates full barriers for ARM32 [5]Microsoft Learn: ARM Intrinsics - Memory Barriers https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170

Only ARM64 lacks the necessary barriers, making this a platform-specific
issue.

ATTACHED SOLUTION

Add explicit DMB (Data Memory Barrier) instructions before and after
atomic operations and spinlock operations on ARM64 to provide sequential
consistency semantics.

0002: src/inclue/port/atomic/generic-msvc.h

Added platform-specific DMB macros that expand to
__dmb(_ARM64_BARRIER_SY) on ARM64.

Applied to all six atomic operations:
* pg_atomic_compare_exchange_u32_impl()
* pg_atomic_exchange_u32_impl()
* pg_atomic_fetch_add_u32_impl()
* pg_atomic_compare_exchange_u64_impl()
* pg_atomic_exchange_u64_impl()
* pg_atomic_fetch_add_u64_impl()

0001: src/include/storage/s_lock.h

Added ARM64-specific spinlock implementation with explicit DMB barriers [9]Microsoft Developer Blog: The AArch64 processor - Barriers https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968:

#if defined(_M_ARM64)
#define TAS(lock) tas_msvc_arm64(lock)

static __forceinline int
tas_msvc_arm64(volatile slock_t *lock)
{
int result;

/* Full barrier before atomic operation */
__dmb(_ARM64_BARRIER_SY);

/* Atomic compare-and-swap */
result = InterlockedCompareExchange(lock, 1, 0);

/* Full barrier after atomic operation */
__dmb(_ARM64_BARRIER_SY);

return result;
}

#define S_UNLOCK(lock)
do {
__dmb(_ARM64_BARRIER_SY); /* Full barrier before release /
((lock)) = 0;
} while (0)

#else
/* Non-ARM64 MSVC: existing implementation unchanged */
#endif

The spinlock acquire now ensures:

* Before CAS: All prior memory operations complete before
acquiring the lock.

* After CAS: The CAS completes before subsequent operations
access protected data

The spinlock release now ensures:

* Before writing 0: All critical section operations are visible
to other threads

You may ask: why two DMBs in the atomic operations instead of one?
GCC's non-LSE path (LDAXR/STLXR) uses only one DMB because:
* LDAXR (Load-Acquire Exclusive) provides half-barrier acquire
semantics [3]ARM Developer: Data Memory Barrier (DMB) https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en
* STLXR (Store-Release Exclusive) provides half-barrier release
semantics [3]ARM Developer: Data Memory Barrier (DMB) https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en
* One final dmb sy upgrades to full sequential consistency

Since _InterlockedCompareExchange provides NO barrier semantics on
ARM64, we must provide both halves:

* First DMB acts as a release barrier (ensures prior memory ops
complete before CAS)
* Second DMB acts as an acquire barrier (ensures subsequent memory
ops wait for CAS)
* Together they provide sequential consistency matching GCC's
semantics [3]ARM Developer: Data Memory Barrier (DMB) https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en

VERIFICATION

The fix has been verified by:

1. Spinlock fix resolves 027_stream_regress timeout: Test now passes
consistently on MSVC ARM64 with /O2 optimization without hanging

2. Assembly code inspection: Confirmed that dmb sy instructions now
appear in the optimized assembly for ARM64 builds

3. Platform compatibility: No regression on x86/x64 or ARM32 (macros
expand to no-ops; original code path unchanged)

WHY CLANG/LLVM ON MACOS ARM64 DOESN'T HAVE THIS PROBLEM

PostgreSQL builds successfully on Apple Silicon Macs (ARM64) without
the memory ordering issues observed on MSVC Windows ARM64. The
difference comes down to how Clang/LLVM and MSVC handle atomic
operations.

CLANG/LLVM APPROACH (macOS, Linux, Android ARM64)

Clang/LLVM uses GCC-compatible atomic builtins
(__atomic_compare_exchange_n, etc.) even on platforms where it's not
GCC [125][134]. The LLVM backend has an AtomicExpand pass that
properly expands these operations to include appropriate memory
barriers for the target architecture [134].

On ARM64, Clang generates:

__aarch64_cas4_acq_rel library calls (or CASAL instruction with LSE)
Proper acquire-release semantics built into the instruction sequence
Automatic full dmb sy barriers where needed This means PostgreSQL's
use of __sync_lock_test_and_set and _atomic* builtins work correctly
on macOS ARM64 without additional patches.

Phew... I hope I read all those docs correctly and got that right. Feel
free to let me know if I missed something. Looking forward to your
feedback and review so I can get this new build animal up and running.

best.

-greg

[1]: ARM Developer: CAS Instructions https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-
https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-

[2]: ARM Developer: Load-Acquire and Store-Release Instructions https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

[3]: ARM Developer: Data Memory Barrier (DMB) https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en
https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en

[4]: Microsoft Learn: _InterlockedCompareExchange Intrinsic Functions https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170
https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170

[5]: Microsoft Learn: ARM Intrinsics - Memory Barriers https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170
https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170

[6]: Microsoft Learn: _ReadWriteBarrier is a Compiler Barrier https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170
https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170

[7]: PostgreSQL: 027_stream_regress WAL replication testing /messages/by-id/193115.1763243897@sss.pgh.pa.us
/messages/by-id/193115.1763243897@sss.pgh.pa.us

[8]: Intel Volume 3A: Memory Ordering https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

[9]: Microsoft Developer Blog: The AArch64 processor - Barriers https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968
https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968

Attachments:

v1-0001-Address-build-issues-for-ARM64-using-MSVC.patchapplication/octet-stream
v1-0002-DMB-barries-fix-memory-ordering-on-MSVC-ARM64.patchapplication/octet-stream
#2Peter Eisentraut
Peter Eisentraut
peter@eisentraut.org
In reply to: Greg Burd (#1)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

On 20.11.25 21:45, Greg Burd wrote:

Dave and I have been working together to get ARM64 with MSVC functional.
The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.

-  zlib_t = dependency('zlib', required: zlibopt)
+  zlib_t = dependency('zlib', method : 'pkg-config', required: zlibopt)

This appears to be a change unrelated to your patch description.

Also, the second patch contains a number of random whitespace changes.
It would be good to clean that up so the patch is easier to analyze.

#3Greg Burd
Greg Burd
greg@burd.me
In reply to: Peter Eisentraut (#2)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

On Nov 20 2025, at 5:00 pm, Peter Eisentraut <peter@eisentraut.org> wrote:

On 20.11.25 21:45, Greg Burd wrote:

Dave and I have been working together to get ARM64 with MSVC functional.
The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.

Hi Peter,

Thanks for taking a second to review.

-  zlib_t = dependency('zlib', required: zlibopt)
+  zlib_t = dependency('zlib', method : 'pkg-config', required: zlibopt)

This appears to be a change unrelated to your patch description.

Hmmm, these are changes that Dave added to get things to compile. They
worked for me but I'll review them again in the morning and update the description.

Also, the second patch contains a number of random whitespace changes.
It would be good to clean that up so the patch is easier to analyze.

Certainly, apologies. I'll clean that up first thing in the morning.

-greg

#4Nathan Bossart
Nathan Bossart
nathandbossart@gmail.com
In reply to: Greg Burd (#3)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

I took a quick look at 0001.

+#ifdef _MSC_VER
+#include <intrin.h>
+#else
 #include <arm_acle.h>
 unsigned int crc;

I think you can remove this since we unconditionally do the runtime check
for MSVC. In any case, the missing #endif seems likely to cause
problems.

--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
  */
 #include "c.h"
+#ifndef _MSC_VER
 #include <arm_acle.h>
+#endif

Hm. Doesn't MSVC require intrin.h?

--
nathan

#5Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Greg Burd (#1)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

Hi,

On 2025-11-20 15:45:22 -0500, Greg Burd wrote:

Dave and I have been working together to get ARM64 with MSVC functional.
The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.

Thanks for working on this!

This pointed a finger at the atomics, so I started there. We used a few
tools, but worth noting is https://godbolt.org/ where we were able to
quickly see that the MSVC assembly was missing the "dmb" barriers on
this platform. I'm not sure how long this link will be valid, but in
the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn

PROBLEM DESCRIPTION

PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
manifesting as timing-dependent failures in critical sections
protected by spinlocks and atomic variables. The failures are
reproducible when the test suite is compiled with optimization flags
(/O2), particularly in the recovery/027_stream_regress test which
involves WAL replication and standby recovery.

The root cause has two components:

1. Atomic operations lack memory barriers on ARM64
2. MSVC spinlock implementation lacks memory barriers on ARM64

TECHNICAL ANALYSIS

PART 1: ATOMIC OPERATIONS MEMORY BARRIERS

GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
generates a call to __aarch64_cas4_acq_rel(), which is a library
function that provides explicit acquire-release memory ordering
semantics through either:

* LSE path (modern ARM64): Using CASAL instruction with built-in
memory ordering [1][2]

* Legacy path (older ARM64): Using LDAXR/STLXR instructions with
explicit dmb sy instruction [3]

MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4][5].

I couldn't reproduce this result when playing around on godbolt. By specifying
/arch:armv9.4 msvc can be convinced to emit the code for the intrinsics inline
(at least for most of them). And that makes it visible that
_InterlockedCompareExchange() results in a "casal" instruction. Looking that
up shows:
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
which includes these two statements:
"CASA and CASAL load from memory with acquire semantics."
"CASL and CASAL store to memory with release semantics."

Issue 2: S_UNLOCK() uses only a compiler barrier

_ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
barrier [6]. It prevents the compiler from reordering operations, but
the CPU can still reorder memory operations. This is fundamentally
insufficient for ARM64's weaker memory model.

Yea, that seems broken on a non-TSO architecture. Is the problem fixed if you
change just this to include a proper barrier?

Greetings,

Andres Freund

#6Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#5)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

Hi,

On 2025-11-20 19:03:47 -0500, Andres Freund wrote:

MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4][5].

I couldn't reproduce this result when playing around on godbolt. By specifying
/arch:armv9.4 msvc can be convinced to emit the code for the intrinsics inline
(at least for most of them). And that makes it visible that
_InterlockedCompareExchange() results in a "casal" instruction. Looking that
up shows:
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
which includes these two statements:
"CASA and CASAL load from memory with acquire semantics."
"CASL and CASAL store to memory with release semantics."

Further evidence for that is that
https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
states:
"This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order."

(note that we are using the function, not the intrinsic for TAS())

Greetings,

Andres

#7Thomas Munro
Thomas Munro
thomas.munro@gmail.com
In reply to: Greg Burd (#1)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Fri, Nov 21, 2025 at 9:45 AM Greg Burd <greg@burd.me> wrote:

Dave and I have been working together to get ARM64 with MSVC functional.
The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.

A couple of immediate thoughts:

https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedexchangeadd-intrinsic-functions?view=msvc-170

Doesn't seem to match your conclusion.

+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', false)
+    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)

USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK (pg_crc32c_armv8_choose.c) won't
actually work on Windows, but I don't think we should waste time
implementing it: vendor-supported versions of Windows 11 require
ARMv8.1A to boot[1]https://en.wikipedia.org/wiki/Windows_11_version_history[2]https://learn.microsoft.com/en-us/lifecycle/products/windows-11-home-and-pro, and that has it, so I think we should probably
just define USE_ARMV8_CRC32C.

+static __forceinline void
+spin_delay(void)
+{
+     /* Reference:
https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions
*/
+    __isb(_ARM64_BARRIER_SY);
+}

I don't doubt that barriers are missing in a few places, but how can
this be the right place?

If you have an environment set up so it's easy to test, I would also
be very interested to know if my patch set[3]/messages/by-id/CA+hUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA@mail.gmail.com that nukes all this
stuff and includes <stdatomic.h> instead, which is green on
Windows/x86 CI, will just work™ there too.

[1]: https://en.wikipedia.org/wiki/Windows_11_version_history
[2]: https://learn.microsoft.com/en-us/lifecycle/products/windows-11-home-and-pro
[3]: /messages/by-id/CA+hUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA@mail.gmail.com

#8Dave Cramer
Dave Cramer
davecramer@gmail.com
In reply to: Nathan Bossart (#4)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

On Thu, 20 Nov 2025 at 17:36, Nathan Bossart <nathandbossart@gmail.com>
wrote:

I took a quick look at 0001.

+#ifdef _MSC_VER
+#include <intrin.h>
+#else
#include <arm_acle.h>
unsigned int crc;

I think you can remove this since we unconditionally do the runtime check
for MSVC. In any case, the missing #endif seems likely to cause
problems.

--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
*/
#include "c.h"
+#ifndef _MSC_VER
#include <arm_acle.h>
+#endif

Hm. Doesn't MSVC require intrin.h?

--
nathan

I posted this as a bug with Microsoft
https://developercommunity.visualstudio.com/t/MSVCs-_InterlockedCompareExchange-doe/11004239

Dave

#9Greg Burd
Greg Burd
greg@burd.me
In reply to: Andres Freund (#5)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Nov 20 2025, at 7:03 pm, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-11-20 15:45:22 -0500, Greg Burd wrote:

Dave and I have been working together to get ARM64 with MSVC functional.
The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.

Thanks for working on this!

You're welcome, thanks for reviewing it. :)

MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4][5].

I couldn't reproduce this result when playing around on godbolt. By specifying
/arch:armv9.4 msvc can be convinced to emit the code for the
intrinsics inline
(at least for most of them). And that makes it visible that
_InterlockedCompareExchange() results in a "casal" instruction.
Looking that
up shows:
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
which includes these two statements:
"CASA and CASAL load from memory with acquire semantics."
"CASL and CASAL store to memory with release semantics."

I didn't even think to check for a compiler flag for the architecture,
nice call! If this emits the correct instructions it is a much better
approach. I'll give it a try, thanks for the nudge.

Issue 2: S_UNLOCK() uses only a compiler barrier

_ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
barrier [6]. It prevents the compiler from reordering operations, but
the CPU can still reorder memory operations. This is fundamentally
insufficient for ARM64's weaker memory model.

Yea, that seems broken on a non-TSO architecture. Is the problem
fixed if you change just this to include a proper barrier?

Using the flag from above the _ReadWriteBarrier() does (in godbolt) turn
into a casal which (AFAIK) is going to do the trick. I'll see if I can
update meson.build and get this work as intended.

Greetings,

Andres Freund

best.

-greg

#10Greg Burd
Greg Burd
greg@burd.me
In reply to: Andres Freund (#6)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Nov 20 2025, at 7:07 pm, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-11-20 19:03:47 -0500, Andres Freund wrote:

MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4][5].

I couldn't reproduce this result when playing around on godbolt. By specifying
/arch:armv9.4 msvc can be convinced to emit the code for the
intrinsics inline
(at least for most of them). And that makes it visible that
_InterlockedCompareExchange() results in a "casal" instruction.
Looking that
up shows:
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
which includes these two statements:
"CASA and CASAL load from memory with acquire semantics."
"CASL and CASAL store to memory with release semantics."

Further evidence for that is that
https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
states:
"This function generates a full memory barrier (or fence) to ensure
that memory operations are completed in order."

(note that we are using the function, not the intrinsic for TAS())

Got it, thanks.

Greetings,

Andres

best.

-greg

#11Greg Burd
Greg Burd
greg@burd.me
In reply to: Nathan Bossart (#4)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

On Nov 20 2025, at 5:36 pm, Nathan Bossart <nathandbossart@gmail.com> wrote:

I took a quick look at 0001.

Thanks for taking a second to review!

+#ifdef _MSC_VER
+#include <intrin.h>
+#else
#include <arm_acle.h>
unsigned int crc;

I think you can remove this since we unconditionally do the runtime check
for MSVC. In any case, the missing #endif seems likely to cause
problems.

--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
*/
#include "c.h"
+#ifndef _MSC_VER
#include <arm_acle.h>
+#endif

Hm. Doesn't MSVC require intrin.h?

It does in fact fail to compile without this part of the patch, I think
Dave posted a bug about this. I added the missing endif, thanks!

--
nathan

best.

-greg

#12Greg Burd
Greg Burd
greg@burd.me
In reply to: Thomas Munro (#7)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Nov 20 2025, at 7:08 pm, Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Nov 21, 2025 at 9:45 AM Greg Burd <greg@burd.me> wrote:

Dave and I have been working together to get ARM64 with MSVC functional.
The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.

A couple of immediate thoughts:

Hey Thomas, I'm always interested in your thoughts. Thanks for taking a
look at this.

https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedexchangeadd-intrinsic-functions?view=msvc-170

Doesn't seem to match your conclusion.

+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', false)
+    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)

USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK (pg_crc32c_armv8_choose.c) won't
actually work on Windows, but I don't think we should waste time
implementing it: vendor-supported versions of Windows 11 require
ARMv8.1A to boot[1][2], and that has it, so I think we should probably
just define USE_ARMV8_CRC32C.

I'll give that a go, I think your logic is sound.

+static __forceinline void
+spin_delay(void)
+{
+     /* Reference:
https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions
*/
+    __isb(_ARM64_BARRIER_SY);
+}

I don't doubt that barriers are missing in a few places, but how can
this be the right place?

Well, with the compiler flag that Andres mentioned I need to reconsider
this change.

If you have an environment set up so it's easy to test, I would also
be very interested to know if my patch set[3] that nukes all this
stuff and includes <stdatomic.h> instead, which is green on
Windows/x86 CI, will just work™ there too.

I do, and I will as soon as I get this sorted.

[1] https://en.wikipedia.org/wiki/Windows_11_version_history
[2] https://learn.microsoft.com/en-us/lifecycle/products/windows-11-home-and-pro
[3] /messages/by-id/CA+hUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA@mail.gmail.com

best.

-greg

#13Greg Burd
Greg Burd
greg@burd.me
In reply to: Greg Burd (#12)
3 attachment(s)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

Okay,

With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
had to update the S_UNLOCK() macro, the compiler did the rest correctly
AFAICT. So, a much smaller patch (v2) attached. FWIW I'm using Visual
Studio 2026 (18) to build, other platform information below [1]OS Name Microsoft Windows 11 Pro Version 10.0.26100 Build 26100 Other OS Description Not Available OS Manufacturer Microsoft Corporation System Name SANTORINI System Manufacturer Microsoft Corporation System Model Windows Dev Kit 2023 System Type ARM64-based PC System SKU 2043 Processor Snapdragon Compute Platform, 2995 Mhz, 8 Core(s), 8 Logical Processor(s) BIOS Version/Date Microsoft Corporation 13.42.235, 11/13/2024.

The white space/formatting issues seem to have been due to my ineptitude
on Windows running pgindent, or maybe I can blame Perl. I'll try to dig
if I get a minute to figure out what was the cause and if I need to
patch something.

Also attached is a file with some notes on how I build. The build farm
section isn't finished yet, but there is value in the rest for anyone
doing similar work. If you're wondering how I learned PowerShell, I
didn't. I used AI, forgive me. Eventually I'll have a build animal to
add to the "farm" and update a wiki page somewhere when this solidifies
a bit more. :)

best.

-greg

[1]: OS Name Microsoft Windows 11 Pro Version 10.0.26100 Build 26100 Other OS Description Not Available OS Manufacturer Microsoft Corporation System Name SANTORINI System Manufacturer Microsoft Corporation System Model Windows Dev Kit 2023 System Type ARM64-based PC System SKU 2043 Processor Snapdragon Compute Platform, 2995 Mhz, 8 Core(s), 8 Logical Processor(s) BIOS Version/Date Microsoft Corporation 13.42.235, 11/13/2024
Version 10.0.26100 Build 26100
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name SANTORINI
System Manufacturer Microsoft Corporation
System Model Windows Dev Kit 2023
System Type ARM64-based PC
System SKU 2043
Processor Snapdragon Compute Platform, 2995 Mhz, 8 Core(s), 8 Logical Processor(s)
BIOS Version/Date Microsoft Corporation 13.42.235, 11/13/2024

Attachments:

v2-0001-Address-build-issues-for-ARM64-using-MSVC.patchapplication/octet-stream
v2-0002-Fix-Win11-MSVC-ARM64-atomic-intrinsics.patchapplication/octet-stream
pg-arm64-notes.txttext/plain
#14Greg Burd
Greg Burd
greg@burd.me
In reply to: Thomas Munro (#7)
2 attachment(s)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Thu, Nov 20, 2025, at 7:08 PM, Thomas Munro wrote:

If you have an environment set up so it's easy to test, I would also
be very interested to know if my patch set[3] that nukes all this
stuff and includes <stdatomic.h> instead, which is green on
Windows/x86 CI, will just work™ there too.

[3]
/messages/by-id/CA+hUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA@mail.gmail.com

Hello Thomas,

I gave your stdatomic patchs a try, here's where it fell apart linking.

[174/1996] Linking target src/interfaces/libpq/libpq.dll
Creating library src/interfaces/libpq\libpq.lib
[981/1996] Linking target src/interfaces/ecpg/pgtypeslib/libpgtypes.dll
Creating library src\interfaces\ecpg\pgtypeslib\libpgtypes.lib
[1235/1996] Linking target src/interfaces/ecpg/ecpglib/libecpg.dll
Creating library src\interfaces\ecpg\ecpglib\libecpg.lib
[1401/1996] Linking target src/backend/postgres.exe
FAILED: [code=1120] src/backend/postgres.exe src/backend/postgres.pdb
"link" @src/backend/postgres.exe.rsp
Creating library src\backend\postgres.lib
storage_lmgr_s_lock.c.obj : error LNK2019: unresolved external symbol _mm_pause referenced in function perform_spin_delay
src\backend\postgres.exe : fatal error LNK1120: 1 unresolved externals
[1410/1996] Compiling C object src/backend/snowball/dict_snowball.dll.p/libstemmer_stem_UTF_8_german.c.obj
ninja: build stopped: subcommand failed.

I used your v2 and the attached reduced set of Dave and my changes. I'll work on this next week if I find a bit of time, shouldn't be all that hard. I thought I'd update you before then. Also attached is my config script, I sent out my notes on a separate email.

I forgot to provide the MSVC compiler/linker versions for posterity, they are:
cl (msvc 19.50.35718 "Microsoft (R) C/C++ Optimizing Compiler Version 19.50.35718 for ARM64")
link 14.50.35718.0

best.

-greg

Attachments:

v20251123-0001-WIP-ARM64-MSVC-stdatomic.patchapplication/octet-stream; name="=?UTF-8?Q?v20251123-0001-WIP-ARM64-MSVC-stdatomic.patch?="
ConfigMSVC.ps1application/octet-stream; name=ConfigMSVC.ps1
#15Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Greg Burd (#13)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

Hi,

On 2025-11-22 16:43:30 -0500, Greg Burd wrote:

With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
had to update the S_UNLOCK() macro, the compiler did the rest correctly
AFAICT.

Just to be clear - the flag shouldn't be necessary for things to work
correctly. I was only using it to have godbolt inline the intrinsics, rather
than have them generate an out-of-line function call that I couldn't easily
inspect. I'm fairly sure that the out-of-line functions also have the relevant
barriers.

@@ -2509,7 +2513,10 @@ int main(void)
}
'''

-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', 1)
+    have_optimized_crc = true

Should have a comment explaining why we can do this unconditionally...

Greetings,

Andres Freund

#16Thomas Munro
Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#15)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Mon, Nov 24, 2025 at 4:55 AM Andres Freund <andres@anarazel.de> wrote:

-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', 1)
+    have_optimized_crc = true

Should have a comment explaining why we can do this unconditionally...

I was wondering about that in light of your revelation that it must be
using /arch:armv8.0 by default. If we only support Windows/ARM on
armv8.1 hardware (like Windows 11 itself), then I think that means
that we'd want /arch:armv8.1 here:

+  # Add ARM64 architecture flag for Windows 11 ARM64 for correct intrensics
+  if host_machine.system() == 'windows' and host_machine.cpu_family()
== 'aarch64'
+    add_project_arguments('/arch:armv9.4', language: ['c', 'cpp'])
+  endif

We can't impose our own random high ISA requirement like that, or some
machines will choke on illegal instructions.

I wonder if the same applies to Visual Studio on x86. The OS now
requires x86-64-v2 (like RHEL9, and many other Linux distros except
Debian?), but Visual Studio might not know that... but then we might
say that's an issue for the EDB packaging crew to think about, not us.

In that case we might want to say "OK you choose the ARM version, but
we're not going to write the runtime test for CRC on armv8, we'll do a
compile-time test only, because it would be stupid to waste time
writing code for armv8.0".

#17Greg Burd
Greg Burd
greg@burd.me
In reply to: Andres Freund (#15)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Nov 23 2025, at 10:55 am, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-11-22 16:43:30 -0500, Greg Burd wrote:

With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
had to update the S_UNLOCK() macro, the compiler did the rest correctly
AFAICT.

Just to be clear - the flag shouldn't be necessary for things to work
correctly. I was only using it to have godbolt inline the intrinsics, rather
than have them generate an out-of-line function call that I couldn't easily
inspect. I'm fairly sure that the out-of-line functions also have the relevant
barriers.

Well, I learned something new about MSVC (again) today. Thanks Andres
for pointing that out. I fiddled with godbolt a bit and indeed on ARM64
systems the only thing that flag (/arch:armv9.4) does is to replace the
out-of-line function call for the intrinsic with the inlined version of it.

Without:
bl _InterlockedCompareExchange

With:
mov w10,#2
str w8,[sp]
mov x9,sp
casal w8,w10,[x9]

Good to know, and yes it does seem that the ASM includes the correct
instruction sequence.

I think, except for the build issues, the one thing I can put my finger
on as a "fix" is the change from _ReadWriteBarrier() to _dmb(_ARM64_BARRIER_SY).

The docs say:
The _ReadWriteBarrier intrinsic limits the compiler optimizations that
can remove or reorder memory accesses across the point of the call.

Note that it says "compiler optimization" not "system memory barrier".
So, this macro change:

 #define S_UNLOCK(lock)	\
-	do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
+	do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)

Makes more sense. This change replaces a compiler-only barrier with a
full system memory barrier, fundamentally strengthening memory ordering
guarantees for spinlock release on ARM64.

When I remove the compiler flag from the patch and keep the S_UNLOCK()
change the tests pass. I think we've found the critical fix. I'll
update the patch set.

@@ -2509,7 +2513,10 @@ int main(void)
}
'''

-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
__crc32cd without -march=armv8-a+crc',
+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', 1)
+    have_optimized_crc = true

Should have a comment explaining why we can do this unconditionally...

Sure, the more I look at this the more I want to clean it up a bit more.

Greetings,

Andres Freund

Thanks again for the helpful nudge Andres, best.

-greg

#18Greg Burd
Greg Burd
greg@burd.me
In reply to: Thomas Munro (#16)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Nov 23 2025, at 3:32 pm, Thomas Munro <thomas.munro@gmail.com> wrote:

On Mon, Nov 24, 2025 at 4:55 AM Andres Freund <andres@anarazel.de> wrote:

- if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and

__crc32cd without -march=armv8-a+crc',

+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', 1)
+    have_optimized_crc = true

Should have a comment explaining why we can do this unconditionally...

I was wondering about that in light of your revelation that it must be
using /arch:armv8.0 by default. If we only support Windows/ARM on
armv8.1 hardware (like Windows 11 itself), then I think that means
that we'd want /arch:armv8.1 here:

+  # Add ARM64 architecture flag for Windows 11 ARM64 for correct intrensics
+  if host_machine.system() == 'windows' and host_machine.cpu_family()
== 'aarch64'
+    add_project_arguments('/arch:armv9.4', language: ['c', 'cpp'])
+  endif

Hey Thomas,

Turns out that flag isn't required at all, the issue was the S_UNLOCK()
implementation's use of a compiler barrier not a CPU memory barrier.

We can't impose our own random high ISA requirement like that, or some
machines will choke on illegal instructions.

New incoming patch drops this entirely, so no need to worry about it.

I wonder if the same applies to Visual Studio on x86. The OS now
requires x86-64-v2 (like RHEL9, and many other Linux distros except
Debian?), but Visual Studio might not know that... but then we might
say that's an issue for the EDB packaging crew to think about, not us.

In that case we might want to say "OK you choose the ARM version, but
we're not going to write the runtime test for CRC on armv8, we'll do a
compile-time test only, because it would be stupid to waste time
writing code for armv8.0".

best.

-greg

#19Greg Burd
Greg Burd
greg@burd.me
In reply to: Greg Burd (#18)
1 attachment(s)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

Hello,

v3 attached is a more concise single patch that adds support for MSVC on Win11 ARM64. The issues fixed are mostly related to config, build, docs, an implementation of spin_delay() and a change to S_UNLOCK() macro. I've squished the work we did into a single commit and improved the commit message. Thomas, I think in another thread you mentioned that some of what Dave added came from you, correct? We should add you as another author if so, right?

best.

-greg

Attachments:

v3-0001-Enable-PostgreSQL-build-on-Windows-11-ARM64-with-.patchapplication/octet-stream; name="=?UTF-8?Q?v3-0001-Enable-PostgreSQL-build-on-Windows-11-ARM64-with-.patc?= =?UTF-8?Q?h?="
#20Greg Burd
Greg Burd
greg@burd.me
In reply to: Greg Burd (#19)
1 attachment(s)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Mon, Nov 24, 2025, at 10:04 AM, Greg Burd wrote:

Hello,

v3 attached is a more concise single patch that adds support for MSVC
on Win11 ARM64. The issues fixed are mostly related to config, build,
docs, an implementation of spin_delay() and a change to S_UNLOCK()
macro. I've squished the work we did into a single commit and improved
the commit message. Thomas, I think in another thread you mentioned
that some of what Dave added came from you, correct? We should add you
as another author if so, right?

best.

-greg

Attachments:
* v3-0001-Enable-PostgreSQL-build-on-Windows-11-ARM64-with-.patch

I forgot to account for the non-ARM64 implementation of S_UNLOCK(), thanks CI for pointing that out.

-greg

Attachments:

v4-0001-Enable-PostgreSQL-build-on-Windows-11-ARM64-with-.patchapplication/octet-stream; name="=?UTF-8?Q?v4-0001-Enable-PostgreSQL-build-on-Windows-11-ARM64-with-.patc?= =?UTF-8?Q?h?="
#21Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Greg Burd (#20)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

Hi,

On 2025-11-24 11:28:28 -0500, Greg Burd wrote:

@@ -2509,25 +2513,64 @@ int main(void)
}
'''

-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
-      args: test_c_args)
-    # Use ARM CRC Extension unconditionally
-    cdata.set('USE_ARMV8_CRC32C', 1)
-    have_optimized_crc = true
-  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
-      args: test_c_args + ['-march=armv8-a+crc+simd'])
-    # Use ARM CRC Extension, with runtime check
-    cflags_crc += '-march=armv8-a+crc+simd'
-    cdata.set('USE_ARMV8_CRC32C', false)
-    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
-    have_optimized_crc = true
-  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
-      args: test_c_args + ['-march=armv8-a+crc'])
-    # Use ARM CRC Extension, with runtime check
-    cflags_crc += '-march=armv8-a+crc'
-    cdata.set('USE_ARMV8_CRC32C', false)
-    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
-    have_optimized_crc = true
+  if cc.get_id() == 'msvc'
+    # MSVC: Intrinsic availability check for ARM64
+    if host_machine.cpu_family() == 'aarch64'
+      # Test if CRC32C intrinsics are available in intrin.h
+      crc32c_test_msvc = '''
+        #include <intrin.h>
+        int main(void) {
+          uint32_t crc = 0;
+          uint8_t data = 0;
+          crc = __crc32cb(crc, data);
+          return 0;
+        }
+      '''
+      if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
+        cdata.set('USE_ARMV8_CRC32C', 1)
+        have_optimized_crc = true
+        message('Using ARM64 CRC32C hardware acceleration (MSVC)')
+      else
+        message('CRC32C intrinsics not available on this MSVC ARM64 build')
+      endif

Does this:
a) need to be conditional at all, given that it's msvc specific, it seems we
don't need to run a test?
b) why is the msvc block outside of the general aarch64 block but then has
another nested aarch64 test inside? That seems unnecessarily complicated and
requires reindenting unnecessarily much code?

+/*
+ * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
+ */
+#ifdef _M_ARM64
+
+static __forceinline void
+spin_delay(void)
+{
+	 /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
+	__isb(_ARM64_BARRIER_SY);
+}
+#else
+/*
+ * For x64, use _mm_pause intrinsic instead of rep nop.
+ */
static __forceinline void
spin_delay(void)
{
_mm_pause();
}

This continues to use a barrier, with a reference to a list of barrier
semantics that really don't seem to make a whole lot of sense in the context
of spin_delay(). If we want to emit this kind of barrier for now it's ok with
me, but it should be documented as just being a fairly random choice, rather
than a link that doesn't explain anything.

+#endif
#else
static __forceinline void
spin_delay(void)
@@ -623,9 +640,13 @@ spin_delay(void)
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)

-#define S_UNLOCK(lock)	\
+#ifdef _M_ARM64
+#define S_UNLOCK(lock) \
+	do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
+#else

This doesn't seem like the right way to implement this - why not use
InterlockedExchange(lock, 0)? That will do the write with barrier semantics.

Greetings,

Andres Freund

#22Greg Burd
Greg Burd
greg@burd.me
In reply to: Andres Freund (#21)
1 attachment(s)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Mon, Nov 24, 2025, at 6:20 PM, Andres Freund wrote:

Hi,

Thanks again for taking a look at the patch, hopefully I got it right this time. :)

On 2025-11-24 11:28:28 -0500, Greg Burd wrote:

@@ -2509,25 +2513,64 @@ int main(void)
}
'''

-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
-      args: test_c_args)
-    # Use ARM CRC Extension unconditionally
-    cdata.set('USE_ARMV8_CRC32C', 1)
-    have_optimized_crc = true
-  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
-      args: test_c_args + ['-march=armv8-a+crc+simd'])
-    # Use ARM CRC Extension, with runtime check
-    cflags_crc += '-march=armv8-a+crc+simd'
-    cdata.set('USE_ARMV8_CRC32C', false)
-    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
-    have_optimized_crc = true
-  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
-      args: test_c_args + ['-march=armv8-a+crc'])
-    # Use ARM CRC Extension, with runtime check
-    cflags_crc += '-march=armv8-a+crc'
-    cdata.set('USE_ARMV8_CRC32C', false)
-    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
-    have_optimized_crc = true
+  if cc.get_id() == 'msvc'
+    # MSVC: Intrinsic availability check for ARM64
+    if host_machine.cpu_family() == 'aarch64'
+      # Test if CRC32C intrinsics are available in intrin.h
+      crc32c_test_msvc = '''
+        #include <intrin.h>
+        int main(void) {
+          uint32_t crc = 0;
+          uint8_t data = 0;
+          crc = __crc32cb(crc, data);
+          return 0;
+        }
+      '''
+      if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
+        cdata.set('USE_ARMV8_CRC32C', 1)
+        have_optimized_crc = true
+        message('Using ARM64 CRC32C hardware acceleration (MSVC)')
+      else
+        message('CRC32C intrinsics not available on this MSVC ARM64 build')
+      endif

Does this:
a) need to be conditional at all, given that it's msvc specific, it seems we
don't need to run a test?
b) why is the msvc block outside of the general aarch64 block but then has
another nested aarch64 test inside? That seems unnecessarily complicated and
requires reindenting unnecessarily much code?

Yep, I rushed this. Apologies. I've re-worked it with your suggestions.

+/*
+ * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
+ */
+#ifdef _M_ARM64
+
+static __forceinline void
+spin_delay(void)
+{
+	 /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
+	__isb(_ARM64_BARRIER_SY);
+}
+#else
+/*
+ * For x64, use _mm_pause intrinsic instead of rep nop.
+ */
static __forceinline void
spin_delay(void)
{
_mm_pause();
}

This continues to use a barrier, with a reference to a list of barrier
semantics that really don't seem to make a whole lot of sense in the context
of spin_delay(). If we want to emit this kind of barrier for now it's ok with
me, but it should be documented as just being a fairly random choice, rather
than a link that doesn't explain anything.

I did more digging and found that you were right about the use of ISB for spin_delay(). I think I was misled by earlier code in that file (lines 277-286) where there is an implementation of spin_delay() that uses ISB, I ran with that not doing enough research myself. So I did more digging and found an article on this [1]https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm and it seems that YIELD should be used, not ISB. I checked into how others implement this feature, Java [2]https://cr.openjdk.org/~dchuyko/8186670/yield/spinwait.html[3]https://mail.openjdk.org/pipermail/aarch64-port-dev/2017-August/004880.html uses YIELD, Linux [4]https://github.com/torvalds/linux/blob/ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d/arch/arm64/include/asm/vdso/processor.h[5]https://github.com/torvalds/linux/commit/f511e079177a9b97175a9a3b0ee2374d55682403 uses YIELD in cpu_relax() called by __delay().

+#endif
#else
static __forceinline void
spin_delay(void)
@@ -623,9 +640,13 @@ spin_delay(void)
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)

-#define S_UNLOCK(lock)	\
+#ifdef _M_ARM64
+#define S_UNLOCK(lock) \
+	do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
+#else

This doesn't seem like the right way to implement this - why not use
InterlockedExchange(lock, 0)? That will do the write with barrier semantics.

Great idea, done. Seems to work too.

Greetings,

Andres Freund

Given what I learned about YIELD vs ISB for spin delay it seems like a reasonable idea to submit a new patch for the non-MSVC path and switch it to YIELD, what do you think?

v5 attached, best.

-greg

[1]: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
[2]: https://cr.openjdk.org/~dchuyko/8186670/yield/spinwait.html
[3]: https://mail.openjdk.org/pipermail/aarch64-port-dev/2017-August/004880.html
[4]: https://github.com/torvalds/linux/blob/ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d/arch/arm64/include/asm/vdso/processor.h
[5]: https://github.com/torvalds/linux/commit/f511e079177a9b97175a9a3b0ee2374d55682403

Attachments:

v5-0001-Enable-build-on-Windows-11-ARM64-with-MSVC.patchapplication/octet-stream; name="=?UTF-8?Q?v5-0001-Enable-build-on-Windows-11-ARM64-with-MSVC.patch?="
#23Greg Burd
Greg Burd
greg@burd.me
In reply to: Greg Burd (#22)
1 attachment(s)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

Hello.

Rebased with only minor changes to meson.build this patch is ready for review/commit as it is passing tests on my aarch64 Win11 MSVC system. Also note that this system I'm testing on is ready to become a member of the buildfarm (application submitted) and monitor this combo in perpetuity.

FWIW, Dave's bug report to Microsoft [1]https://developercommunity.visualstudio.com/t/MSVCs-_InterlockedCompareExchange-doe/11004239 seems to be a duplicate of [2]https://developercommunity.visualstudio.com/t/MSVC-compiler-optimization-in-PostgreSQL/10982137 and the duplicate is about to be auto-closed.

best.

-greg

[1]: https://developercommunity.visualstudio.com/t/MSVCs-_InterlockedCompareExchange-doe/11004239
[2]: https://developercommunity.visualstudio.com/t/MSVC-compiler-optimization-in-PostgreSQL/10982137

Attachments:

v6-0001-Enable-build-on-Windows-11-ARM64-with-MSVC.patchapplication/octet-stream; name="=?UTF-8?Q?v6-0001-Enable-build-on-Windows-11-ARM64-with-MSVC.patch?="
#24Thomas Munro
Thomas Munro
thomas.munro@gmail.com
In reply to: Greg Burd (#23)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Thu, Dec 11, 2025 at 5:32 AM Greg Burd <greg@burd.me> wrote:

Rebased with only minor changes to meson.build this patch is ready for review/commit as it is passing tests on my aarch64 Win11 MSVC system. Also note that this system I'm testing on is ready to become a member of the buildfarm (application submitted) and monitor this combo in perpetuity.

-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
__crc32cd without -march=armv8-a+crc',
...
+  if host_machine.cpu_family() == 'aarch64'

I think this new nesting of the CRC32 feature tests breaks the test on
"armv7" distros (in our build farm, that's a bunch of RPis running
Debian/Raspbian, but at least FreeBSD and NetBSD also support
"armv7"). Any ARM chip made since around 2011 is really an ARMv8+
chip running Aarch32 code and can thus reach the ARMv8 instructions.
For example "grison" says:

checking build system type... (cached) armv7l-unknown-linux-gnueabihf
...
checking which CRC-32C implementation to use... ARMv8 CRC instructions
with runtime check

-#define S_UNLOCK(lock)    \
+#define S_UNLOCK(lock) \

Bogus whitespace change.

#25Greg Burd
Greg Burd
greg@burd.me
In reply to: Thomas Munro (#24)
1 attachment(s)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

On Wed, Dec 10, 2025, at 4:31 PM, Thomas Munro wrote:

On Thu, Dec 11, 2025 at 5:32 AM Greg Burd <greg@burd.me> wrote:

Rebased with only minor changes to meson.build this patch is ready for review/commit as it is passing tests on my aarch64 Win11 MSVC system. Also note that this system I'm testing on is ready to become a member of the buildfarm (application submitted) and monitor this combo in perpetuity.

-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
__crc32cd without -march=armv8-a+crc',
...
+  if host_machine.cpu_family() == 'aarch64'

Thanks for taking the time to review.

I think this new nesting of the CRC32 feature tests breaks the test on
"armv7" distros (in our build farm, that's a bunch of RPis running
Debian/Raspbian, but at least FreeBSD and NetBSD also support
"armv7"). Any ARM chip made since around 2011 is really an ARMv8+
chip running Aarch32 code and can thus reach the ARMv8 instructions.
For example "grison" says:

checking build system type... (cached) armv7l-unknown-linux-gnueabihf
...
checking which CRC-32C implementation to use... ARMv8 CRC instructions
with runtime check

That was rather silly of me to overlook, apologies. I went back to the code that was there before written by Andres I believe and simply added a test for this one special case. That reduces code churn and simplifies the logic while preserving the earlier behavior.

As I was re-reading I decided to review the YEILD vs ISB question for this platform combo again in light of a82a5ee [1]/messages/by-id/78338F29-9D7F-4DC8-BD71-E9674CE71425@amazon.com. What I found was a wealth of information and work on the topic and the thread mentioned MariaDB so I did some digging (because, why not?). Go-lang[2]https://github.com/golang/go/issues/69232 uses ISB but a comment there led me on to another resource, a blog post from ARM on the topic of spin locks for ARM64[3]https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm. I found that like Go-lang, Rust[4]https://github.com/rust-lang/rust/commit/c064b6560b7ce0adeb9bbf5d7dcf12b1acb0c807 uses ISB (and if you read only one thread on the topic this is the one to dig into). The developers on the thread wrote a number of benchmarks and tested a variety of approaches landing on "ISB SY" on aarch64. I found that MySQL[5]https://bugs.mysql.com/bug.php?id=100664[6]https://github.com/mysql/mysql-server/pull/305/files also went with ISB, as has Cloudera[7]https://gerrit.cloudera.org/#/c/19865/, Impala[8]https://issues.apache.org/jira/browse/IMPALA-12122, Aerospike[9]https://github.com/aerospike/aerospike-common/pull/17, and WebAssembly[10]https://github.com/WebAssembly/threads/issues/15 at which point I stopped looking around. There were other resources out there for understanding ARM and relaxed memory models such as [11]https://developer.arm.com/documentation/genc007826/latest "Barrier Litmus Tests and Cookbook", [12]http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf - Memory Barriers: a Hardware View for Software Hackers, and [13]https://randomascii.wordpress.com/2020/11/29/arm-and-lock-free-programming/ for those that care to learn.

All of this is to say that I swapped out the YEILD for "ISB SY" for the ARM64/MSVC because that seems to make the most sense from the evidence I found and for consistency in our code.

-#define S_UNLOCK(lock)    \
+#define S_UNLOCK(lock) \

Bogus whitespace change.

Fixed, thanks.

best.

-greg

[1]: /messages/by-id/78338F29-9D7F-4DC8-BD71-E9674CE71425@amazon.com
[2]: https://github.com/golang/go/issues/69232
[3]: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
[4]: https://github.com/rust-lang/rust/commit/c064b6560b7ce0adeb9bbf5d7dcf12b1acb0c807
[5]: https://bugs.mysql.com/bug.php?id=100664
[6]: https://github.com/mysql/mysql-server/pull/305/files
[7]: https://gerrit.cloudera.org/#/c/19865/
[8]: https://issues.apache.org/jira/browse/IMPALA-12122
[9]: https://github.com/aerospike/aerospike-common/pull/17
[10]: https://github.com/WebAssembly/threads/issues/15
[11]: https://developer.arm.com/documentation/genc007826/latest "Barrier Litmus Tests and Cookbook"
[12]: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf - Memory Barriers: a Hardware View for Software Hackers
[13]: https://randomascii.wordpress.com/2020/11/29/arm-and-lock-free-programming/

Attachments:

v7-0001-Enable-the-Microsoft-Windows-ARM64-MSVC-platform.patchapplication/octet-stream; name="=?UTF-8?Q?v7-0001-Enable-the-Microsoft-Windows-ARM64-MSVC-platform.patch?="
#26Greg Burd
Greg Burd
greg@burd.me
In reply to: Greg Burd (#25)
1 attachment(s)
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB ​barriers

Well, TIL something about Meson; you must escape multi-line statements with a backslash. Apologies for the oversight/noise.

best.

-greg

Attachments:

v8-0001-Enable-the-Microsoft-Windows-ARM64-MSVC-platform.patchapplication/octet-stream; name="=?UTF-8?Q?v8-0001-Enable-the-Microsoft-Windows-ARM64-MSVC-platform.patch?="