pg_atomic_compare_exchange_*() and memory barriers

Started by Alexander Korotkov · 10 months ago · 19 messages
#1Alexander Korotkov
aekorotkov@gmail.com

Hi all,

While investigating a bug in the patch to get rid of WALBufMappingLock, I
found that surrounding pg_atomic_compare_exchange_u64() with explicit memory
barriers fixes the problem for me. That doesn't feel right because, according
to the comments, both pg_atomic_compare_exchange_u32() and
pg_atomic_compare_exchange_u64() should already provide full memory barrier
semantics. So, I decided to investigate this further.

In my case, the implementation of pg_atomic_compare_exchange_u64() is based
on __atomic_compare_exchange_n().

static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
                                    uint64 *expected, uint64 newval)
{
    AssertPointerAlignment(expected, 8);
    return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

According to the docs, __ATOMIC_SEQ_CST enforces total ordering with *all
other __ATOMIC_SEQ_CST operations*. It only speaks about other
__ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes.
This sounds quite far from our comment promising full barrier semantics,
which, in my understanding, means the equivalent of both pg_read_barrier()
and pg_write_barrier(), i.e. ordering *all other reads/writes*.

I've found that __atomic_compare_exchange_n() ends up with usage
of __aarch64_cas8_acq_rel, which disassembles as following.

0000000000000860 <__aarch64_cas8_acq_rel>:
860: d503245f bti c
864: 90000110 adrp x16, 20000 <__data_start>
868: 3940a210 ldrb w16, [x16, #40]
86c: 34000070 cbz w16, 878 <__aarch64_cas8_acq_rel+0x18>
870: c8e0fc41 casal x0, x1, [x2]
874: d65f03c0 ret
878: aa0003f0 mov x16, x0
87c: c85ffc40 ldaxr x0, [x2]
880: eb10001f cmp x0, x16
884: 54000061 b.ne 890 <__aarch64_cas8_acq_rel+0x30> // b.any
888: c811fc41 stlxr w17, x1, [x2]
88c: 35ffff91 cbnz w17, 87c <__aarch64_cas8_acq_rel+0x1c>
890: d65f03c0 ret

This function first checks whether LSE instructions are present. If so, the
LSE instruction casal is used. If not (as in my case), a loop of
ldaxr/stlxr is used instead. The documentation says ldaxr and stlxr each
provide only one-way barriers: reads/writes after ldaxr will be observed
after it, and reads/writes before stlxr will be observed before it. So, a
pair of these instructions provides a full memory barrier. However, if we
don't observe the expected value, only ldaxr gets executed. Thus, an
unsuccessful pg_atomic_compare_exchange_u64() attempt leaves us with only a
one-way barrier, and that caused the bug in my case. In contrast,
__sync_val_compare_and_swap() uses __aarch64_cas8_sync, which is much the
same as __aarch64_cas8_acq_rel but has an explicit memory barrier at the
end.

I have a couple of ideas on how to fix this problem.
1. Provide full barrier semantics for pg_atomic_compare_exchange_*(). In
particular, for __atomic_compare_exchange_n(), we should probably
use __ATOMIC_ACQ_REL for success and run an explicit memory barrier in the
case of failure (see the sketch below).
2. Document that pg_atomic_compare_exchange_*() doesn't necessarily provide
a memory barrier on failure. But then we need to carefully check whether
existing use cases now need explicit memory barriers.
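
A rough sketch of what option 1 could look like in the generic-gcc.h
implementation (an illustration only, not a tested patch; it assumes
pg_memory_barrier_impl() is the existing full-barrier fallback there):

static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
                                    uint64 *expected, uint64 newval)
{
    bool        ret;

    AssertPointerAlignment(expected, 8);

    /*
     * ACQ_REL on success.  The builtin doesn't allow a failure order
     * stronger than the success order, so use ACQUIRE there and upgrade
     * the failure path to a full barrier explicitly.
     */
    ret = __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    if (!ret)
        pg_memory_barrier_impl();
    return ret;
}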

Any thoughts?

Links
1. /messages/by-id/CAPpHfdsWcQb-u-9K=ipneBf8CMhoUuBWKYc+XWJEHVdtONOepQ@mail.gmail.com
2. https://developer.arm.com/documentation/100941/0101/Barriers

------
Regards,
Alexander Korotkov
Supabase

#2Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#1)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi,

On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote:

While investigating a bug in the patch to get rid of WALBufMappingLock, I
found that the surrounding pg_atomic_compare_exchange_u64() fixes the
problem for me.

That sounds more likely to be due to slowing things down enough to make a race
less likely to be hit. Or a compiler bug.

That doesn't feel right because, according to the
comments, both pg_atomic_compare_exchange_u32() and
pg_atomic_compare_exchange_u64() should provide full memory barrier
semantics. So, I decided to investigate this further.

In my case, the implementation of pg_atomic_compare_exchange_u64() is based
on __atomic_compare_exchange_n().

static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
uint64 *expected, uint64 newval)
{
AssertPointerAlignment(expected, 8);
return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all
other __ATOMIC_SEQ_CST operations*. It only says about other
__ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes.
This sounds quite far from our comment, promising full barrier semantics,
which, in my understanding, means the equivalent of
both pg_read_barrier()/pg_write_barrier(), which should barrier *all other
read/writes*.

A memory barrier in one process/thread always needs to be paired with a barrier
in another process/thread. It's not enough to have a memory barrier on one
side but not the other.

__ATOMIC_SEQ_CST is used to implement the C11/C++11 barriers, and for those
it's more clearly stated that they include acquire/release semantics. The
standard formulation is a bit more complicated IIRC, but here's the
cppreference.com simplification:

https://en.cppreference.com/w/c/atomic/memory_order

A load operation with this memory order performs an acquire operation, a
store performs a release operation, and read-modify-write performs both an
acquire operation and a release operation, plus a single total order exists
in which all threads observe all modifications in the same order (see
Sequentially-consistent ordering below).

This function first checks if LSE instructions are present. If so,
instruction LSE casal is used. If not (in my case), the loop of
ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr
provides one-way barriers. Read/writes after ldaxr will be observed after,
and read/writes before stlxr will be observed before. So, a pair of these
instructions provides a full memory barrier. However, if we don't observe
the expected value, only ldaxr gets executed. So, an unsuccessful
pg_atomic_compare_exchange_u64() attempt will leave us with a one-way
barrier, and that caused a bug in my case.

That has to be a compiler bug. We specify __ATOMIC_SEQ_CST for both
success_memorder *and* failure_memorder.

What compiler & version is this?

I have a couple of ideas on how to fix this problem.
1. Provide full barrier semantics for pg_atomic_compare_exchange_*(). In
particular, for __atomic_compare_exchange_n(), we should probably
use __ATOMIC_ACQ_REL for success and run an explicit memory barrier in the
case of failure.

I don't follow - __ATOMIC_SEQ_CST is *stronger* than __ATOMIC_ACQ_REL.

Greetings,

Andres Freund

#3Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#2)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi, Andres.

Thank you for the reply.

On Fri, Mar 7, 2025 at 6:02 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote:

While investigating a bug in the patch to get rid of WALBufMappingLock, I
found that the surrounding pg_atomic_compare_exchange_u64() fixes the
problem for me.

That sounds more likely to be due to slowing down things enough to make a race
less likely to be hit. Or a compiler bug.

That doesn't feel right because, according to the
comments, both pg_atomic_compare_exchange_u32() and
pg_atomic_compare_exchange_u64() should provide full memory barrier
semantics. So, I decided to investigate this further.

In my case, the implementation of pg_atomic_compare_exchange_u64() is based
on __atomic_compare_exchange_n().

static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
uint64 *expected, uint64 newval)
{
AssertPointerAlignment(expected, 8);
return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all
other __ATOMIC_SEQ_CST operations*. It only says about other
__ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes.
This sounds quite far from our comment, promising full barrier semantics,
which, in my understanding, means the equivalent of
both pg_read_barrier()/pg_write_barrier(), which should barrier *all other
read/writes*.

A memory barrier in one process/thread always needs be paired with a barrier
in another process/thread. It's not enough to have a memory barrier on one
side but not the other.

Yes, there surely should be a memory barrier on the other side. But
does __ATOMIC_SEQ_CST work as a barrier for the regular read/write
operations on the same side?

__ATOMIC_SEQ_CST is used to implement the C11/C++11 barriers, and for those
it's more clearly formulated that that include acquire/release semantics. The
standard formulation is a bit more complicated IIRC, but here's the
cppreference.com simplification:

https://en.cppreference.com/w/c/atomic/memory_order

A load operation with this memory order performs an acquire operation, a
store performs a release operation, and read-modify-write performs both an
acquire operation and a release operation, plus a single total order exists
in which all threads observe all modifications in the same order (see
Sequentially-consistent ordering below).

Thank you. This is not yet clear to me. I probably need to meditate
on this more :)

This function first checks if LSE instructions are present. If so,
instruction LSE casal is used. If not (in my case), the loop of
ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr
provides one-way barriers. Read/writes after ldaxr will be observed after,
and read/writes before stlxr will be observed before. So, a pair of these
instructions provides a full memory barrier. However, if we don't observe
the expected value, only ldaxr gets executed. So, an unsuccessful
pg_atomic_compare_exchange_u64() attempt will leave us with a one-way
barrier, and that caused a bug in my case.

That has to be a compiler bug. We specify __ATOMIC_SEQ_CST for both
success_memorder *and* failure_memorder.

What compiler & version is this?

I've tried and got the same results with two compilers.
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Ubuntu clang version 19.1.1 (1ubuntu1~24.04.2)

------
Regards,
Alexander Korotkov
Supabase

#4Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#3)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi,

On 2025-03-07 18:39:42 +0200, Alexander Korotkov wrote:

On Fri, Mar 7, 2025 at 6:02 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote:

While investigating a bug in the patch to get rid of WALBufMappingLock, I
found that the surrounding pg_atomic_compare_exchange_u64() fixes the
problem for me.

That sounds more likely to be due to slowing down things enough to make a race
less likely to be hit. Or a compiler bug.

That doesn't feel right because, according to the
comments, both pg_atomic_compare_exchange_u32() and
pg_atomic_compare_exchange_u64() should provide full memory barrier
semantics. So, I decided to investigate this further.

In my case, the implementation of pg_atomic_compare_exchange_u64() is based
on __atomic_compare_exchange_n().

static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
uint64 *expected, uint64 newval)
{
AssertPointerAlignment(expected, 8);
return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all
other __ATOMIC_SEQ_CST operations*. It only says about other
__ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes.
This sounds quite far from our comment, promising full barrier semantics,
which, in my understanding, means the equivalent of
both pg_read_barrier()/pg_write_barrier(), which should barrier *all other
read/writes*.

A memory barrier in one process/thread always needs be paired with a barrier
in another process/thread. It's not enough to have a memory barrier on one
side but not the other.

Yes, there surely should be a memory barrier on another side. But
does __ATOMIC_SEQ_CST works as a barrier for the regular read/write
operations on the same side?

Yes, if it's paired with another barrier on the other side. The regular
read/write operations are basically protected transitively, due to

a) An acquire barrier preventing non-atomic reads/writes in the same thread
from appearing to have been moved to before the barrier

b) A release barrier preventing non-atomic reads/writes in the same thread
from appearing to have been moved to after the barrier

c) The other thread being guaranteed a) and b) for the other threads'
non-atomic reads/writes depending on the type of barrier

d) The atomic value itself being guaranteed to be, well, atomic
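
To make that pairing concrete, here is a minimal C11 sketch (an
illustration, not code from this thread): the producer's release store and
the consumer's acquire load of the flag transitively order the non-atomic
payload.

#include <stdatomic.h>
#include <stdbool.h>

static int data;            /* ordinary, non-atomic payload */
static atomic_bool flag;    /* the paired atomic */

/* Release side: the payload store cannot appear to move past the flag store. */
void
producer(void)
{
    data = 42;
    atomic_store_explicit(&flag, true, memory_order_release);
}

/* Acquire side: the payload load cannot appear to move before the flag load. */
int
consumer(void)
{
    if (atomic_load_explicit(&flag, memory_order_acquire))
        return data;        /* guaranteed to observe 42 */
    return -1;              /* flag not set yet */
}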

This function first checks if LSE instructions are present. If so,
instruction LSE casal is used. If not (in my case), the loop of
ldaxr/stlxr is used instead. The documentation says both ldaxr/stlxr
provides one-way barriers. Read/writes after ldaxr will be observed after,
and read/writes before stlxr will be observed before. So, a pair of these
instructions provides a full memory barrier. However, if we don't observe
the expected value, only ldaxr gets executed. So, an unsuccessful
pg_atomic_compare_exchange_u64() attempt will leave us with a one-way
barrier, and that caused a bug in my case.

That has to be a compiler bug. We specify __ATOMIC_SEQ_CST for both
success_memorder *and* failure_memorder.

What compiler & version is this?

I've tried and got the same results with two compilers.
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Ubuntu clang version 19.1.1 (1ubuntu1~24.04.2)

Thinking more about it I wonder if the behaviour of not doing a release in
case the atomic fails *might* arguably actually be correct - if the
compare-exchange fails, there's nothing that the non-atomic values could be
ordered against.

What is the access pattern and the observed problems with it that made you
look at the disassembly?

Greetings,

Andres Freund

#5Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#3)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi, Alexander and Andres!

On Fri, 7 Mar 2025 at 20:40, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Andres.

Thank you for reply.

On Fri, Mar 7, 2025 at 6:02 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-03-07 17:47:08 +0200, Alexander Korotkov wrote:

While investigating a bug in the patch to get rid of WALBufMappingLock, I
found that the surrounding pg_atomic_compare_exchange_u64() fixes the
problem for me.

That sounds more likely to be due to slowing down things enough to make a race
less likely to be hit. Or a compiler bug.

That doesn't feel right because, according to the
comments, both pg_atomic_compare_exchange_u32() and
pg_atomic_compare_exchange_u64() should provide full memory barrier
semantics. So, I decided to investigate this further.

In my case, the implementation of pg_atomic_compare_exchange_u64() is based
on __atomic_compare_exchange_n().

static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
uint64 *expected, uint64 newval)
{
AssertPointerAlignment(expected, 8);
return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

According to the docs __ATOMIC_SEQ_CST enforces total ordering with *all
other __ATOMIC_SEQ_CST operations*. It only says about other
__ATOMIC_SEQ_CST operations; nothing is said about regular reads/writes.
This sounds quite far from our comment, promising full barrier semantics,
which, in my understanding, means the equivalent of
both pg_read_barrier()/pg_write_barrier(), which should barrier *all other
read/writes*.

A memory barrier in one process/thread always needs be paired with a barrier
in another process/thread. It's not enough to have a memory barrier on one
side but not the other.

Yes, there surely should be a memory barrier on another side. But
does __ATOMIC_SEQ_CST works as a barrier for the regular read/write
operations on the same side?

According to the reference posted by Andres [1]:
"Within a thread of execution, accesses (reads and writes) through
volatile lvalues cannot be reordered past observable side-effects
(including other volatile accesses) that are separated by a sequence
point within the same thread, but this order is not guaranteed to be
observed by another thread, since volatile access does not establish
inter-thread synchronization."

Also: "as soon as atomic operations that are not tagged
memory_order_seq_cst enter the picture, the sequential consistency is
lost"

So I think Alexander is right in the original post. The ordering of
everything not marked __ATOMIC_SEQ_CST (atomic or non-atomic
operations) relative to atomic operations marked __ATOMIC_SEQ_CST is
not guaranteed.

__ATOMIC_SEQ_CST is used to implement the C11/C++11 barriers, and for those
it's more clearly formulated that that include acquire/release semantics. The
standard formulation is a bit more complicated IIRC, but here's the
cppreference.com simplification:

https://en.cppreference.com/w/c/atomic/memory_order

A load operation with this memory order performs an acquire operation, a
store performs a release operation, and read-modify-write performs both an
acquire operation and a release operation, plus a single total order exists
in which all threads observe all modifications in the same order (see
Sequentially-consistent ordering below).

Thank you. This is not yet clear for me. I probably need to meditate
more on this :)

I think __ATOMIC_ACQ_REL is needed for success, and __ATOMIC_ACQUIRE
might be sufficient for failure, since a failed
__atomic_compare_exchange_n involves no modification of the atomic
variable: we want to see all modifications from concurrent operations,
but we don't force them to see any change of "ours". [2]
Maybe I'm too optimistic here and having a full memory barrier is
better.
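
For comparison with option 1 upthread, that suggestion amounts to the
following variant (again only an illustrative sketch); the only difference
is that no explicit barrier is issued on the failure path:

static inline bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
                                    uint64 *expected, uint64 newval)
{
    AssertPointerAlignment(expected, 8);
    /* ACQ_REL on success, ACQUIRE only on failure -- no full barrier. */
    return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}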

Thank you very much for considering this issue! I think correct
operation of the atomic primitives is both very important and hard to
check thoroughly. A test that could reproduce the reordering of atomic
and non-atomic operations would be very useful in the regression suite.

[1]: https://en.cppreference.com/w/c/atomic/memory_order#Sequentially-consistent_ordering
[2]: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

Best regards,
Pavel Borisov
Supabase

#6Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#4)
Re: pg_atomic_compare_exchange_*() and memory barriers

On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote:

[...]

Yes, if it's paired with another barrier on the other side. The regular
read/write operations are basically protected transitively, due to

a) An acquire barrier preventing non-atomic reads/writes in the same thread
from appearing to have been moved to before the barrier

b) A release barrier preventing non-atomic reads/writes in the same thread
from appearing to have been moved to after the barrier

c) The other thread being guaranteed a) and b) for the other threads'
non-atomic reads/writes depending on the type of barrier

d) The atomic value itself being guaranteed to be, well, atomic

Yes, looks good to me.

[...]

What is the access pattern and the observed problems with it that made you
look at the disassembly?

Check this code.

l1:     pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);

        /*
         * Try to advance XLogCtl->InitializedUpTo.
         *
         * If the CAS operation failed, then some of the previous pages are
         * not initialized yet, and this backend gives up.
         *
         * Since the initializer of the next page might give up on advancing
         * InitializedUpTo, this backend has to attempt advancing until it
         * finds a page "in the past" or a concurrent backend succeeds at
         * advancing.  When we finish advancing XLogCtl->InitializedUpTo, we
         * notify all the waiters with XLogCtl->InitializedUpToCondVar.
         */
l2:     while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo,
                                              &NewPageBeginPtr, NewPageEndPtr))
        {
            NewPageBeginPtr = NewPageEndPtr;
            NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
            nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr);

l3:         if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr)
            {
                /*
                 * The page at nextidx wasn't initialized yet, so we can't
                 * move InitializedUpTo further.  It will be moved by the
                 * backend which will initialize nextidx.
                 */
                ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
                break;
            }
        }

Consider the following execution order with process 1 (p1) and process 2 (p2).

1. p1 executes l1
2. p2 executes l2 with success
3. p1 executes l2 with failure
4. p2 executes l3, but doesn't see the results of step 1, because 3 didn't
provide enough of a memory barrier

------
Regards,
Alexander Korotkov
Supabase

#7Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#6)
Re: pg_atomic_compare_exchange_*() and memory barriers

On Fri, Mar 7, 2025 at 7:15 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

[...]

Consider the following execution order with process 1 (p1) and process 2 (p2).

1. p1 executes l1
2. p2 executes l2 with success
3. p1 executes l2 with failure
4. p2 execute l3, but doesn't see the results of step 1, because 3 didn't provide enough of memory barrier

Sorry, I messed this up.
The correct sequence is the following.

1. p1 executes l1
2. p1 executes l2 with failure
3. p2 executes l2 with success
4. p2 executes l3, but doesn't see the results of step 1, because 3
didn't provide enough of a memory barrier

Note that step 2 would succeed if it happened after step 3.

------
Regards,
Alexander Korotkov
Supabase

#8Andres Freund
andres@anarazel.de
In reply to: Pavel Borisov (#5)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi,

On 2025-03-07 21:08:34 +0400, Pavel Borisov wrote:

On Fri, 7 Mar 2025 at 20:40, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, there surely should be a memory barrier on another side. But
does __ATOMIC_SEQ_CST works as a barrier for the regular read/write
operations on the same side?

According to a reference posted by Andres [1]:
"Within a thread of execution, accesses (reads and writes) through
volatile lvalues cannot be reordered past observable side-effects
(including other volatile accesses) that are separated by a sequence
point within the same thread, but this order is not guaranteed to be
observed by another thread, since volatile access does not establish
inter-thread synchronization."

How is volatile relevant here?

Also: "as soon as atomic operations that are not tagged
memory_order_seq_cst enter the picture, the sequential consistency is
lost"

Sequential consistency is lost, but that does *NOT* mean that the
acquire/release guarantees, which ATOMIC_SEQ_CST also provides, are lost.

Greetings,

Andres Freund

#9Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#6)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi,

On 2025-03-07 19:15:23 +0200, Alexander Korotkov wrote:

On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote:

What is the access pattern and the observed problems with it that made you
look at the disassembly?

Check this code.

[...]

On 2025-03-07 19:24:39 +0200, Alexander Korotkov wrote:

Sorry, I messed this up.
The correct sequence is following.

1. p1 executes l1
2. p1 executes l2 with failure
3. p2 executes l2 with success
4. p2 execute l3, but doesn't see the results of step 1, because 3
didn't provide enough of memory barrier

Did you mean because 2) didn't provide enough of a memory barrier? Because 3)
does, right?

You could get into exactly the same situation if p1 is scheduled out by the
OS after step 1, no?

Greetings,

Andres Freund

#10Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#9)
Re: pg_atomic_compare_exchange_*() and memory barriers

On Fri, Mar 7, 2025 at 7:38 PM Andres Freund <andres@anarazel.de> wrote:

[...]

On 2025-03-07 19:24:39 +0200, Alexander Korotkov wrote:

Sorry, I messed this up.
The correct sequence is following.

1. p1 executes l1
2. p1 executes l2 with failure
3. p2 executes l2 with success
4. p2 execute l3, but doesn't see the results of step 1, because 3
didn't provide enough of memory barrier

Did you mean because 2) didn't provide enough of a memory barrier? Because 3)
does, right?

Yes, exactly.

You could get into exactly the same situation if p1 is scheduled out by the
OS after step 1, no?

No. In that case, p1 will execute l2 with success. p1 executes l2
with failure only because it goes before p2 executes l2.

------
Regards,
Alexander Korotkov
Supabase

#11Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#4)
Re: pg_atomic_compare_exchange_*() and memory barriers

On Fri, Mar 7, 2025 at 7:07 PM Andres Freund <andres@anarazel.de> wrote:

[...]

Thinking more about it I wonder if the behaviour of not doing a release in
case the atomic fails *might* arguably actually be correct - if the
compare-exchange fails, there's nothing that the non-atomic values could be
ordered against.

There is another successful compare-exchange executed by a concurrent
process to get ordered against.

------
Regards,
Alexander Korotkov
Supabase

#12Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#10)
Re: pg_atomic_compare_exchange_*() and memory barriers

On 2025-03-07 19:44:20 +0200, Alexander Korotkov wrote:

[...]

You could get into exactly the same situation if p1 is scheduled out by the
OS after step 1, no?

No. In that case, p1 will execute l2 with success. p1 executes l2
with failure only because it goes before p2 executes l2.

In my scenario p1 will not execute l2 because p2 gets scheduled before it can
do so. So p1 can't yet execute l2 before p2 executes l2.

Greetings,

Andres Freund

#13Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#12)
Re: pg_atomic_compare_exchange_*() and memory barriers

On Fri, Mar 7, 2025 at 7:46 PM Andres Freund <andres@anarazel.de> wrote:

[...]

You could get into exactly the same situation if p1 is scheduled out by the
OS after step 1, no?

In my scenario p1 will not execute l2 because p2 gets scheduled before it can
do so. So p1 can't yet execute l2 before p2 executes l2.

In my scenario p1.l2 will succeed if executed after p2.l2 and fail if
executed before p2.l2. Imagine the initial value is 0: p1 tries to
change 1 => 2, while p2 tries to change 0 => 1.

------
Regards,
Alexander Korotkov
Supabase

#14Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#13)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi, Andres!

On Fri, Mar 7, 2025 at 7:54 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

[...]

In my scenario p1.l2 will succeed if executed after p2.l2 and fail if
executed before p2.l2. Imagine the initial value is 0: p1 tries to
change 1 => 2, while p2 tries to change 0 => 1.

My explanation was cumbersome from the beginning. Let me come up with a
clearer example.

Let us have a shared memory variable v and a shared memory flag f. The
initial state is as follows.
v = 0
f = false

Two processes p1 and p2 run the following programs in pseudocode. The atomic
compare-exchange operation is represented as CAS(variable, oldval,
newval). In this example we don't need to update oldval on failure or even
know whether the operation was successful.

p1:
    CAS(v, 0, 1)
    if f:
        CAS(v, 1, 2)

p2:
    f = true
    CAS(v, 1, 2)

I think if CAS() implements full barrier semantics, the invariant is that v
eventually gets the value 2: if CAS(v, 1, 2) by p2 goes before CAS(v, 0, 1)
by p1, p1 will reliably see f == true and apply CAS(v, 1, 2).

If CAS() doesn't implement a full barrier on failure, this invariant gets
broken (that is, CAS() by p2 fails, but p1 sees f == false).
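
A runnable C11 translation of this litmus test might look as follows (an
illustrative sketch, not code from the thread; with the default seq_cst
orders on both v and f the assertion is not expected to fire under the C11
model, while the production case differs in that the flag-like variable is
written with a plain pg_atomic_write_u64()).

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static atomic_uint_fast64_t v;
static atomic_bool f;

static void *
run_p1(void *arg)
{
    uint_fast64_t expected = 0;

    (void) arg;
    atomic_compare_exchange_strong(&v, &expected, 1);      /* CAS(v, 0, 1) */
    if (atomic_load(&f))
    {
        expected = 1;
        atomic_compare_exchange_strong(&v, &expected, 2);  /* CAS(v, 1, 2) */
    }
    return NULL;
}

static void *
run_p2(void *arg)
{
    uint_fast64_t expected = 1;

    (void) arg;
    atomic_store(&f, true);
    atomic_compare_exchange_strong(&v, &expected, 2);      /* CAS(v, 1, 2) */
    return NULL;
}

int
main(void)
{
    for (int i = 0; i < 100000; i++)
    {
        pthread_t   t1, t2;

        atomic_store(&v, 0);
        atomic_store(&f, false);
        pthread_create(&t1, NULL, run_p1, NULL);
        pthread_create(&t2, NULL, run_p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        assert(atomic_load(&v) == 2);   /* the invariant described above */
    }
    printf("no invariant violation observed\n");
    return 0;
}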

I'm not an expert in formal specifications of memory models. But I'm quite
surprised we're discussing whether a memory barrier on compare-exchange
failure might matter. For me, at least, the fact
that __atomic_compare_exchange_n() has a failure_memorder argument is
fairly strong evidence that it does.

------
Regards,
Alexander Korotkov
Supabase

#15Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#14)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi,

On 2025-03-08 14:12:13 +0200, Alexander Korotkov wrote:

I'm not an expert in formal specifications of memory models. But I'm quite
surprised we're discussing whether memory barrier on compare-exchange
failure might matter. For me at least the fact
that __atomic_compare_exchange_n() have failure_memorder argument is a
quite an evidence of that.

I wasn't trying to say that the failure memory order doesn't matter, just that
an *acquire* barrier might be strong enough in the failure case if you look at
it from the POV of C++/C11's memory model. The docs for
__atomic_compare_exchange_n say:

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#index-_005f_005fatomic_005fcompare_005fexchange_005fn

Otherwise, false is returned and memory is affected according to
failure_memorder. This memory order cannot be __ATOMIC_RELEASE nor
__ATOMIC_ACQ_REL. It also cannot be a stronger order than that specified by
success_memorder.

Note that the generated code you showed *did* unconditionally execute the load
with acquire semantics.

Which means that that one can argue that this is *NOT* a compiler bug.

From the C/C++ standard atomics model it doesn't make sense to say that a
failed CAS has release semantics, as there simply isn't a write that could be
ordered! What their barriers guarantee is ordering between multiple memory
operations; you can't order multiple writes if you don't have multiple
writes... The synchronization in the C/C++ model is only established between
accesses of the same variable, and there's no write in the case of a failed
CAS, so there's nothing that could establish a release-acquire ordering.

Unfortunately that model doesn't mesh well with barriers that aren't attached
to read/modify operations. Which is what we ended up with...

Greetings,

Andres Freund

#16Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#15)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi,

On 2025-03-08 08:02:41 -0500, Andres Freund wrote:

From the C/C++ standard atomics model it doesn't make sense to say that a
failed CAS has release semantics, as there simply isn't a write that could be
ordered! What their barriers guarantee is ordering between multiple memory
operation, you can't order multiple writes if you don't have multiple
writes... The synchronization in the C/C++ model is only established between
accesses of the same variable and there's no write in the case of a failed
CAS, so there's nothing that could establish a release-acquire ordering.

Unfortunately that model doesn't mesh well with barriers that aren't attached
to read/modify operations. Which is what we ended up with...

Adding a full barrier to a failed CAS would be a rather large overhead,
undesirable in just about any sane algorithm. As a consequence, I think what
we ought to do is redefine the barrier semantics to only imply an acquire
barrier in case the CAS fails.
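
Under such redefined semantics, a caller that genuinely needs full-barrier
behavior on the failure path could add the barrier itself; a hypothetical
sketch (the helper name is made up for illustration):

#include "postgres.h"
#include "port/atomics.h"

/*
 * Illustrative caller-side pattern if the failure path is only an acquire
 * barrier: callers that depend on full-barrier semantics across a failed
 * attempt issue the barrier themselves.
 */
static inline bool
cas_u64_full_barrier(volatile pg_atomic_uint64 *ptr,
                     uint64 *expected, uint64 newval)
{
    if (pg_atomic_compare_exchange_u64(ptr, expected, newval))
        return true;
    pg_memory_barrier();        /* upgrade the acquire-only failure path */
    return false;
}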

Greetings,

Andres Freund

#17Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#16)
Re: pg_atomic_compare_exchange_*() and memory barriers

On Sat, Mar 8, 2025 at 3:41 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-03-08 08:02:41 -0500, Andres Freund wrote:

From the C/C++ standard atomics model it doesn't make sense to say that a
failed CAS has release semantics, as there simply isn't a write that could be
ordered! What their barriers guarantee is ordering between multiple memory
operation, you can't order multiple writes if you don't have multiple
writes... The synchronization in the C/C++ model is only established between
accesses of the same variable and there's no write in the case of a failed
CAS, so there's nothing that could establish a release-acquire ordering.

Unfortunately that model doesn't mesh well with barriers that aren't attached
to read/modify operations. Which is what we ended up with...

Adding a full barrier to failed CAS would be a rather large overhead,
undesirable in just about any sane algorithm. As a consequence, I think what
we ought to do is to redefine the barrier semantics to only imply an acquire
barrier in case CAS fails.

Thank you, I'm good with this solution. Can I leave this to you? I
don't feel confident enough to word this correctly.

------
Regards,
Alexander Korotkov
Supabase

#18Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#17)
Re: pg_atomic_compare_exchange_*() and memory barriers

Hi,

On 2025-03-08 17:06:38 +0200, Alexander Korotkov wrote:

On Sat, Mar 8, 2025 at 3:41 PM Andres Freund <andres@anarazel.de> wrote:

On 2025-03-08 08:02:41 -0500, Andres Freund wrote:

From the C/C++ standard atomics model it doesn't make sense to say that a
failed CAS has release semantics, as there simply isn't a write that could be
ordered! What their barriers guarantee is ordering between multiple memory
operation, you can't order multiple writes if you don't have multiple
writes... The synchronization in the C/C++ model is only established between
accesses of the same variable and there's no write in the case of a failed
CAS, so there's nothing that could establish a release-acquire ordering.

Unfortunately that model doesn't mesh well with barriers that aren't attached
to read/modify operations. Which is what we ended up with...

Adding a full barrier to failed CAS would be a rather large overhead,
undesirable in just about any sane algorithm. As a consequence, I think what
we ought to do is to redefine the barrier semantics to only imply an acquire
barrier in case CAS fails.

Thank you, I'm good with this solution. Can I leave this on you? I'm
not feeling myself strong to word this correctly.

Not in the next ~four weeks. If you ping me afterwards, I can give it a go.

FWIW, I am fairly certain that any non-toy algorithm that requires a full
memory barrier instead of just an acquire in case of a CAS failure is chock
full of concurrency bugs.

Greetings,

Andres Freund

#19James Hunter
james.hunter.pg@gmail.com
In reply to: Andres Freund (#18)
Re: pg_atomic_compare_exchange_*() and memory barriers

On Sat, Mar 8, 2025 at 7:21 AM Andres Freund <andres@anarazel.de> wrote:

FWIW, I am fairly certain that any non-toy algorithm that requires a full
memory barrier instead of just an acquire in case of a CAS failure is chock
full of concurrency bugs.

Yeah -- off the top of my head, I can think of only two CAS patterns:
(1) retry the CAS until success (in which case the memory semantics of
a CAS failure don't matter); or (2) whoever wins the CAS is
responsible for doing some work. But, in (2), there's no reason to
expect that the "winner" has *completed* the work, so the memory
semantics of a CAS failure don't matter, since you need some other way
to say that the work has been completed.
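
A generic sketch of pattern (1), where only the final successful attempt has
to provide ordering (illustration only; the function below is not existing
PostgreSQL code, and pg_atomic_fetch_or_u32() would of course do this
particular job directly):

#include "postgres.h"
#include "port/atomics.h"

/*
 * Pattern (1): retry until the CAS succeeds.  The loop only exits through a
 * successful compare-exchange, so the memory semantics of the failed
 * attempts don't matter here.
 */
static void
set_flag_bit(volatile pg_atomic_uint32 *flags, uint32 bit)
{
    uint32      oldval = pg_atomic_read_u32(flags);

    /* On failure, oldval is updated to the current value; just retry. */
    while (!pg_atomic_compare_exchange_u32(flags, &oldval, oldval | bit))
        ;
}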

Barriers are useful for seqlocks [1], which (IIRC) is the same
technique PostgreSQL uses for PG_STAT_BEGIN_{read,WRITE}_ACTIVITY. But
that's when you check the control (sequence) variable both before and
*after* touching the data it protects.
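
A minimal sketch of that technique, in the spirit of the pgstat activity
macros (illustrative only; the type and functions below are not existing
PostgreSQL code):

#include "postgres.h"
#include "port/atomics.h"

typedef struct SeqlockedPair
{
    int         changecount;    /* even = stable, odd = write in progress */
    int         a;
    int         b;
} SeqlockedPair;

/* Single writer: bump to odd, update the payload, bump back to even. */
static void
seqlock_write(SeqlockedPair *p, int a, int b)
{
    p->changecount++;           /* now odd: a write is in progress */
    pg_write_barrier();         /* payload stores must not move before this */
    p->a = a;
    p->b = b;
    pg_write_barrier();         /* payload stores must complete before... */
    p->changecount++;           /* ...the counter becomes even again */
}

/* Readers retry until they see an even, unchanged counter around the copy. */
static void
seqlock_read(SeqlockedPair *p, int *a, int *b)
{
    int         before;
    int         after;

    do
    {
        before = p->changecount;
        pg_read_barrier();      /* payload loads must not move before this */
        *a = p->a;
        *b = p->b;
        pg_read_barrier();      /* payload loads must complete before... */
        after = p->changecount; /* ...re-checking the counter */
    } while (before != after || (before & 1) != 0);
}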

James

[1]: https://en.wikipedia.org/wiki/Seqlock