Atomic operations within spinlocks
In connection with the nearby thread about spinlock coding rule
violations, I noticed that we have several places that are doing
things like this:
SpinLockAcquire(&WalRcv->mutex);
...
written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
...
SpinLockRelease(&WalRcv->mutex);
That's from pg_stat_get_wal_receiver(); there is similar code in
freelist.c's ClockSweepTick() and StrategySyncStart().
This seems to me to be very bad code. In the first place, on machines
without the appropriate type of atomic operation, atomics.c is going
to be using a spinlock to emulate atomicity, which means this code
tries to take a spinlock while holding another one. That's not okay,
either from the standpoint of performance or error-safety. In the
second place, this coding seems to me to indicate serious confusion
about which lock is protecting what. In the above example, if
writtenUpto is only accessed through atomic operations then it seems
like we could just move the pg_atomic_read_u64 out of the spinlock
section; or if the spinlock is adequate protection then we could just
do a normal fetch. If we actually need both locks then this needs
significant re-thinking, IMO.
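For illustration, a minimal sketch of the first option (assuming writtenUpto
really is only accessed through atomics everywhere else) would be to read it
only after the spinlock is released:
SpinLockAcquire(&WalRcv->mutex);
... copy the other spinlock-protected fields ...
SpinLockRelease(&WalRcv->mutex);
/* no lock needed if writtenUpto is atomic-only */
written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);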
Comments?
regards, tom lane
Hi,
On 2020-06-03 14:19:45 -0400, Tom Lane wrote:
In connection with the nearby thread about spinlock coding rule
violations, I noticed that we have several places that are doing
things like this:
SpinLockAcquire(&WalRcv->mutex);
...
written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
...
SpinLockRelease(&WalRcv->mutex);
That's from pg_stat_get_wal_receiver(); there is similar code in
freelist.c's ClockSweepTick() and StrategySyncStart().
This seems to me to be very bad code. In the first place, on machines
without the appropriate type of atomic operation, atomics.c is going
to be using a spinlock to emulate atomicity, which means this code
tries to take a spinlock while holding another one. That's not okay,
either from the standpoint of performance or error-safety.
I'm honestly not particularly concerned about either of those in
general:
- WRT performance: Which platforms where we care about performance can't
do a tear-free read of a 64bit integer, and thus need a spinlock for
this? And while the cases in freelist.c aren't just reads, they are
really cold paths (clock wrapping around).
- WRT error safety: What could happen here? The atomics codepath is
no-fail code? And nothing should ever nest inside the atomic-emulation
spinlocks. What am I missing?
I think straight out prohibiting use of atomics inside a spinlock will
lead to more complicated and slower code, rather than improving
anything. I'm to blame for the ClockSweepTick() case, and I don't really
see how we could avoid the atomic-while-spinlocked without relying on
64bit atomics, which are emulated on more platforms, and without having
a more complicated retry loop.
In the second place, this coding seems to me to indicate serious
confusion about which lock is protecting what. In the above example,
if writtenUpto is only accessed through atomic operations then it
seems like we could just move the pg_atomic_read_u64 out of the
spinlock section; or if the spinlock is adequate protection then we
could just do a normal fetch. If we actually need both locks then
this needs significant re-thinking, IMO.
Yea, it doesn't seem necessary at all that writtenUpto is read with the
spinlock held. That's very recent:
commit 2c8dd05d6cbc86b7ad21cfd7010e041bb4c3950b
Author: Michael Paquier <michael@paquier.xyz>
Date: 2020-05-17 09:22:07 +0900
Make pg_stat_wal_receiver consistent with the WAL receiver's shmem info
and I assume just was caused by mechanical replacement, rather than
intentionally doing so while holding the spinlock. As far as I can tell
none of the other writtenUpto accesses are under the spinlock.
Greetings,
Andres Freund
On Thu, Jun 4, 2020 at 8:45 AM Andres Freund <andres@anarazel.de> wrote:
On 2020-06-03 14:19:45 -0400, Tom Lane wrote:
In the second place, this coding seems to me to indicate serious
confusion about which lock is protecting what. In the above example,
if writtenUpto is only accessed through atomic operations then it
seems like we could just move the pg_atomic_read_u64 out of the
spinlock section; or if the spinlock is adequate protection then we
could just do a normal fetch. If we actually need both locks then
this needs significant re-thinking, IMO.
Yea, it doesn't seem necessary at all that writtenUpto is read with the
spinlock held. That's very recent:
commit 2c8dd05d6cbc86b7ad21cfd7010e041bb4c3950b
Author: Michael Paquier <michael@paquier.xyz>
Date: 2020-05-17 09:22:07 +0900
Make pg_stat_wal_receiver consistent with the WAL receiver's shmem info
and I assume just was caused by mechanical replacement, rather than
intentionally doing so while holding the spinlock. As far as I can tell
none of the other writtenUpto accesses are under the spinlock.
Yeah. It'd be fine to move that after the spinlock release. Although
it's really just for informational purposes only, not for any data
integrity purpose, reading it before the spinlock acquisition would
theoretically allow it to appear to be (reportedly) behind
flushedUpto, which would be silly.
On Thu, Jun 04, 2020 at 09:40:31AM +1200, Thomas Munro wrote:
Yeah. It'd be fine to move that after the spinlock release. Although
it's really just for informational purposes only, not for any data
integrity purpose, reading it before the spinlock acquisition would
theoretically allow it to appear to be (reportedly) behind
flushedUpto, which would be silly.
Indeed. This could just be done after the spinlock section. Sorry
about that.
--
Michael
Andres Freund <andres@anarazel.de> writes:
On 2020-06-03 14:19:45 -0400, Tom Lane wrote:
This seems to me to be very bad code.
I think straight out prohibiting use of atomics inside a spinlock will
lead to more complicated and slower code, rather than improving
anything. I'm to blame for the ClockSweepTick() case, and I don't really
see how we could avoid the atomic-while-spinlocked without relying on
64bit atomics, which are emulated on more platforms, and without having
a more complicated retry loop.
Well, if you don't want to touch freelist.c then there is no point
worrying about what pg_stat_get_wal_receiver is doing. But having
now studied that freelist.c code, I don't like it one bit. It's
overly complicated and not very accurately commented, making it
*really* hard to convince oneself that it's not buggy.
I think a good case could be made for ripping out what's there now
and just redefining nextVictimBuffer as a pg_atomic_uint64 that we
never reset (ie, make its comment actually true). Then ClockSweepTick
reduces to
pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1) % NBuffers
and when we want to know how many cycles have been completed, we just
divide the counter by NBuffers. Now admittedly, this is relying on both
pg_atomic_fetch_add_u64 being fast and int64 division being fast ... but
to throw your own argument back at you, do we really care anymore about
performance on machines where those things aren't true? Furthermore,
since all this is happening in a code path that's going to lead to I/O,
I'm not exactly convinced that a few nanoseconds matter anyway.
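A sketch of what I have in mind (hypothetical, details and error handling
omitted):
/* nextVictimBuffer as a pg_atomic_uint64 that is never reset */
static inline uint32
ClockSweepTick(void)
{
	uint64		counter;

	counter = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);

	return (uint32) (counter % NBuffers);
}

/* and wherever the number of completed passes is wanted: */
complete_passes = (uint32) (pg_atomic_read_u64(&StrategyControl->nextVictimBuffer) / NBuffers);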
regards, tom lane
I wrote:
I think a good case could be made for ripping out what's there now
and just redefining nextVictimBuffer as a pg_atomic_uint64 that we
never reset (ie, make its comment actually true). Then ClockSweepTick
reduces to
pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1) % NBuffers
and when we want to know how many cycles have been completed, we just
divide the counter by NBuffers.
Actually ... we could probably use this design with a uint32 counter
as well, on machines where the 64-bit operations would be slow.
In that case, integer overflow of nextVictimBuffer would happen from
time to time, resulting in
1. The next actual victim buffer index would jump strangely. This
doesn't seem like it'd matter at all, as long as it was infrequent.
2. The computed completePasses value would go backwards. I bet
that wouldn't matter too much either, or at least we could teach
BgBufferSync to cope. (I notice the comments therein suggest that
it is already designed to cope with completePasses wrapping around,
so maybe nothing needs to be done.)
If NBuffers was large enough to be a significant fraction of UINT_MAX,
then these corner cases would happen often enough to perhaps be
problematic. But I seriously doubt that'd be possible on hardware
that wasn't capable of using the 64-bit code path.
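In sketch form (hypothetical), the 32-bit variant is the same thing with a
narrower counter, simply tolerating overflow:
counter = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
victim = counter % NBuffers;		/* jumps oddly once per 2^32 ticks */
complete_passes = counter / NBuffers;	/* can go backwards at wraparound */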
regards, tom lane
Hi,
On 2020-06-04 13:57:19 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2020-06-03 14:19:45 -0400, Tom Lane wrote:
This seems to me to be very bad code.
I think straight out prohibiting use of atomics inside a spinlock will
lead to more complicated and slower code, rather than improving
anything. I'm to blame for the ClockSweepTick() case, and I don't really
see how we could avoid the atomic-while-spinlocked without relying on
64bit atomics, which are emulated on more platforms, and without having
a more complicated retry loop.
Well, if you don't want to touch freelist.c then there is no point
worrying about what pg_stat_get_wal_receiver is doing. But having
now studied that freelist.c code, I don't like it one bit. It's
overly complicated and not very accurately commented, making it
*really* hard to convince oneself that it's not buggy.
I think a good case could be made for ripping out what's there now
and just redefining nextVictimBuffer as a pg_atomic_uint64 that we
never reset (ie, make its comment actually true). Then ClockSweepTick
reduces to
Note that we can't do that in the older back branches, there wasn't any
64bit atomics fallback before
commit e8fdbd58fe564a29977f4331cd26f9697d76fc40
Author: Andres Freund <andres@anarazel.de>
Date: 2017-04-07 14:44:47 -0700
Improve 64bit atomics support.
pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1) % NBuffers
and when we want to know how many cycles have been completed, we just
divide the counter by NBuffers. Now admittedly, this is relying on both
pg_atomic_fetch_add_u64 being fast and int64 division being fast ... but
to throw your own argument back at you, do we really care anymore about
performance on machines where those things aren't true? Furthermore,
since all this is happening in a code path that's going to lead to I/O,
I'm not exactly convinced that a few nanoseconds matter anyway.
It's very easy to observe this code being a bottleneck. If we only
performed a single clock tick before IO, sure, then the cost would
obviously be swamped by the IO cost. But it's pretty common to end up
having to do that ~ NBuffers * 5 times for a single buffer.
I don't think it's realistic to rely on 64bit integer division being
fast in this path. The latency is pretty darn significant (64bit div is
35-88 cycles on skylake-x, 64bit idiv 42-95). And unless I
misunderstand, you'd have to do so (for % NBuffers) every single clock
tick, not just when we wrap around.
We could however avoid the spinlock if we were to use 64bit atomics, by
storing the number of passes in the upper 32bit, and the next victim
buffer in the lower. But that doesn't seem that simple either, and will
regress performance on 32bit platforms.
I don't like the whole strategy logic at all, so I guess there's some
argument to improve things from that end. It's probably possible to
avoid the lock with 32bit atomics as well.
I'd still like to know which problem we're actually trying to solve
here. I don't understand the "error" issues you mentioned upthread.
Greetings,
Andres Freund
Hi,
On 2020-06-04 14:50:40 -0400, Tom Lane wrote:
Actually ... we could probably use this design with a uint32 counter
as well, on machines where the 64-bit operations would be slow.
On skylake-x even a 32bit [i]div is still 26 cycles. That's more than an
atomic operation's 18 cycles.
2. The computed completePasses value would go backwards. I bet
that wouldn't matter too much either, or at least we could teach
BgBufferSync to cope. (I notice the comments therein suggest that
it is already designed to cope with completePasses wrapping around,
so maybe nothing needs to be done.)
If we're not concerned about that, then we can remove the
atomic-inside-spinlock, I think. The only reason for that right now is
to avoid assuming a wrong pass number.
I don't think completePasses wrapping around is comparable in frequency
to wrapping around nextVictimBuffer. It's not really worth worrying
about bgwriter wrongly assuming it lapped the clock sweep once every
UINT32_MAX * NBuffers ticks, but there being a risk every NBuffers seems
worth worrying about.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I'd still like to know which problem we're actually trying to solve
here. I don't understand the "error" issues you mentioned upthread.
If you error out of getting the inner spinlock, the outer spinlock
is stuck, permanently, because there is no mechanism for spinlock
release during transaction abort. Admittedly it's not very likely
for the inner acquisition to fail, but it's possible. Aside from
timeout scenarios (e.g., process holding lock gets swapped out to
Timbuktu), it could be that both spinlocks are mapped onto a single
implementation lock by spin.c, which notes
* We map all spinlocks onto a set of NUM_SPINLOCK_SEMAPHORES semaphores.
* It's okay to map multiple spinlocks onto one semaphore because no process
* should ever hold more than one at a time.
You've falsified that argument ... and no, I don't want to upgrade
the spinlock infrastructure enough to make this OK. We shouldn't
ever be holding spinlocks long enough, or doing anything complicated
enough inside them, for such an upgrade to have merit.
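Concretely (with hypothetical variable names), the semaphore-emulation
scenario I'm worried about looks like this:
SpinLockAcquire(&outer_lock);		/* acquires, say, semaphore k */
...
/*
 * Emulated atomics take their own spinlock internally; if spin.c happened
 * to map that spinlock onto semaphore k as well, the inner acquisition can
 * never succeed while we hold the outer lock.
 */
value = pg_atomic_read_u64(&some_shared_counter);
...
SpinLockRelease(&outer_lock);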
regards, tom lane
Andres Freund <andres@anarazel.de> writes:
On 2020-06-04 14:50:40 -0400, Tom Lane wrote:
2. The computed completePasses value would go backwards. I bet
that wouldn't matter too much either, or at least we could teach
BgBufferSync to cope. (I notice the comments therein suggest that
it is already designed to cope with completePasses wrapping around,
so maybe nothing needs to be done.)
If we're not concerned about that, then we can remove the
atomic-inside-spinlock, I think. The only reason for that right now is
to avoid assuming a wrong pass number.
Hmm. That might be a less-invasive path to a solution. I can take
a look, if you don't want to.
regards, tom lane
Hi,
On 2020-06-04 15:07:34 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I'd still like to know which problem we're actually trying to solve
here. I don't understand the "error" issues you mentioned upthread.
If you error out of getting the inner spinlock, the outer spinlock
is stuck, permanently, because there is no mechanism for spinlock
release during transaction abort. Admittedly it's not very likely
for the inner acquisition to fail, but it's possible.
We PANIC on stuck spinlocks, so I don't think that's a problem.
* We map all spinlocks onto a set of NUM_SPINLOCK_SEMAPHORES semaphores.
* It's okay to map multiple spinlocks onto one semaphore because no process
* should ever hold more than one at a time.
You've falsified that argument ... and no, I don't want to upgrade
the spinlock infrastructure enough to make this OK.
Well, theoretically we take care to avoid this problem. That's why we
have a separate define for spinlocks and atomics:
/*
* When we don't have native spinlocks, we use semaphores to simulate them.
* Decreasing this value reduces consumption of OS resources; increasing it
* may improve performance, but supplying a real spinlock implementation is
* probably far better.
*/
#define NUM_SPINLOCK_SEMAPHORES 128
/*
* When we have neither spinlocks nor atomic operations support we're
* implementing atomic operations on top of spinlock on top of semaphores. To
* be safe against atomic operations while holding a spinlock separate
* semaphores have to be used.
*/
#define NUM_ATOMICS_SEMAPHORES 64
and
#ifndef HAVE_SPINLOCKS
/*
* NB: If we're using semaphore based TAS emulation, be careful to use a
* separate set of semaphores. Otherwise we'd get in trouble if an atomic
* var would be manipulated while spinlock is held.
*/
s_init_lock_sema((slock_t *) &ptr->sema, true);
#else
SpinLockInit((slock_t *) &ptr->sema);
#endif
But it looks like that code is currently buggy (and looks like it always
has been), because we don't look at the nested argument when
initializing the semaphore. So we currently allocate too many
semaphores, without benefiting from them :(.
We shouldn't ever be holding spinlocks long enough, or doing anything
complicated enough inside them, for such an upgrade to have merit.
Well, I don't think atomic instructions are that complicated. And I
think prohibiting atomics-within-spinlock adds a problematic
restriction, without much in the way of benefits:
There are plenty of things where it's somewhat easy to make the fast-path
lock-free, but the slow path still needs a lock (e.g. around a
freelist). And for those it's really useful to still be able to have a
coherent update to an atomic variable, to synchronize with the fast-path
that doesn't take the spinlock.
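To illustrate the pattern I mean, here's a generic sketch (not code from the
tree; demo_try_fast_alloc() and demo_pop_freelist() are made up):
typedef struct DemoShared
{
	slock_t		mutex;			/* protects the freelist */
	int			freelist;
	pg_atomic_uint32 nallocated;	/* also read/updated by the fast path */
} DemoShared;

static void
demo_alloc(DemoShared *shared)
{
	/* fast path: entirely lock-free */
	if (demo_try_fast_alloc(shared))
	{
		pg_atomic_fetch_add_u32(&shared->nallocated, 1);
		return;
	}

	/*
	 * Slow path: the freelist needs the spinlock, but the counter has to
	 * stay coherent with what the lock-free fast path sees, hence the
	 * atomic update while the spinlock is held.
	 */
	SpinLockAcquire(&shared->mutex);
	demo_pop_freelist(shared);
	pg_atomic_fetch_add_u32(&shared->nallocated, 1);
	SpinLockRelease(&shared->mutex);
}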
Greetings,
Andres Freund
Hi,
On 2020-06-04 15:13:29 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2020-06-04 14:50:40 -0400, Tom Lane wrote:
2. The computed completePasses value would go backwards. I bet
that wouldn't matter too much either, or at least we could teach
BgBufferSync to cope. (I notice the comments therein suggest that
it is already designed to cope with completePasses wrapping around,
so maybe nothing needs to be done.)
If we're not concerned about that, then we can remove the
atomic-inside-spinlock, I think. The only reason for that right now is
to avoid assuming a wrong pass number.
Hmm. That might be a less-invasive path to a solution. I can take
a look, if you don't want to.
First, I think it would be problematic:
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
...
/*
* Compute strategy_delta = how many buffers have been scanned by the
* clock sweep since last time. If first time through, assume none. Then
* see if we are still ahead of the clock sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
*/
if (saved_info_valid)
{
int32 passes_delta = strategy_passes - prev_strategy_passes;
strategy_delta = strategy_buf_id - prev_strategy_buf_id;
strategy_delta += (long) passes_delta * NBuffers;
Assert(strategy_delta >= 0);
ISTM that if we can get an out-of-sync strategy_passes and
strategy_buf_id we'll end up with a pretty wrong strategy_delta. Which,
I think, can cause bgwriter's position to be reset:
else
{
/*
* We're behind, so skip forward to the strategy point and start
* cleaning from there.
*/
#ifdef BGW_DEBUG
elog(DEBUG2, "bgwriter behind: bgw %u-%u strategy %u-%u delta=%ld",
next_passes, next_to_clean,
strategy_passes, strategy_buf_id,
strategy_delta);
#endif
next_to_clean = strategy_buf_id;
next_passes = strategy_passes;
bufs_to_lap = NBuffers;
}
While I think that the whole logic in BgBufferSync doesn't make a whole
lot of sense, it does seem to me this has a fair potential to make it
worse. In a scenario with a decent cache hit ratio (leading to high
usagecounts) and a not that large NBuffers, we can end up doing quite
a few passes (as in many per second), so it might not be that hard to hit
this.
I am not immediately coming up with a cheap solution that doesn't do the
atomics-within-spinlock thing. The best I can come up with is using a
64bit atomic, with the upper 32bit containing the number of passes, and
the lower 32bit containing the current buffer. Where the lower 32bit /
the buffer is handled like it currently is, i.e. we "try" to keep it
below NBuffers. So % is only used for the "cold" path. That'd just add a
64->32 bit cast in the hot path, which shouldn't be measurable. But it'd
regress platforms without 64bit atomics substantially.
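In sketch form (hypothetical; "packedClock" is a made-up field name, and the
wraparound handling is simplified):
uint64	old = pg_atomic_fetch_add_u64(&StrategyControl->packedClock, 1);
uint32	victim = (uint32) old;			/* hot path: just a 64->32 bit cast */
uint32	passes = (uint32) (old >> 32);

if (victim >= NBuffers)
{
	/*
	 * Cold path, analogous to the current wraparound handling: try to move
	 * the counter to the start of the next pass.  Losing the race is fine,
	 * somebody else already wrapped it.  Only here do we pay for % NBuffers.
	 */
	uint64	expected = old + 1;

	pg_atomic_compare_exchange_u64(&StrategyControl->packedClock, &expected,
								   ((uint64) (passes + 1)) << 32);
	victim %= NBuffers;
}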
We could obviously also just rewrite the BgBufferSync() logic, so it
doesn't care about things like "lapping", but that's not an easy change.
So the best I can really suggest, unless we were to agree on atomics
being ok inside spinlocks, is probably to just replace the spinlock with
an lwlock. That'd perhaps cause a small slowdown for a few cases, but
it'd make workload that e.g. use the freelist a lot (e.g. when tables
are dropped regularly) scale better.
Greetings,
Andres Freund
Hi,
On 2020-06-04 19:33:02 -0700, Andres Freund wrote:
But it looks like that code is currently buggy (and looks like it always
has been), because we don't look at the nested argument when
initializing the semaphore. So we currently allocate too many
semaphores, without benefiting from them :(.
I wrote a patch for this, and when I got around to testing it, I
found that our tests currently don't pass when using both
--disable-spinlocks and --disable-atomics. Turns out to not be related
to the issue above, but the global barrier support added in 13.
That *reads* two 64 bit atomics in a signal handler. Which is normally
fine, but not at all cool when atomics (or just 64 bit atomics) are
backed by spinlocks. Because we can "self interrupt" while already
holding the spinlock.
It looks to me that that's a danger whenever 64bit atomics are backed by
spinlocks, not just when both --disable-spinlocks and --disable-atomics
are used. But I suspect that it's really hard to hit the tiny window of
danger when those options aren't used. While we have buildfarm animals
testing each of those separately, we don't have one that tests both
together...
I'm not really sure what to do about that issue. The easiest thing
would probably be to change the barrier generation to 32bit (which
doesn't have to use locks for reads in any situation). I tested doing
that, and it fixes the hangs for me.
Randomly noticed while looking at the code:
uint64 flagbit = UINT64CONST(1) << (uint64) type;
that shouldn't be 64bit, right?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I wrote a patch for this, and when I got around to testing it, I
found that our tests currently don't pass when using both
--disable-spinlocks and --disable-atomics. Turns out to not be related
to the issue above, but the global barrier support added in 13.
That *reads* two 64 bit atomics in a signal handler. Which is normally
fine, but not at all cool when atomics (or just 64 bit atomics) are
backed by spinlocks. Because we can "self interrupt" while already
holding the spinlock.
This is the sort of weird platform-specific problem that I'd prefer to
avoid by minimizing our expectations of what spinlocks can be used for.
I'm not really sure what to do about that issue. The easiest thing
would probably be to change the barrier generation to 32bit (which
doesn't have to use locks for reads in any situation).
Yeah, I think we need a hard rule that you can't use a spinlock in
an interrupt handler --- which means no atomics that don't have
non-spinlock implementations on every platform.
At some point I think we'll have to give up --disable-spinlocks;
it's really of pretty marginal use (how often does anyone port PG
to a new CPU type?) and the number of weird interactions it adds
in this area seems like more than it's worth. But of course
requiring 64-bit atomics is still a step too far.
Randomly noticed while looking at the code:
uint64 flagbit = UINT64CONST(1) << (uint64) type;
I'm surprised we didn't get any compiler warnings about that.
regards, tom lane
Hi,
On 2020-06-05 21:01:56 -0400, Tom Lane wrote:
I'm not really sure what to do about that issue. The easiest thing
would probably be to change the barrier generation to 32bit (which
doesn't have to use locks for reads in any situation).
Yeah, I think we need a hard rule that you can't use a spinlock in
an interrupt handler --- which means no atomics that don't have
non-spinlock implementations on every platform.
Yea, that might be the easiest thing to do. The only other thing I can
think of would be to mask all signals for the duration of the
atomic-using-spinlock operation. That'd make the fallback noticeably more
expensive, but otoh, do we care enough?
I think a SIGNAL_HANDLER_BEGIN(); SIGNAL_HANDLER_END(); to back an
Assert(!InSignalHandler()); could be quite useful. Could also save
errno etc.
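Rough sketch of what I mean (names invented, details hand-waved):
extern volatile sig_atomic_t in_signal_handler_depth;

#define SIGNAL_HANDLER_BEGIN() \
	do { \
		int		save_errno = errno; \
		in_signal_handler_depth++

#define SIGNAL_HANDLER_END() \
		in_signal_handler_depth--; \
		errno = save_errno; \
	} while (0)

#define InSignalHandler()	(in_signal_handler_depth > 0)

Then e.g. the spinlock-emulated 64bit atomics could Assert(!InSignalHandler()).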
At some point I think we'll have to give up --disable-spinlocks; it's
really of pretty marginal use (how often does anyone port PG to a new
CPU type?) and the number of weird interactions it adds in this area
seems like more than it's worth.
Indeed. And any new architecture one would port PG to would have good
enough compiler intrinsics to make that trivial. I still think it'd make
sense to have a fallback implementation using compiler intrinsics...
And I think we should just require 32bit atomics at the same time. Would
probably kill gaur though.
I did just find a longstanding bug in the spinlock emulation code:
void
s_init_lock_sema(volatile slock_t *lock, bool nested)
{
static int counter = 0;
*lock = ((++counter) % NUM_SPINLOCK_SEMAPHORES) + 1;
}
void
s_unlock_sema(volatile slock_t *lock)
{
int lockndx = *lock;
if (lockndx <= 0 || lockndx > NUM_SPINLOCK_SEMAPHORES)
elog(ERROR, "invalid spinlock number: %d", lockndx);
PGSemaphoreUnlock(SpinlockSemaArray[lockndx - 1]);
}
I don't think it's ok that counter is a signed integer... While it maybe
used to be unlikely that we ever have that many spinlocks, I don't think
it's that hard anymore, because we dynamically allocate them for a lot
of parallel query stuff. A small regression test that initializes
enough spinlocks indeed errors out with
2020-06-05 18:08:29.110 PDT [734946][3/2:0] ERROR: invalid spinlock number: -126
2020-06-05 18:08:29.110 PDT [734946][3/2:0] STATEMENT: SELECT test_atomic_ops();
Randomly noticed while looking at the code:
uint64 flagbit = UINT64CONST(1) << (uint64) type;
I'm surprised we didn't get any compiler warnings about that.
Unfortunately I don't think one can currently compile postgres with
warnings for "implicit casts" enabled :(.
But of course requiring 64-bit atomics is still a step too far.
If we had a 32bit compare-exchange it ought to be possible to write a
signal-safe emulation of 64bit atomics. I think. Something *roughly*
like:
typedef struct pg_atomic_uint64
{
/*
* Meaning of state bits:
* 0-1: current valid
* 2-4: current proposed
* 5: in signal handler
* 6-31: pid of proposer
*/
pg_atomic_uint32 state;
/*
* One current value, two different proposed values.
*/
uint64 value[3];
} pg_atomic_uint64;
The update protocol would be something roughly like:
bool
pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 *expected, uint64 newval)
{
while (true)
{
uint32 old_state = pg_atomic_read_u32(&ptr->state);
uint32 updater_pid = PID_FROM_STATE(old_state);
uint32 new_state;
uint64 old_value;
int proposing;
/*
* Value changed, so fail. This is obviously racy, but we'll
* notice concurrent updates later.
*/
if (ptr->value[VALID_FIELD(old_state)] != *expected)
{
return false;
}
if (updater_pid == INVALID_PID)
{
new_state = old_state;
/* signal that current process is updating */
new_state |= MyProcPid >> PID_STATE_SHIFT;
if (InSignalHandler)
new_state |= PROPOSER_IN_SIGNAL_HANDLER_BIT;
/* set which index is being proposed */
new_state = (new_state & ~PROPOSER_BITS) |
NEXT_PROPOSED_FIELD(old_state, &proposing);
/*
* If we successfully can update state to contain our new
* value, we have a right to do so, and can only be
* interrupted by ourselves, in a signal handler.
*/
if (!pg_atomic_compare_exchange(&ptr->state, &old_state, new_state))
{
/* somebody else updated, restart */
continue;
}
old_state = new_state;
/*
* It's ok to compare the values now. If we are interrupted
* by a signal handler, we'll notice when updating
* state. There's no danger updating the same proposed value
* in two processes, because they always would get different
* offsets to propose into.
*/
ptr->value[proposing] = newval;
/* set the valid field to the one we just filled in */
new_state = (new_state & ~VALID_FIELD_BITS) | proposed;
/* remove ourselves as updater */
new_state &= UPDATER_BITS;
if (!pg_atomic_compare_exchange(&ptr->state, &old_state, new_state))
{
/*
* Should only happen when we were interrupted by this
* processes' handler.
*/
Assert(!InSignalHandler);
/*
* Signal handler must have cleaned out pid as updater.
*/
Assert(PID_FROM_STATE(old_state) != MyProcPid);
continue;
}
else
{
return true;
}
}
else if (PID_FROM_STATE(current_state) == MyProcPid)
{
/*
* This should only happen when in a signal handler. We don't
* currently allow nesting of signal handlers.
*/
Assert(!(current_state & PROPOSER_IN_SIGNAL_HANDLER_BIT));
/* interrupt our own non-signal-handler update */
new_state = old_state | PROPOSER_IN_SIGNAL_HANDLER_BIT;
/* set which index is being proposed */
new_state = (new_state & ~PROPOSER_BITS) |
NEXT_PROPOSED_FIELD(old_state, &proposing);
// FIXME: assert that previous value still was what we assumed
pg_atomic_exchange_u32(&ptr_state.state, new_state);
}
else
{
do
{
pg_spin_delay();
current_state = pg_atomic_read_u32(&ptr->state);
} while (PID_FROM_STATE(current_state) != INVALID_PID)
}
}
}
While that's not trivial, it'd not be that expensive. The happy path
would be two 32bit atomic operations to simulate a 64bit one.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2020-06-05 21:01:56 -0400, Tom Lane wrote:
At some point I think we'll have to give up --disable-spinlocks; it's
really of pretty marginal use (how often does anyone port PG to a new
CPU type?) and the number of weird interactions it adds in this area
seems like more than it's worth.
Indeed. And any new architecture one would port PG to would have good
enough compiler intrinsics to make that trivial. I still think it'd make
sense to have a fallback implementation using compiler intrinsics...
And I think we should just require 32bit atomics at the same time. Would
probably kill gaur though.
Not only gaur. A quick buildfarm survey finds these active members
reporting not having 32-bit atomics:
anole | 2020-06-05 11:20:17 | pgac_cv_gcc_atomic_int32_cas=no
chipmunk | 2020-05-29 22:27:56 | pgac_cv_gcc_atomic_int32_cas=no
curculio | 2020-06-05 22:30:06 | pgac_cv_gcc_atomic_int32_cas=no
frogfish | 2020-05-31 13:00:25 | pgac_cv_gcc_atomic_int32_cas=no
gaur | 2020-05-19 13:33:25 | pgac_cv_gcc_atomic_int32_cas=no
gharial | 2020-06-05 12:41:14 | pgac_cv_gcc_atomic_int32_cas=no
hornet | 2020-06-05 09:11:26 | pgac_cv_gcc_atomic_int32_cas=no
hoverfly | 2020-06-05 22:06:14 | pgac_cv_gcc_atomic_int32_cas=no
locust | 2020-06-05 10:14:29 | pgac_cv_gcc_atomic_int32_cas=no
mandrill | 2020-06-05 09:20:03 | pgac_cv_gcc_atomic_int32_cas=no
prairiedog | 2020-06-05 09:55:49 | pgac_cv_gcc_atomic_int32_cas=no
It looks to me like this is mostly about compiler support not the
hardware; that doesn't make it not a problem, though. (I also
remain skeptical about the quality of the compiler intrinsics
on non-mainstream hardware.)
regards, tom lane
Hi,
On 2020-06-05 22:52:47 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2020-06-05 21:01:56 -0400, Tom Lane wrote:
At some point I think we'll have to give up --disable-spinlocks; it's
really of pretty marginal use (how often does anyone port PG to a new
CPU type?) and the number of weird interactions it adds in this area
seems like more than it's worth.
Indeed. And any new architecture one would port PG to would have good
enough compiler intrinsics to make that trivial. I still think it'd make
sense to have a fallback implementation using compiler intrinsics...
And I think we should just require 32bit atomics at the same time. Would
probably kill gaur though.
Not only gaur. A quick buildfarm survey finds these active members
reporting not having 32-bit atomics:
Hm, I don't think that's the right test. We have bespoke code to support
most of these, I think:
anole | 2020-06-05 11:20:17 | pgac_cv_gcc_atomic_int32_cas=no
Has support via acc specific intrinsics.
chipmunk | 2020-05-29 22:27:56 | pgac_cv_gcc_atomic_int32_cas=no
Doesn't have support for __atomic, but does have support for 32bit
__sync.
gharial | 2020-06-05 12:41:14 | pgac_cv_gcc_atomic_int32_cas=no
__sync support for both 32 and 64 bit.
curculio | 2020-06-05 22:30:06 | pgac_cv_gcc_atomic_int32_cas=no
frogfish | 2020-05-31 13:00:25 | pgac_cv_gcc_atomic_int32_cas=no
__sync support for both 32 and 64 bit.
mandrill | 2020-06-05 09:20:03 | pgac_cv_gcc_atomic_int32_cas=no
__sync support for 32, as well as inline asm for 32bit atomics
(although we might be able to add 64 bit).
hornet | 2020-06-05 09:11:26 | pgac_cv_gcc_atomic_int32_cas=no
hoverfly | 2020-06-05 22:06:14 | pgac_cv_gcc_atomic_int32_cas=no
__sync support for both 32 and 64 bit, and we have open coded ppc asm.
locust | 2020-06-05 10:14:29 | pgac_cv_gcc_atomic_int32_cas=no
prairiedog | 2020-06-05 09:55:49 | pgac_cv_gcc_atomic_int32_cas=no
Wee, these don't have __sync? But I think it should be able to use the
asm ppc implementation for 32 bit atomics.
gaur | 2020-05-19 13:33:25 | pgac_cv_gcc_atomic_int32_cas=no
As far as I understand pa-risc doesn't have any atomic instructions
except for TAS.
So I think gaur is really the only one that'd drop.
It looks to me like this is mostly about compiler support not the
hardware; that doesn't make it not a problem, though. (I also
remain skeptical about the quality of the compiler intrinsics
on non-mainstream hardware.)
I think that's fair enough for really old platforms, but at least for
gcc / clang I don't think it's a huge concern for newer ones. Even if
not mainstream. For gcc/clang the intrinsics basically back the
C11/C++11 "language level" atomics support. And those are extremely
widely used these days.
Greetings,
Andres Freund
On 2020-06-05 17:19:26 -0700, Andres Freund wrote:
Hi,
On 2020-06-04 19:33:02 -0700, Andres Freund wrote:
But it looks like that code is currently buggy (and looks like it always
has been), because we don't look at the nested argument when
initializing the semaphore. So we currently allocate too many
semaphores, without benefiting from them :(.
I wrote a patch for this, and when I got around to testing it, I
found that our tests currently don't pass when using both
--disable-spinlocks and --disable-atomics. Turns out to not be related
to the issue above, but the global barrier support added in 13.
That *reads* two 64 bit atomics in a signal handler. Which is normally
fine, but not at all cool when atomics (or just 64 bit atomics) are
backed by spinlocks. Because we can "self interrupt" while already
holding the spinlock.
It looks to me that that's a danger whenever 64bit atomics are backed by
spinlocks, not just when both --disable-spinlocks and --disable-atomics
are used. But I suspect that it's really hard to hit the tiny window of
danger when those options aren't used. While we have buildfarm animals
testing each of those separately, we don't have one that tests both
together...
I'm not really sure what to do about that issue. The easiest thing
would probably be to change the barrier generation to 32bit (which
doesn't have to use locks for reads in any situation). I tested doing
that, and it fixes the hangs for me.
Randomly noticed while looking at the code:
uint64 flagbit = UINT64CONST(1) << (uint64) type;
that shouldn't be 64bit, right?
Attached is a series of patches addressing these issues, of varying
quality:
1) This fixes the above mentioned issue in the global barrier code by
using 32bit atomics. That might be fine, or it might not. I just
included it here because otherwise the tests cannot be run fully.
2) Fixes spinlock emulation when more than INT_MAX spinlocks are
initialized in the lifetime of a single backend
3) Add spinlock tests to normal regression tests.
- Currently as part of test_atomic_ops. Probably not worth having a
separate SQL function?
- Currently contains a test for 1) that's run when the spinlock
emulation is used. Probably too slow to actually include? Takes 15s
on my computer... OTOH, it's just with --disable-spinlocks...
- Could probably remove the current spinlock tests after this. The
only thing they additionally test is a stuck spinlock. Since
they're not run currently, they don't seem worth much?
4) Fix the potential for deadlocks when using atomics while holding a
spinlock, add tests for that.
Any comments?
Greetings,
Andres Freund
Attachments:
v1-0001-WORKAROUND-Avoid-spinlock-use-in-signal-handler-b.patch (text/x-diff)
From 17a7730c282018d02e3d82990f4fd84e0d2f6ce0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 8 Jun 2020 17:10:26 -0700
Subject: [PATCH v1 1/4] WORKAROUND: Avoid spinlock use in signal handler by
using 32bit generations.
As 64bit atomic reads aren't guaranteed to be atomic on all
platforms we can end up with spinlocks being used inside signal
handlers. Which obviously is dangerous, as that can easily create self
deadlocks.
32bit atomics don't have that problem. But 32bit generations might not
be enough in all situations (?). So this possibly isn't the right
fix.
The problem can easily be reproduced by running make check in a
--disable-spinlocks --disable-atomics build.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/procsignal.h | 4 +--
src/backend/storage/ipc/procsignal.c | 48 ++++++++++++++--------------
2 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index a0c0bc3ce55..c5a7e362dd9 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -65,8 +65,8 @@ extern void ProcSignalInit(int pss_idx);
extern int SendProcSignal(pid_t pid, ProcSignalReason reason,
BackendId backendId);
-extern uint64 EmitProcSignalBarrier(ProcSignalBarrierType type);
-extern void WaitForProcSignalBarrier(uint64 generation);
+extern uint32 EmitProcSignalBarrier(ProcSignalBarrierType type);
+extern void WaitForProcSignalBarrier(uint32 generation);
extern void ProcessProcSignalBarrier(void);
extern void procsignal_sigusr1_handler(SIGNAL_ARGS);
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c809196d06a..eba424e15a7 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -60,7 +60,7 @@ typedef struct
{
pid_t pss_pid;
sig_atomic_t pss_signalFlags[NUM_PROCSIGNALS];
- pg_atomic_uint64 pss_barrierGeneration;
+ pg_atomic_uint32 pss_barrierGeneration;
pg_atomic_uint32 pss_barrierCheckMask;
} ProcSignalSlot;
@@ -72,7 +72,7 @@ typedef struct
*/
typedef struct
{
- pg_atomic_uint64 psh_barrierGeneration;
+ pg_atomic_uint32 psh_barrierGeneration;
ProcSignalSlot psh_slot[FLEXIBLE_ARRAY_MEMBER];
} ProcSignalHeader;
@@ -126,7 +126,7 @@ ProcSignalShmemInit(void)
{
int i;
- pg_atomic_init_u64(&ProcSignal->psh_barrierGeneration, 0);
+ pg_atomic_init_u32(&ProcSignal->psh_barrierGeneration, 0);
for (i = 0; i < NumProcSignalSlots; ++i)
{
@@ -134,7 +134,7 @@ ProcSignalShmemInit(void)
slot->pss_pid = 0;
MemSet(slot->pss_signalFlags, 0, sizeof(slot->pss_signalFlags));
- pg_atomic_init_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+ pg_atomic_init_u32(&slot->pss_barrierGeneration, PG_UINT32_MAX);
pg_atomic_init_u32(&slot->pss_barrierCheckMask, 0);
}
}
@@ -151,7 +151,7 @@ void
ProcSignalInit(int pss_idx)
{
volatile ProcSignalSlot *slot;
- uint64 barrier_generation;
+ uint32 barrier_generation;
Assert(pss_idx >= 1 && pss_idx <= NumProcSignalSlots);
@@ -178,8 +178,8 @@ ProcSignalInit(int pss_idx)
*/
pg_atomic_write_u32(&slot->pss_barrierCheckMask, 0);
barrier_generation =
- pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
- pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
+ pg_atomic_read_u32(&ProcSignal->psh_barrierGeneration);
+ pg_atomic_write_u32(&slot->pss_barrierGeneration, barrier_generation);
pg_memory_barrier();
/* Mark slot with my PID */
@@ -230,7 +230,7 @@ CleanupProcSignalState(int status, Datum arg)
* Make this slot look like it's absorbed all possible barriers, so that
* no barrier waits block on it.
*/
- pg_atomic_write_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX);
+ pg_atomic_write_u32(&slot->pss_barrierGeneration, PG_UINT32_MAX);
slot->pss_pid = 0;
}
@@ -317,11 +317,11 @@ SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
* Callers are entitled to assume that this function will not throw ERROR
* or FATAL.
*/
-uint64
+uint32
EmitProcSignalBarrier(ProcSignalBarrierType type)
{
- uint64 flagbit = UINT64CONST(1) << (uint64) type;
- uint64 generation;
+ uint32 flagbit = 1 << (uint32) type;
+ uint32 generation;
/*
* Set all the flags.
@@ -329,7 +329,7 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
* Note that pg_atomic_fetch_or_u32 has full barrier semantics, so this is
* totally ordered with respect to anything the caller did before, and
* anything that we do afterwards. (This is also true of the later call to
- * pg_atomic_add_fetch_u64.)
+ * pg_atomic_add_fetch_u32.)
*/
for (int i = 0; i < NumProcSignalSlots; i++)
{
@@ -342,7 +342,7 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
* Increment the generation counter.
*/
generation =
- pg_atomic_add_fetch_u64(&ProcSignal->psh_barrierGeneration, 1);
+ pg_atomic_add_fetch_u32(&ProcSignal->psh_barrierGeneration, 1);
/*
* Signal all the processes, so that they update their advertised barrier
@@ -379,16 +379,16 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
* 1 second.
*/
void
-WaitForProcSignalBarrier(uint64 generation)
+WaitForProcSignalBarrier(uint32 generation)
{
long timeout = 125L;
for (int i = NumProcSignalSlots - 1; i >= 0; i--)
{
volatile ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
- uint64 oldval;
+ uint32 oldval;
- oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+ oldval = pg_atomic_read_u32(&slot->pss_barrierGeneration);
while (oldval < generation)
{
int events;
@@ -401,7 +401,7 @@ WaitForProcSignalBarrier(uint64 generation)
timeout, WAIT_EVENT_PROC_SIGNAL_BARRIER);
ResetLatch(MyLatch);
- oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
+ oldval = pg_atomic_read_u32(&slot->pss_barrierGeneration);
if (events & WL_TIMEOUT)
timeout = Min(timeout * 2, 1000L);
}
@@ -428,7 +428,7 @@ WaitForProcSignalBarrier(uint64 generation)
void
ProcessProcSignalBarrier(void)
{
- uint64 generation;
+ uint32 generation;
uint32 flags;
/* Exit quickly if there's no work to do. */
@@ -443,7 +443,7 @@ ProcessProcSignalBarrier(void)
* happens before we atomically extract the flags, and that any subsequent
* state changes happen afterward.
*/
- generation = pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
+ generation = pg_atomic_read_u32(&ProcSignal->psh_barrierGeneration);
flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
/*
@@ -466,7 +466,7 @@ ProcessProcSignalBarrier(void)
* things have changed further, it'll get fixed up when this function is
* next called.
*/
- pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, generation);
+ pg_atomic_write_u32(&MyProcSignalSlot->pss_barrierGeneration, generation);
}
static void
@@ -515,11 +515,11 @@ CheckProcSignalBarrier(void)
if (slot != NULL)
{
- uint64 mygen;
- uint64 curgen;
+ uint32 mygen;
+ uint32 curgen;
- mygen = pg_atomic_read_u64(&slot->pss_barrierGeneration);
- curgen = pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
+ mygen = pg_atomic_read_u32(&slot->pss_barrierGeneration);
+ curgen = pg_atomic_read_u32(&ProcSignal->psh_barrierGeneration);
return (mygen != curgen);
}
--
2.25.0.114.g5b0ca878e0
v1-0002-spinlock-emulation-Fix-bug-when-more-than-INT_MAX.patch (text/x-diff)
From 23ded090b1ee9d91ea4455ddde30427c3756c78a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 8 Jun 2020 15:25:49 -0700
Subject: [PATCH v1 2/4] spinlock emulation: Fix bug when more than INT_MAX
spinlocks are initialized.
Once the counter goes negative we ended up with spinlocks that errored
out on first use (due to check in tas_sema).
Author: Andres Freund
Discussion: https://postgr.es/m/20200606023103.avzrctgv7476xj7i@alap3.anarazel.de
Backpatch: 9.5-
---
src/backend/storage/lmgr/spin.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index 4d2a4c6641a..753943e46d6 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -106,7 +106,7 @@ SpinlockSemaInit(void)
void
s_init_lock_sema(volatile slock_t *lock, bool nested)
{
- static int counter = 0;
+ static uint32 counter = 0;
*lock = ((++counter) % NUM_SPINLOCK_SEMAPHORES) + 1;
}
--
2.25.0.114.g5b0ca878e0
v1-0003-Add-basic-spinlock-tests-to-regression-tests.patch (text/x-diff)
From 7eb9faf6b33792a22eccc4dd6f5fc6727c17fb41 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 8 Jun 2020 16:36:51 -0700
Subject: [PATCH v1 3/4] Add basic spinlock tests to regression tests.
As s_lock_test, the already existing test for spinlocks, isn't run in
an automated fashion (and doesn't test a normal backend environment),
adding tests that are run as part of a normal regression run is a good
idea.
Currently the new tests are run as part of the pre-existing
test_atomic_ops() test. That can probably be quibbled about.
The only operations that s_lock_test tests but the new tests don't are
the detection of a stuck spinlock and S_LOCK_FREE (which is otherwise
unused).
This currently contains a test for more than INT_MAX spinlocks (only
run with --disable-spinlocks), to ensure the previous commit fixing a
bug with more than INT_MAX spinlock initializations is correct, but
that might be too slow to enable generally.
It might be worth retiring s_lock_test after this. The added coverage
of a stuck spinlock probably isn't worth the added complexity?
Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/test/regress/regress.c | 89 ++++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 960c155e5f2..a48f9de2532 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -794,6 +794,92 @@ test_atomic_uint64(void)
EXPECT_EQ_U64(pg_atomic_fetch_and_u64(&var, ~0), 0);
}
+static void
+test_spinlock(void)
+{
+ {
+ struct test_lock_struct
+ {
+ uint32 data_before;
+ slock_t lock;
+ uint32 data_after;
+ } struct_w_lock;
+
+ struct_w_lock.data_before = 0x44;
+ struct_w_lock.data_after = 0x17;
+
+ /* test basic operations via the SpinLock* API */
+ SpinLockInit(&struct_w_lock.lock);
+ SpinLockAcquire(&struct_w_lock.lock);
+ SpinLockRelease(&struct_w_lock.lock);
+
+ /* test basic operations via underlying S_* API */
+ S_INIT_LOCK(&struct_w_lock.lock);
+ S_LOCK(&struct_w_lock.lock);
+ S_UNLOCK(&struct_w_lock.lock);
+
+ /* and that "contended" acquisition works */
+ s_lock(&struct_w_lock.lock, "testfile", 17, "testfunc");
+ S_UNLOCK(&struct_w_lock.lock);
+
+ /*
+ * Check, using TAS directly, that a single spin cyle doesn't block
+ * when acquiring an already acquired lock.
+ */
+#ifdef TAS
+ S_LOCK(&struct_w_lock.lock);
+ if (!TAS(&struct_w_lock.lock))
+ elog(ERROR, "acquired already held spinlock");
+
+#ifdef TAS_SPIN
+ if (!TAS_SPIN(&struct_w_lock.lock))
+ elog(ERROR, "acquired already held spinlock");
+#endif /* defined(TAS_SPIN) */
+
+ S_UNLOCK(&struct_w_lock.lock);
+#endif /* defined(TAS) */
+
+ /*
+ * Verify that after all of this the non-lock contents are still
+ * correct.
+ */
+ EXPECT_EQ_U32(struct_w_lock.data_before, 0x44);
+ EXPECT_EQ_U32(struct_w_lock.data_after, 0x17);
+ }
+
+ /*
+ * Ensure that allocating more than INT32_MAX simulated spinlocks
+ * works. This is probably too expensive to run each regression test.
+ */
+#ifndef HAVE_SPINLOCKS
+ {
+ /*
+ * Initialize enough spinlocks to advance counter close to
+ * wraparound. It's too expensive to perform acquire/release for each,
+ * as those may be syscalls when the spinlock emulation is used (and
+ * even just atomic TAS would be expensive).
+ */
+ for (uint32 i = 0; i < INT32_MAX - 100000; i++)
+ {
+ slock_t lock;
+
+ SpinLockInit(&lock);
+ }
+
+ for (uint32 i = 0; i < 200000; i++)
+ {
+ slock_t lock;
+
+ SpinLockInit(&lock);
+
+ SpinLockAcquire(&lock);
+ SpinLockRelease(&lock);
+ SpinLockAcquire(&lock);
+ SpinLockRelease(&lock);
+ }
+ }
+#endif
+}
PG_FUNCTION_INFO_V1(test_atomic_ops);
Datum
@@ -805,6 +891,9 @@ test_atomic_ops(PG_FUNCTION_ARGS)
test_atomic_uint64();
+ /* XXX: Is there a better location for this? */
+ test_spinlock();
+
PG_RETURN_BOOL(true);
}
--
2.25.0.114.g5b0ca878e0
v1-0004-Fix-deadlock-danger-when-atomic-ops-are-done-unde.patch (text/x-diff)
From 1d313fd848882c97651e3d7592e71496bbbcc339 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 8 Jun 2020 16:50:37 -0700
Subject: [PATCH v1 4/4] Fix deadlock danger when atomic ops are done under
spinlock.
This was a danger only for --disable-spinlocks in combination with
atomic operations unsupported by the current platform.
While atomics.c was careful to signal that a separate semaphore ought
to be used when spinlock emulation is active, spin.c didn't actually
implement that mechanism. That's my (Andres') fault, it seems to have
gotten lost during the development of the atomic operations support.
Fix that issue and add test for nesting atomic operations inside a
spinlock.
Author: Andres Freund
Discussion: https://postgr.es/m/20200605023302.g6v3ydozy5txifji@alap3.anarazel.de
Backpatch: 9.5-
---
src/backend/storage/lmgr/spin.c | 95 +++++++++++++++++++++++----------
src/test/regress/regress.c | 43 +++++++++++++++
2 files changed, 109 insertions(+), 29 deletions(-)
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index 753943e46d6..141606496eb 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -28,8 +28,24 @@
#ifndef HAVE_SPINLOCKS
+
+/*
+ * No TAS, so spinlocks are implemented as PGSemaphores.
+ */
+
+#ifndef HAVE_ATOMICS
+#define NUM_SIMULATION_SEMAPHORES (NUM_SPINLOCK_SEMAPHORES + NUM_ATOMICS_SEMAPHORES)
+#else
+#define NUM_SIMULATION_SEMAPHORES (NUM_SPINLOCK_SEMAPHORES)
+#endif /* DISABLE_ATOMICS */
+
PGSemaphore *SpinlockSemaArray;
-#endif
+
+#else /* !HAVE_SPINLOCKS */
+
+#define NUM_SIMULATION_SEMAPHORES 0
+
+#endif /* HAVE_SPINLOCKS */
/*
* Report the amount of shared memory needed to store semaphores for spinlock
@@ -38,34 +54,19 @@ PGSemaphore *SpinlockSemaArray;
Size
SpinlockSemaSize(void)
{
- return SpinlockSemas() * sizeof(PGSemaphore);
+ return NUM_SIMULATION_SEMAPHORES * sizeof(PGSemaphore);
}
-#ifdef HAVE_SPINLOCKS
-
/*
* Report number of semaphores needed to support spinlocks.
*/
int
SpinlockSemas(void)
{
- return 0;
+ return NUM_SIMULATION_SEMAPHORES;
}
-#else /* !HAVE_SPINLOCKS */
-/*
- * No TAS, so spinlocks are implemented as PGSemaphores.
- */
-
-
-/*
- * Report number of semaphores needed to support spinlocks.
- */
-int
-SpinlockSemas(void)
-{
- return NUM_SPINLOCK_SEMAPHORES + NUM_ATOMICS_SEMAPHORES;
-}
+#ifndef HAVE_SPINLOCKS
/*
* Initialize spinlock emulation.
@@ -92,23 +93,59 @@ SpinlockSemaInit(void)
/*
* s_lock.h hardware-spinlock emulation using semaphores
*
- * We map all spinlocks onto a set of NUM_SPINLOCK_SEMAPHORES semaphores.
- * It's okay to map multiple spinlocks onto one semaphore because no process
- * should ever hold more than one at a time. We just need enough semaphores
- * so that we aren't adding too much extra contention from that.
+ * We map all spinlocks onto NUM_SIMULATION_SEMAPHORES semaphores. It's okay to
+ * map multiple spinlocks onto one semaphore because no process should ever
+ * hold more than one at a time. We just need enough semaphores so that we
+ * aren't adding too much extra contention from that.
+ *
+ * There is one exception to the restriction of only holding one spinlock at a
+ * time, which is that it's ok if emulated atomic operations are nested inside
+ * spinlocks. To avoid the danger of spinlocks and atomic using the same sema,
+ * we make sure "normal" spinlocks and atomics backed by spinlocks use
+ * distinct semaphores (see the nested argument to s_init_lock_sema).
*
* slock_t is just an int for this implementation; it holds the spinlock
- * number from 1..NUM_SPINLOCK_SEMAPHORES. We intentionally ensure that 0
+ * number from 1..NUM_SIMULATION_SEMAPHORES. We intentionally ensure that 0
* is not a valid value, so that testing with this code can help find
* failures to initialize spinlocks.
*/
+static inline void
+s_check_valid(int lockndx)
+{
+ if (unlikely(lockndx <= 0 || lockndx > NUM_SIMULATION_SEMAPHORES))
+ elog(ERROR, "invalid spinlock number: %d", lockndx);
+}
+
void
s_init_lock_sema(volatile slock_t *lock, bool nested)
{
static uint32 counter = 0;
+ uint32 offset;
+ uint32 sema_total;
+ uint32 idx;
- *lock = ((++counter) % NUM_SPINLOCK_SEMAPHORES) + 1;
+ if (nested)
+ {
+ /*
+ * To allow nesting atomics inside spinlocked sections, use a
+ * different spinlock. See comment above.
+ */
+ offset = 1 + NUM_SPINLOCK_SEMAPHORES;
+ sema_total = NUM_ATOMICS_SEMAPHORES;
+ }
+ else
+ {
+ offset = 1;
+ sema_total = NUM_SPINLOCK_SEMAPHORES;
+ }
+
+ idx = (counter++ % sema_total) + offset;
+
+ /* double check we did things correctly */
+ s_check_valid(idx);
+
+ *lock = idx;
}
void
@@ -116,8 +153,8 @@ s_unlock_sema(volatile slock_t *lock)
{
int lockndx = *lock;
- if (lockndx <= 0 || lockndx > NUM_SPINLOCK_SEMAPHORES)
- elog(ERROR, "invalid spinlock number: %d", lockndx);
+ s_check_valid(lockndx);
+
PGSemaphoreUnlock(SpinlockSemaArray[lockndx - 1]);
}
@@ -134,8 +171,8 @@ tas_sema(volatile slock_t *lock)
{
int lockndx = *lock;
- if (lockndx <= 0 || lockndx > NUM_SPINLOCK_SEMAPHORES)
- elog(ERROR, "invalid spinlock number: %d", lockndx);
+ s_check_valid(lockndx);
+
/* Note that TAS macros return 0 if *success* */
return !PGSemaphoreTryLock(SpinlockSemaArray[lockndx - 1]);
}
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index a48f9de2532..231aab9d569 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -794,6 +794,47 @@ test_atomic_uint64(void)
EXPECT_EQ_U64(pg_atomic_fetch_and_u64(&var, ~0), 0);
}
+/*
+ * Verify that performing atomic ops inside a spinlock isn't a
+ * problem. Realistically that's only going to be a problem when both
+ * --disable-spinlocks and --disable-atomics are used, but it's cheap enough
+ * to just always test.
+ *
+ * The test works by initializing enough atomics that we'd conflict if there
+ * were an overlap between a spinlock and an atomic by holding a spinlock
+ * while manipulating more than NUM_SPINLOCK_SEMAPHORES atomics.
+ *
+ * NUM_TEST_ATOMICS doesn't really need to be more than
+ * NUM_SPINLOCK_SEMAPHORES, but it seems better to test a bit more
+ * extensively.
+ */
+static void
+test_atomic_spin_nest(void)
+{
+ slock_t lock;
+#define NUM_TEST_ATOMICS (NUM_SPINLOCK_SEMAPHORES + NUM_ATOMICS_SEMAPHORES + 27)
+ pg_atomic_uint32 atomics[NUM_TEST_ATOMICS];
+
+ SpinLockInit(&lock);
+
+ for (int i = 0; i < NUM_TEST_ATOMICS; i++)
+ pg_atomic_init_u32(&atomics[i], 0);
+
+ /* just so it's not all zeroes */
+ for (int i = 0; i < NUM_TEST_ATOMICS; i++)
+ EXPECT_EQ_U32(pg_atomic_fetch_add_u32(&atomics[i], i), 0);
+
+ /* test whether we can do atomic op with lock held */
+ SpinLockAcquire(&lock);
+ for (int i = 0; i < NUM_TEST_ATOMICS; i++)
+ {
+ EXPECT_EQ_U32(pg_atomic_fetch_sub_u32(&atomics[i], i), i);
+ EXPECT_EQ_U32(pg_atomic_read_u32(&atomics[i]), 0);
+ }
+ SpinLockRelease(&lock);
+}
+#undef NUM_TEST_ATOMICS
+
static void
test_spinlock(void)
{
@@ -891,6 +932,8 @@ test_atomic_ops(PG_FUNCTION_ARGS)
test_atomic_uint64();
+ test_atomic_spin_nest();
+
/* XXX: Is there a better location for this? */
test_spinlock();
--
2.25.0.114.g5b0ca878e0
Hi,
On 2020-06-08 23:08:47 -0700, Andres Freund wrote:
On 2020-06-05 17:19:26 -0700, Andres Freund wrote:
I wrote a patch for this, and when I got around to testing it, I
found that our tests currently don't pass when using both
--disable-spinlocks and --disable-atomics. Turns out to not be related
to the issue above, but the global barrier support added in 13.
That *reads* two 64 bit atomics in a signal handler. Which is normally
fine, but not at all cool when atomics (or just 64 bit atomics) are
backed by spinlocks. Because we can "self interrupt" while already
holding the spinlock.
It looks to me that that's a danger whenever 64bit atomics are backed by
spinlocks, not just when both --disable-spinlocks and --disable-atomics
are used. But I suspect that it's really hard to hit the tiny window of
danger when those options aren't used. While we have buildfarm animals
testing each of those separately, we don't have one that tests both
together...
I'm not really sure what to do about that issue. The easiest thing
would probably be to change the barrier generation to 32bit (which
doesn't have to use locks for reads in any situation). I tested doing
that, and it fixes the hangs for me.
Randomly noticed while looking at the code:
uint64 flagbit = UINT64CONST(1) << (uint64) type;
that shouldn't be 64bit, right?
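To make the "self interrupt" hazard concrete, here is a minimal sketch of how a spinlock-backed 64bit read behaves (the names are hypothetical; the real fallback lives in atomics.c and fallback.h):

#include "postgres.h"
#include "storage/spin.h"

/* stand-in for the fallback representation of pg_atomic_uint64 */
typedef struct
{
	uint64		value;
} emulated_u64;

static slock_t emulation_lock;	/* assume SpinLockInit() ran at startup */

static uint64
emulated_read_u64(volatile emulated_u64 *ptr)
{
	uint64		val;

	/* even a plain read has to take the emulation spinlock */
	SpinLockAcquire(&emulation_lock);
	val = ptr->value;
	SpinLockRelease(&emulation_lock);

	return val;
}

If a SIGUSR1 handler performs such a read while the interrupted code sits between the acquire and the release, the handler spins on a lock its own process already holds, i.e. the self-deadlock described above.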
Attached is a series of patches addressing these issues, of varying
quality:
1) This fixes the above mentioned issue in the global barrier code by
using 32bit atomics. That might be fine, or it might not. I just
included it here because otherwise the tests cannot be run fully.
Hm. Looking at this again, perhaps the better fix would be to simply not
look at the concrete values of the barrier inside the signal handler?
E.g. we could have a new PROCSIG_GLOBAL_BARRIER, which just triggers
ProcSignalBarrierPending to be set. And then have
ProcessProcSignalBarrier do the check that's currently in
CheckProcSignalBarrier()?
Greetings,
Andres Freund
On Tue, Jun 9, 2020 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
Hm. Looking at this again, perhaps the better fix would be to simply not
look at the concrete values of the barrier inside the signal handler?
E.g. we could have a new PROCSIG_GLOBAL_BARRIER, which just triggers
ProcSignalBarrierPending to be set. And then have
ProcessProcSignalBarrier do the check that's currently in
CheckProcSignalBarrier()?
That seems like a good idea.
Also, I wonder if someone would be willing to set up a BF animal for this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-06-09 17:04:42 -0400, Robert Haas wrote:
On Tue, Jun 9, 2020 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
Hm. Looking at this again, perhaps the better fix would be to simply not
look at the concrete values of the barrier inside the signal handler?
E.g. we could have a new PROCSIG_GLOBAL_BARRIER, which just triggers
ProcSignalBarrierPending to be set. And then have
ProcessProcSignalBarrier do the check that's currently in
CheckProcSignalBarrier()?
That seems like a good idea.
Cool.
Also, I wonder if someone would be willing to set up a BF animal for this.
You mean having both --disable-atomics and --disable-spinlocks? If so,
I'm planning to do that (I already have the animals that do those
separately, so it seems to make sense to add it to that collection).
What do you think about my idea of having a BEGIN/END_SIGNAL_HANDLER?
That'd make it much easier to write assertions forbidding palloc, 64bit
atomics, ...
Greetings,
Andres Freund
On Fri, Jun 5, 2020 at 8:19 PM Andres Freund <andres@anarazel.de> wrote:
Randomly noticed while looking at the code:
uint64 flagbit = UINT64CONST(1) << (uint64) type;
that shouldn't be 64bit, right?
I'm going to admit ignorance here. What's the proper coding rule?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jun 5, 2020 at 8:19 PM Andres Freund <andres@anarazel.de> wrote:
Randomly noticed while looking at the code:
uint64 flagbit = UINT64CONST(1) << (uint64) type;
that shouldn't be 64bit, right?
I'm going to admit ignorance here. What's the proper coding rule?
The shift distance can't exceed 64, so there's no need for it to be
wider than int. "type" is an enum, so explicitly casting it to an
integral type seems like good practice, but int is sufficient.
ISTR older compilers insisting that the shift distance not be
wider than int. But C99 doesn't seem to require that -- it only
restricts the value of the right operand.
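For illustration, assuming "type" is the barrier-type enum from the code above, the width of the mask and the width of the shift distance are independent choices:

uint32 flagbit = 1U << (int) type;              /* 32-bit mask, int shift distance */
uint64 widebit = UINT64CONST(1) << (int) type;  /* 64-bit mask, same shift distance */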
regards, tom lane
On Tue, Jun 9, 2020 at 6:54 PM Andres Freund <andres@anarazel.de> wrote:
What do you think about my idea of having a BEGIN/END_SIGNAL_HANDLER?
That'd make it much easier to write assertions forbidding palloc, 64bit
atomics, ...
I must have missed the previous place where you suggested this, but I
think it's a good idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-06-10 07:26:32 -0400, Robert Haas wrote:
On Fri, Jun 5, 2020 at 8:19 PM Andres Freund <andres@anarazel.de> wrote:
Randomly noticed while looking at the code:
uint64 flagbit = UINT64CONST(1) << (uint64) type;
that shouldn't be 64bit, right?
I'm going to admit ignorance here. What's the proper coding rule?
Well, pss_barrierCheckMask member is just 32bit, so it seems odd to
declare the local variable 64bit?
uint64 flagbit = UINT64CONST(1) << (uint64) type;
...
pg_atomic_fetch_or_u32(&slot->pss_barrierCheckMask, flagbit);
Greetings,
Andres Freund
Hi,
On 2020-06-10 13:37:59 -0400, Robert Haas wrote:
On Tue, Jun 9, 2020 at 6:54 PM Andres Freund <andres@anarazel.de> wrote:
What do you think about my idea of having a BEGIN/END_SIGNAL_HANDLER?
That'd make it much easier to write assertions forbidding palloc, 64bit
atomics, ...
I must have missed the previous place where you suggested this, but I
think it's a good idea.
/messages/by-id/20200606023103.avzrctgv7476xj7i@alap3.anarazel.de
It'd be neat if we could do that entirely within pqsignal(). But that'd
require some additional state (I think an array of handlers, indexed by
signum).
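A rough sketch of that idea, with made-up names (the handler table and the depth counter are assumptions, not an existing API):

#include "postgres.h"
#include <signal.h>

static volatile sig_atomic_t signal_handler_depth = 0;
static pqsigfunc user_handlers[64];	/* hypothetical table indexed by signo */

static void
wrapper_handler(SIGNAL_ARGS)
{
	signal_handler_depth++;			/* would be BEGIN_SIGNAL_HANDLER */

	user_handlers[postgres_signal_arg](postgres_signal_arg);

	signal_handler_depth--;			/* would be END_SIGNAL_HANDLER */
}

pqsignal() would then install wrapper_handler and remember the real handler in user_handlers[], and e.g. palloc() or the 64bit atomics fallback could Assert(signal_handler_depth == 0).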
Greetings,
Andres Freund
On Thu, Jun 11, 2020 at 1:26 PM Andres Freund <andres@anarazel.de> wrote:
Well, pss_barrierCheckMask member is just 32bit, so it seems odd to
declare the local variable 64bit?
uint64 flagbit = UINT64CONST(1) << (uint64) type;
...
pg_atomic_fetch_or_u32(&slot->pss_barrierCheckMask, flagbit);
Oooooops.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-06-09 17:04:42 -0400, Robert Haas wrote:
On Tue, Jun 9, 2020 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
Hm. Looking at this again, perhaps the better fix would be to simply not
look at the concrete values of the barrier inside the signal handler?
E.g. we could have a new PROCSIG_GLOBAL_BARRIER, which just triggers
ProcSignalBarrierPending to be set. And then have
ProcessProcSignalBarrier do the check that's currently in
CheckProcSignalBarrier()?
That seems like a good idea.
What do you think about 0002?
With regard to the cost of the expensive test in 0003, I'm somewhat
inclined to add that to the buildfarm for a few days and see how it
actually affects the few bf animals without atomics. We can rip it out
after we got some additional coverage (or leave it in if it turns out to
be cheap enough in comparison).
Also, I wonder if someone would be willing to set up a BF animal for this.
FWIW, I've requested a buildfarm animal id for this a few days ago, but
haven't received a response yet...
Greetings,
Andres Freund
Attachments:
v1-0001-spinlock-emulation-Fix-bug-when-more-than-INT_MAX.patch (text/x-diff)
From 36601fe27dfefae7f5c1221fbfd364c6d7def4b5 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 8 Jun 2020 15:25:49 -0700
Subject: [PATCH v1 1/4] spinlock emulation: Fix bug when more than INT_MAX
spinlocks are initialized.
Once the counter goes negative we ended up with spinlocks that errored
out on first use (due to check in tas_sema).
Author: Andres Freund
Discussion: https://postgr.es/m/20200606023103.avzrctgv7476xj7i@alap3.anarazel.de
Backpatch: 9.5-
---
src/backend/storage/lmgr/spin.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index 4d2a4c6641a..753943e46d6 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -106,7 +106,7 @@ SpinlockSemaInit(void)
void
s_init_lock_sema(volatile slock_t *lock, bool nested)
{
- static int counter = 0;
+ static uint32 counter = 0;
*lock = ((++counter) % NUM_SPINLOCK_SEMAPHORES) + 1;
}
--
2.25.0.114.g5b0ca878e0
v1-0002-Avoid-potential-spinlock-use-inside-a-signal-hand.patch (text/x-diff)
From 73d72693fc7a55ae327212f3932d2a45eb47d13b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 15 Jun 2020 18:23:10 -0700
Subject: [PATCH v1 2/4] Avoid potential spinlock use inside a signal handler
for global barriers.
On platforms without support for 64bit atomic operations where we also
cannot rely on 64bit reads to have single copy atomicity, such atomics
are implemented using a spinlock based fallback. That means it's not
safe to even read such atomics from within a signal handler (since the
signal handler might run when the spinlock already is held).
To avoid this issue defer global barrier processing out of the signal
handler. Instead of checking local / shared barrier generation to
determine whether to set ProcSignalBarrierPending, introduce
PROCSIG_BARRIER and always set ProcSignalBarrierPending when
receiving such a signal. Additionally avoid redundant work in
ProcessProcSignalBarrier if ProcSignalBarrierPending is set unnecessarily.
Also do a small amount of other polishing.
Author: Andres Freund
Discussion: https://postgr.es/m/20200609193723.eu5ilsjxwdpyxhgz@alap3.anarazel.de
Backpatch: 13-, where the code was introduced.
---
src/include/storage/procsignal.h | 1 +
src/backend/storage/ipc/procsignal.c | 87 ++++++++++++++++------------
2 files changed, 52 insertions(+), 36 deletions(-)
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index a0c0bc3ce55..5cb39697f38 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -33,6 +33,7 @@ typedef enum
PROCSIG_NOTIFY_INTERRUPT, /* listen/notify interrupt */
PROCSIG_PARALLEL_MESSAGE, /* message from cooperating parallel backend */
PROCSIG_WALSND_INIT_STOPPING, /* ask walsenders to prepare for shutdown */
+ PROCSIG_BARRIER, /* global barrier interrupt */
/* Recovery conflict reasons */
PROCSIG_RECOVERY_CONFLICT_DATABASE,
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c809196d06a..4fa385b0ece 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -320,7 +320,7 @@ SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
uint64
EmitProcSignalBarrier(ProcSignalBarrierType type)
{
- uint64 flagbit = UINT64CONST(1) << (uint64) type;
+ uint32 flagbit = 1 << (uint32) type;
uint64 generation;
/*
@@ -363,7 +363,11 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
pid_t pid = slot->pss_pid;
if (pid != 0)
+ {
+ /* see SendProcSignal for details */
+ slot->pss_signalFlags[PROCSIG_BARRIER] = true;
kill(pid, SIGUSR1);
+ }
}
return generation;
@@ -383,6 +387,8 @@ WaitForProcSignalBarrier(uint64 generation)
{
long timeout = 125L;
+ Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
+
for (int i = NumProcSignalSlots - 1; i >= 0; i--)
{
volatile ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
@@ -417,6 +423,23 @@ WaitForProcSignalBarrier(uint64 generation)
pg_memory_barrier();
}
+/*
+ * Handle receipt of an interrupt indicating a global barrier event.
+ *
+ * All the actual work is deferred to ProcessProcSignalBarrier(), because we
+ * cannot safely access the barrier generation inside the signal handler as
+ * 64bit atomics might use spinlock based emulation, even for reads. As this
+ * routine only gets called when PROCSIG_BARRIER is sent, that won't cause a
+ * lot of unnecessary work.
+ */
+static void
+HandleProcSignalBarrierInterrupt(void)
+{
+ InterruptPending = true;
+ ProcSignalBarrierPending = true;
+ /* latch will be set by procsignal_sigusr1_handler */
+}
+
/*
* Perform global barrier related interrupt checking.
*
@@ -428,22 +451,38 @@ WaitForProcSignalBarrier(uint64 generation)
void
ProcessProcSignalBarrier(void)
{
- uint64 generation;
+ uint64 local_gen;
+ uint64 shared_gen;
uint32 flags;
+ Assert(MyProcSignalSlot);
+
/* Exit quickly if there's no work to do. */
if (!ProcSignalBarrierPending)
return;
ProcSignalBarrierPending = false;
/*
- * Read the current barrier generation, and then get the flags that are
- * set for this backend. Note that pg_atomic_exchange_u32 is a full
- * barrier, so we're guaranteed that the read of the barrier generation
- * happens before we atomically extract the flags, and that any subsequent
- * state changes happen afterward.
+ * It's not unlikely to process multiple barriers at once, before the
+ * signals for all the barriers have arrived. To avoid unnecessary work in
+ * response to subsequent signals, exit early if we already have processed
+ * all of them.
+ */
+ local_gen = pg_atomic_read_u64(&MyProcSignalSlot->pss_barrierGeneration);
+ shared_gen = pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
+
+ Assert(local_gen <= shared_gen);
+
+ if (local_gen == shared_gen)
+ return;
+
+ /*
+ * Get and clear the flags that are set for this backend. Note that
+ * pg_atomic_exchange_u32 is a full barrier, so we're guaranteed that the
+ * read of the barrier generation above happens before we atomically
+ * extract the flags, and that any subsequent state changes happen
+ * afterward.
*/
- generation = pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);
/*
@@ -466,7 +505,7 @@ ProcessProcSignalBarrier(void)
* things have changed further, it'll get fixed up when this function is
* next called.
*/
- pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, generation);
+ pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen);
}
static void
@@ -505,27 +544,6 @@ CheckProcSignal(ProcSignalReason reason)
return false;
}
-/*
- * CheckProcSignalBarrier - check for new barriers we need to absorb
- */
-static bool
-CheckProcSignalBarrier(void)
-{
- volatile ProcSignalSlot *slot = MyProcSignalSlot;
-
- if (slot != NULL)
- {
- uint64 mygen;
- uint64 curgen;
-
- mygen = pg_atomic_read_u64(&slot->pss_barrierGeneration);
- curgen = pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration);
- return (mygen != curgen);
- }
-
- return false;
-}
-
/*
* procsignal_sigusr1_handler - handle SIGUSR1 signal.
*/
@@ -546,6 +564,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_WALSND_INIT_STOPPING))
HandleWalSndInitStopping();
+ if (CheckProcSignal(PROCSIG_BARRIER))
+ HandleProcSignalBarrierInterrupt();
+
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
@@ -564,12 +585,6 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
- if (CheckProcSignalBarrier())
- {
- InterruptPending = true;
- ProcSignalBarrierPending = true;
- }
-
SetLatch(MyLatch);
latch_sigusr1_handler();
--
2.25.0.114.g5b0ca878e0
v1-0003-Add-basic-spinlock-tests-to-regression-tests.patch (text/x-diff)
From a77798d9125e799d3a2cdc3e7a514fb869219ed7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 8 Jun 2020 16:36:51 -0700
Subject: [PATCH v1 3/4] Add basic spinlock tests to regression tests.
As s_lock_test, the already existing test for spinlocks, isn't run in
an automated fashion (and doesn't test a normal backend environment),
adding tests that are run as part of a normal regression run is a good
idea.
Currently the new tests are run as part of the pre-existing
test_atomic_ops() test. That can probably be quibbled about.
The only operations that s_lock_test tests but the new tests don't are
the detection of a stuck spinlock and S_LOCK_FREE (which is otherwise
unused).
This currently contains a test for more than INT_MAX spinlocks (only
run with --disable-spinlocks), to ensure the previous commit fixing a
bug with more than INT_MAX spinlock initializations is correct, but
that might be too slow to enable generally.
It might be worth retiring s_lock_test after this. The added coverage
of a stuck spinlock probably isn't worth the added complexity?
Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/test/regress/regress.c | 89 ++++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 960c155e5f2..a48f9de2532 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -794,6 +794,92 @@ test_atomic_uint64(void)
EXPECT_EQ_U64(pg_atomic_fetch_and_u64(&var, ~0), 0);
}
+static void
+test_spinlock(void)
+{
+ {
+ struct test_lock_struct
+ {
+ uint32 data_before;
+ slock_t lock;
+ uint32 data_after;
+ } struct_w_lock;
+
+ struct_w_lock.data_before = 0x44;
+ struct_w_lock.data_after = 0x17;
+
+ /* test basic operations via the SpinLock* API */
+ SpinLockInit(&struct_w_lock.lock);
+ SpinLockAcquire(&struct_w_lock.lock);
+ SpinLockRelease(&struct_w_lock.lock);
+
+ /* test basic operations via underlying S_* API */
+ S_INIT_LOCK(&struct_w_lock.lock);
+ S_LOCK(&struct_w_lock.lock);
+ S_UNLOCK(&struct_w_lock.lock);
+
+ /* and that "contended" acquisition works */
+ s_lock(&struct_w_lock.lock, "testfile", 17, "testfunc");
+ S_UNLOCK(&struct_w_lock.lock);
+
+ /*
+ * Check, using TAS directly, that a single spin cyle doesn't block
+ * when acquiring an already acquired lock.
+ */
+#ifdef TAS
+ S_LOCK(&struct_w_lock.lock);
+ if (!TAS(&struct_w_lock.lock))
+ elog(ERROR, "acquired already held spinlock");
+
+#ifdef TAS_SPIN
+ if (!TAS_SPIN(&struct_w_lock.lock))
+ elog(ERROR, "acquired already held spinlock");
+#endif /* defined(TAS_SPIN) */
+
+ S_UNLOCK(&struct_w_lock.lock);
+#endif /* defined(TAS) */
+
+ /*
+ * Verify that after all of this the non-lock contents are still
+ * correct.
+ */
+ EXPECT_EQ_U32(struct_w_lock.data_before, 0x44);
+ EXPECT_EQ_U32(struct_w_lock.data_after, 0x17);
+ }
+
+ /*
+ * Ensure that allocating more than INT32_MAX simulated spinlocks
+ * works. This is probably too expensive to run each regression test.
+ */
+#ifndef HAVE_SPINLOCKS
+ {
+ /*
+ * Initialize enough spinlocks to advance counter close to
+ * wraparound. It's too expensive to perform acquire/release for each,
+ * as those may be syscalls when the spinlock emulation is used (and
+ * even just atomic TAS would be expensive).
+ */
+ for (uint32 i = 0; i < INT32_MAX - 100000; i++)
+ {
+ slock_t lock;
+
+ SpinLockInit(&lock);
+ }
+
+ for (uint32 i = 0; i < 200000; i++)
+ {
+ slock_t lock;
+
+ SpinLockInit(&lock);
+
+ SpinLockAcquire(&lock);
+ SpinLockRelease(&lock);
+ SpinLockAcquire(&lock);
+ SpinLockRelease(&lock);
+ }
+ }
+#endif
+}
PG_FUNCTION_INFO_V1(test_atomic_ops);
Datum
@@ -805,6 +891,9 @@ test_atomic_ops(PG_FUNCTION_ARGS)
test_atomic_uint64();
+ /* XXX: Is there a better location for this? */
+ test_spinlock();
+
PG_RETURN_BOOL(true);
}
--
2.25.0.114.g5b0ca878e0
v1-0004-Fix-deadlock-danger-when-atomic-ops-are-done-unde.patch (text/x-diff)
From 5ce7f091400c03a93d10a979264041bba172d8b0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 8 Jun 2020 16:50:37 -0700
Subject: [PATCH v1 4/4] Fix deadlock danger when atomic ops are done under
spinlock.
This was a danger only for --disable-spinlocks in combination with
atomic operations unsupported by the current platform.
While atomics.c was careful to signal that a separate semaphore ought
to be used when spinlock emulation is active, spin.c didn't actually
implement that mechanism. That's my (Andres') fault, it seems to have
gotten lost during the development of the atomic operations support.
Fix that issue and add test for nesting atomic operations inside a
spinlock.
Author: Andres Freund
Discussion: https://postgr.es/m/20200605023302.g6v3ydozy5txifji@alap3.anarazel.de
Backpatch: 9.5-
---
src/backend/storage/lmgr/spin.c | 95 +++++++++++++++++++++++----------
src/test/regress/regress.c | 43 +++++++++++++++
2 files changed, 109 insertions(+), 29 deletions(-)
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index 753943e46d6..141606496eb 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -28,8 +28,24 @@
#ifndef HAVE_SPINLOCKS
+
+/*
+ * No TAS, so spinlocks are implemented as PGSemaphores.
+ */
+
+#ifndef HAVE_ATOMICS
+#define NUM_SIMULATION_SEMAPHORES (NUM_SPINLOCK_SEMAPHORES + NUM_ATOMICS_SEMAPHORES)
+#else
+#define NUM_SIMULATION_SEMAPHORES (NUM_SPINLOCK_SEMAPHORES)
+#endif /* DISABLE_ATOMICS */
+
PGSemaphore *SpinlockSemaArray;
-#endif
+
+#else /* !HAVE_SPINLOCKS */
+
+#define NUM_SIMULATION_SEMAPHORES 0
+
+#endif /* HAVE_SPINLOCKS */
/*
* Report the amount of shared memory needed to store semaphores for spinlock
@@ -38,34 +54,19 @@ PGSemaphore *SpinlockSemaArray;
Size
SpinlockSemaSize(void)
{
- return SpinlockSemas() * sizeof(PGSemaphore);
+ return NUM_SIMULATION_SEMAPHORES * sizeof(PGSemaphore);
}
-#ifdef HAVE_SPINLOCKS
-
/*
* Report number of semaphores needed to support spinlocks.
*/
int
SpinlockSemas(void)
{
- return 0;
+ return NUM_SIMULATION_SEMAPHORES;
}
-#else /* !HAVE_SPINLOCKS */
-/*
- * No TAS, so spinlocks are implemented as PGSemaphores.
- */
-
-
-/*
- * Report number of semaphores needed to support spinlocks.
- */
-int
-SpinlockSemas(void)
-{
- return NUM_SPINLOCK_SEMAPHORES + NUM_ATOMICS_SEMAPHORES;
-}
+#ifndef HAVE_SPINLOCKS
/*
* Initialize spinlock emulation.
@@ -92,23 +93,59 @@ SpinlockSemaInit(void)
/*
* s_lock.h hardware-spinlock emulation using semaphores
*
- * We map all spinlocks onto a set of NUM_SPINLOCK_SEMAPHORES semaphores.
- * It's okay to map multiple spinlocks onto one semaphore because no process
- * should ever hold more than one at a time. We just need enough semaphores
- * so that we aren't adding too much extra contention from that.
+ * We map all spinlocks onto NUM_SIMULATION_SEMAPHORES semaphores. It's okay to
+ * map multiple spinlocks onto one semaphore because no process should ever
+ * hold more than one at a time. We just need enough semaphores so that we
+ * aren't adding too much extra contention from that.
+ *
+ * There is one exception to the restriction of only holding one spinlock at a
+ * time, which is that it's ok if emulated atomic operations are nested inside
+ * spinlocks. To avoid the danger of spinlocks and atomic using the same sema,
+ * we make sure "normal" spinlocks and atomics backed by spinlocks use
+ * distinct semaphores (see the nested argument to s_init_lock_sema).
*
* slock_t is just an int for this implementation; it holds the spinlock
- * number from 1..NUM_SPINLOCK_SEMAPHORES. We intentionally ensure that 0
+ * number from 1..NUM_SIMULATION_SEMAPHORES. We intentionally ensure that 0
* is not a valid value, so that testing with this code can help find
* failures to initialize spinlocks.
*/
+static inline void
+s_check_valid(int lockndx)
+{
+ if (unlikely(lockndx <= 0 || lockndx > NUM_SIMULATION_SEMAPHORES))
+ elog(ERROR, "invalid spinlock number: %d", lockndx);
+}
+
void
s_init_lock_sema(volatile slock_t *lock, bool nested)
{
static uint32 counter = 0;
+ uint32 offset;
+ uint32 sema_total;
+ uint32 idx;
- *lock = ((++counter) % NUM_SPINLOCK_SEMAPHORES) + 1;
+ if (nested)
+ {
+ /*
+ * To allow nesting atomics inside spinlocked sections, use a
+ * different spinlock. See comment above.
+ */
+ offset = 1 + NUM_SPINLOCK_SEMAPHORES;
+ sema_total = NUM_ATOMICS_SEMAPHORES;
+ }
+ else
+ {
+ offset = 1;
+ sema_total = NUM_SPINLOCK_SEMAPHORES;
+ }
+
+ idx = (counter++ % sema_total) + offset;
+
+ /* double check we did things correctly */
+ s_check_valid(idx);
+
+ *lock = idx;
}
void
@@ -116,8 +153,8 @@ s_unlock_sema(volatile slock_t *lock)
{
int lockndx = *lock;
- if (lockndx <= 0 || lockndx > NUM_SPINLOCK_SEMAPHORES)
- elog(ERROR, "invalid spinlock number: %d", lockndx);
+ s_check_valid(lockndx);
+
PGSemaphoreUnlock(SpinlockSemaArray[lockndx - 1]);
}
@@ -134,8 +171,8 @@ tas_sema(volatile slock_t *lock)
{
int lockndx = *lock;
- if (lockndx <= 0 || lockndx > NUM_SPINLOCK_SEMAPHORES)
- elog(ERROR, "invalid spinlock number: %d", lockndx);
+ s_check_valid(lockndx);
+
/* Note that TAS macros return 0 if *success* */
return !PGSemaphoreTryLock(SpinlockSemaArray[lockndx - 1]);
}
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index a48f9de2532..231aab9d569 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -794,6 +794,47 @@ test_atomic_uint64(void)
EXPECT_EQ_U64(pg_atomic_fetch_and_u64(&var, ~0), 0);
}
+/*
+ * Verify that performing atomic ops inside a spinlock isn't a
+ * problem. Realistically that's only going to be a problem when both
+ * --disable-spinlocks and --disable-atomics are used, but it's cheap enough
+ * to just always test.
+ *
+ * The test works by initializing enough atomics that we'd conflict if there
+ * were an overlap between a spinlock and an atomic by holding a spinlock
+ * while manipulating more than NUM_SPINLOCK_SEMAPHORES atomics.
+ *
+ * NUM_TEST_ATOMICS doesn't really need to be more than
+ * NUM_SPINLOCK_SEMAPHORES, but it seems better to test a bit more
+ * extensively.
+ */
+static void
+test_atomic_spin_nest(void)
+{
+ slock_t lock;
+#define NUM_TEST_ATOMICS (NUM_SPINLOCK_SEMAPHORES + NUM_ATOMICS_SEMAPHORES + 27)
+ pg_atomic_uint32 atomics[NUM_TEST_ATOMICS];
+
+ SpinLockInit(&lock);
+
+ for (int i = 0; i < NUM_TEST_ATOMICS; i++)
+ pg_atomic_init_u32(&atomics[i], 0);
+
+ /* just so it's not all zeroes */
+ for (int i = 0; i < NUM_TEST_ATOMICS; i++)
+ EXPECT_EQ_U32(pg_atomic_fetch_add_u32(&atomics[i], i), 0);
+
+ /* test whether we can do atomic op with lock held */
+ SpinLockAcquire(&lock);
+ for (int i = 0; i < NUM_TEST_ATOMICS; i++)
+ {
+ EXPECT_EQ_U32(pg_atomic_fetch_sub_u32(&atomics[i], i), i);
+ EXPECT_EQ_U32(pg_atomic_read_u32(&atomics[i]), 0);
+ }
+ SpinLockRelease(&lock);
+}
+#undef NUM_TEST_ATOMICS
+
static void
test_spinlock(void)
{
@@ -891,6 +932,8 @@ test_atomic_ops(PG_FUNCTION_ARGS)
test_atomic_uint64();
+ test_atomic_spin_nest();
+
/* XXX: Is there a better location for this? */
test_spinlock();
--
2.25.0.114.g5b0ca878e0
On Mon, Jun 15, 2020 at 9:37 PM Andres Freund <andres@anarazel.de> wrote:
What do you think about 0002?
With regard to the cost of the expensive test in 0003, I'm somewhat
inclined to add that to the buildfarm for a few days and see how it
actually affects the few bf animals without atomics. We can rip it out
after we got some additional coverage (or leave it in if it turns out to
be cheap enough in comparison).
I looked over these patches briefly today. I don't have any objection
to 0001 or 0002. I think 0003 looks a little strange: it seems to be
testing things that might be implementation details of other things,
and I'm not sure that's really correct. In particular:
+ /* and that "contended" acquisition works */
+ s_lock(&struct_w_lock.lock, "testfile", 17, "testfunc");
+ S_UNLOCK(&struct_w_lock.lock);
I didn't think we had formally promised that s_lock() is actually
defined or working on all platforms.
More generally, I don't think it's entirely clear what all of these
tests are testing. Like, I can see that data_before and data_after are
intended to test that the lock actually fits in the space allowed for
it, but at the same time, I think empty implementations of all of
these functions would pass regression, as would many horribly or
subtly buggy implementations. For example, consider this:
+ /* test basic operations via the SpinLock* API */
+ SpinLockInit(&struct_w_lock.lock);
+ SpinLockAcquire(&struct_w_lock.lock);
+ SpinLockRelease(&struct_w_lock.lock);
What does it look like for this test to fail? I guess one of those
operations has to fail an assert or hang forever, because it's not
like we're checking the return value. So I feel like the intent of
these tests isn't entirely clear, and should probably be explained
better, at a minimum -- and perhaps we should think harder about what
a good testing framework would look like. I would rather have tests
that either pass or fail and report a result explicitly, rather than
tests that rely on hangs or crashes.
Parenthetically, "cyle" != "cycle".
I don't have any real complaints about the functionality of 0004 on a
quick read-through, but I'm again a bit skeptical of the tests. Not as
much as with 0003, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020-Jun-15, Andres Freund wrote:
Also, I wonder if someone would be willing to set up a BF animal for this.
FWIW, I've requested a buildfarm animal id for this a few days ago, but
haven't received a response yet...
I did send it out, with name rorqual -- didn't you get that? Will send
the secret separately.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2020-06-16 14:59:19 -0400, Robert Haas wrote:
On Mon, Jun 15, 2020 at 9:37 PM Andres Freund <andres@anarazel.de> wrote:
What do you think about 0002?
With regard to the cost of the expensive test in 0003, I'm somewhat
inclined to add that to the buildfarm for a few days and see how it
actually affects the few bf animals without atomics. We can rip it out
after we got some additional coverage (or leave it in if it turns out to
be cheap enough in comparison).
I looked over these patches briefly today. I don't have any objection
to 0001 or 0002.
Cool. I was mainly interested in those for now.
I think 0003 looks a little strange: it seems to be
testing things that might be implementation details of other things,
and I'm not sure that's really correct. In particular:
My main motivation was to have something that runs more often than
the embedded test in s_lock.c that nobody ever runs (they wouldn't even
pass with disabled spinlocks, as S_LOCK_FREE isn't implemented).
+ /* and that "contended" acquisition works */ + s_lock(&struct_w_lock.lock, "testfile", 17, "testfunc"); + S_UNLOCK(&struct_w_lock.lock);I didn't think we had formally promised that s_lock() is actually
defined or working on all platforms.
Hm? Isn't s_lock the, as its comment says, "platform-independent portion
of waiting for a spinlock."? I also don't think we need to purely
follow external APIs in internal tests.
More generally, I don't think it's entirely clear what all of these
tests are testing. Like, I can see that data_before and data_after are
intended to test that the lock actually fits in the space allowed for
it, but at the same time, I think empty implementations of all of
these functions would pass regression, as would many horribly or
subtly buggy implementations.
Sure, there's a lot that'd pass. But it's more than we had before. It
did catch a bug much quicker than I'd have otherwise found it, FWIW.
I don't think an empty implementation would pass btw, as long as TAS is
defined.
So I feel like the intent of these tests isn't entirely clear, and
should probably be explained better, at a minimum -- and perhaps we
should think harder about what a good testing framework would look
like.
Yea, we could use something better. But I don't see that happening
quickly, and having something seems better than nothing.
I would rather have tests that either pass or fail and report a result
explicitly, rather than tests that rely on hangs or crashes.
That seems quite hard to achieve. I really just wanted to have something
I can do some very basic tests to catch issues quicker.
The atomics tests found numerous issues btw, despite also not testing
concurrency.
I think we generally have way too few of such trivial tests. They can
find plenty "real world" issues, but more importantly make it much
quicker to iterate when working on some piece of code.
I don't have any real complaints about the functionality of 0004 on a
quick read-through, but I'm again a bit skeptical of the tests. Not as
much as with 0003, though.
Without the tests I couldn't even reproduce a deadlock due to the
nesting. So they imo are pretty essential?
Greetings,
Andres Freund
On Tue, Jun 16, 2020 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
I think 0003 looks a little strange: it seems to be
testing things that might be implementation details of other things,
and I'm not sure that's really correct. In particular:
My main motivation was to have something that runs more often than
the embedded test in s_lock.c that nobody ever runs (they wouldn't even
pass with disabled spinlocks, as S_LOCK_FREE isn't implemented).
Sure, that makes sense.
+ /* and that "contended" acquisition works */ + s_lock(&struct_w_lock.lock, "testfile", 17, "testfunc"); + S_UNLOCK(&struct_w_lock.lock);I didn't think we had formally promised that s_lock() is actually
defined or working on all platforms.Hm? Isn't s_lock the, as its comment says, "platform-independent portion
of waiting for a spinlock."? I also don't think we need to purely
follow external APIs in internal tests.
I feel like we at least didn't use to use that on all platforms, but I
might be misremembering. It seems odd and confusing that we have both
S_LOCK() and s_lock(), anyway. Differentiating functions based on case
is not great practice.
Sure, there's a lot that'd pass. But it's more than we had before. It
did catch a bug much quicker than I'd have otherwise found it, FWIW.
I don't think an empty implementation would pass btw, as long as TAS is
defined.
Fair enough.
Yea, we could use something better. But I don't see that happening
quickly, and having something seems better than nothing.
That seems quite hard to achieve. I really just wanted to have something
I can do some very basic tests to catch issues quicker.
The atomics tests found numerous issues btw, despite also not testing
concurrency.
I think we generally have way too few of such trivial tests. They can
find plenty "real world" issues, but more importantly make it much
quicker to iterate when working on some piece of code.
Without the tests I couldn't even reproduce a deadlock due to the
nesting. So they imo are pretty essential?
I'm not telling you not to commit these; I'm just more skeptical of
whether they are the right approach than you seem to be. But that's
OK: people can like different things, and I don't know exactly what
would be better anyway.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-06-17 10:34:31 -0400, Robert Haas wrote:
On Tue, Jun 16, 2020 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
I think 0003 looks a little strange: it seems to be
testing things that might be implementation details of other things,
and I'm not sure that's really correct. In particular:
Hm? Isn't s_lock the, as its comment says, "platform-independent portion
of waiting for a spinlock."? I also don't think we need to purely
follow external APIs in internal tests.
I feel like we at least didn't use to use that on all platforms, but I
might be misremembering.
There's only one definition of S_LOCK, and s_lock is the only spinlock
related user of perform_spin_delay(). So I don't think so?
It seems odd and confusing that we have both
S_LOCK() and s_lock(), anyway. Differentiating functions based on case
is not great practice.
It's a terrible idea, yes. Since we don't actually have any non-default
implementations of S_LOCK, perhaps we should just rip it out? It'd
probably be clearer if SpinLockAcquire() would be what uses TAS() and
falls back to s_lock (best renamed to s_lock_slowpath or such).
It'd perhaps also be good to make SpinLockAcquire() a static inline
instead of a #define, so it can be properly attributed in debuggers and
profilers.
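A minimal sketch of what that could look like in spin.h, assuming s_lock() gets renamed to the hypothetical s_lock_slowpath():

static inline void
SpinLockAcquire(volatile slock_t *lock)
{
	/* TAS() returns nonzero if the lock was already held */
	if (TAS(lock))
		s_lock_slowpath(lock, __FILE__, __LINE__, PG_FUNCNAME_MACRO);
}

static inline void
SpinLockRelease(volatile slock_t *lock)
{
	S_UNLOCK(lock);
}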
Greetings,
Andres Freund
On Wed, Jun 17, 2020 at 2:33 PM Andres Freund <andres@anarazel.de> wrote:
It seems odd and confusing that we have both
S_LOCK() and s_lock(), anyway. Differentiating functions based on case
is not great practice.
It's a terrible idea, yes. Since we don't actually have any non-default
implementations of S_LOCK, perhaps we should just rip it out?
I think we should rip out the conditional nature of the definition and
fix the comments. I don't think I prefer getting rid of it completely.
But then again on the other hand, what's the point of this crap anyway:
#define SpinLockInit(lock) S_INIT_LOCK(lock)
#define SpinLockAcquire(lock) S_LOCK(lock)
#define SpinLockRelease(lock) S_UNLOCK(lock)
#define SpinLockFree(lock) S_LOCK_FREE(lock)
This seems like it's straight out of the department of pointless
abstraction layers. Maybe we should remove all of the S_WHATEVER()
stuff and just define SpinLockAcquire() where we currently define
S_LOCK(), SpinLockRelease() where we currently define S_UNLOCK(), etc.
And, as you say, make them static inline functions while we're at it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
This seems like it's straight out of the department of pointless
abstraction layers. Maybe we should remove all of the S_WHATEVER()
stuff and just define SpinLockAcquire() where we currently define
S_LOCK(), SpinLockRelease() where we currently define S_UNLOCK(), etc.
And, as you say, make them static inline functions while we're at it.
The macros are kind of necessary unless you want to make s_lock.h
a bunch messier, because we use #ifdef tests on them.
We could get rid of the double layer of macros, sure, but TBH that
sounds like change for the sake of change rather than a useful
improvement.
regards, tom lane
On Wed, Jun 17, 2020 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
This seems like it's straight out of the department of pointless
abstraction layers. Maybe we should remove all of the S_WHATEVER()
stuff and just define SpinLockAcquire() where we currently define
S_LOCK(), SpinLockRelease() where we currently define S_UNLOCK(), etc.
And, as you say, make them static inline functions while we're at it.
The macros are kind of necessary unless you want to make s_lock.h
a bunch messier, because we use #ifdef tests on them.
Where?
We could get rid of the double layer of macros, sure, but TBH that
sounds like change for the sake of change rather than a useful
improvement.
Really? Multiple layers of macros seem like they pretty clearly make
the source code harder to understand. There are plenty of places where
such devices are necessary for one reason or another, but it doesn't
seem like something we ought to keep around for no reason.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Jun 17, 2020 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The macros are kind of necessary unless you want to make s_lock.h
a bunch messier, because we use #ifdef tests on them.
Where?
See the "Default Definitions", down near the end.
We could get rid of the double layer of macros, sure, but TBH that
sounds like change for the sake of change rather than a useful
improvement.
Really? Multiple layers of macros seem like they pretty clearly make
the source code harder to understand. There are plenty of places where
such devices are necessary for one reason or another, but it doesn't
seem like something we ought to keep around for no reason.
I wouldn't object to making the outer-layer macros in spin.h into static
inlines; as mentioned, that might have some debugging benefits. But I
think messing with s_lock.h for marginal cosmetic reasons is a foolish
idea. For one thing, there's no way whoever does it can verify all the
architecture-specific stanzas. (I don't think we even have all of them
covered in the buildfarm.)
regards, tom lane
On Wed, Jun 17, 2020 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
See the "Default Definitions", down near the end.
Oh, yeah. I didn't realize you meant just inside this file itself.
That is slightly awkward. I initially thought there was no problem
because there seem to be no remaining non-default definitions of
S_LOCK, but I now see that the other macros still do have some
non-default definitions. Hmm.
Really? Multiple layers of macros seem like they pretty clearly make
the source code harder to understand. There are plenty of places where
such devices are necessary for one reason or another, but it doesn't
seem like something we ought to keep around for no reason.
I wouldn't object to making the outer-layer macros in spin.h into static
inlines; as mentioned, that might have some debugging benefits. But I
think messing with s_lock.h for marginal cosmetic reasons is a foolish
idea. For one thing, there's no way whoever does it can verify all the
architecture-specific stanzas. (I don't think we even have all of them
covered in the buildfarm.)
It would be a pretty mechanical change to use a separate preprocessor
symbol for the conditional and just define the static inline functions
on the spot. There might be one or two goofs, but if those platforms
are not in the buildfarm, they're either dead and they don't matter,
or someone will tell us what we did wrong. I don't know. I don't have
a huge desire to spend time cleaning up s_lock.h and I do think it's
better not to churn stuff around just for the heck of it, but I'm also
sympathetic to Andres's point that using macros everywhere is
debugger-unfriendly.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Jun 17, 2020 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wouldn't object to making the outer-layer macros in spin.h into static
inlines; as mentioned, that might have some debugging benefits. But I
think messing with s_lock.h for marginal cosmetic reasons is a foolish
idea. For one thing, there's no way whoever does it can verify all the
architecture-specific stanzas. (I don't think we even have all of them
covered in the buildfarm.)
It would be a pretty mechanical change to use a separate preprocessor
symbol for the conditional and just define the static inline functions
on the spot. There might be one or two goofs, but if those platforms
are not in the buildfarm, they're either dead and they don't matter,
or someone will tell us what we did wrong. I don't know. I don't have
a huge desire to spend time cleaning up s_lock.h and I do think it's
better not to churn stuff around just for the heck of it, but I'm also
sympathetic to Andres's point that using macros everywhere is
debugger-unfriendly.
Sure, but wouldn't making the SpinLockAcquire layer into static inlines be
sufficient to address that point, with no need to touch s_lock.h at all?
regards, tom lane
On Thu, Jun 18, 2020 at 11:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Sure, but wouldn't making the SpinLockAcquire layer into static inlines be
sufficient to address that point, with no need to touch s_lock.h at all?
I mean, wouldn't you then end up with a bunch of 1-line functions
where you can step into the function but not through whatever
individual things it does?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jun 18, 2020 at 11:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Sure, but wouldn't making the SpinLockAcquire layer into static inlines be
sufficient to address that point, with no need to touch s_lock.h at all?
I mean, wouldn't you then end up with a bunch of 1-line functions
where you can step into the function but not through whatever
individual things it does?
Not following your point. The s_lock.h implementations tend to be either
simple C statements ("*lock = 0") or asm blocks; if you feel a need to
step through them you're going to be resorting to "si" anyway.
I think the main usefulness of doing anything here would be (a) separating
the spinlock infrastructure from callers and (b) ensuring that we have a
declared argument type, and single-evaluation semantics, for the spinlock
function parameters. Both of those are adequately addressed by fixing
spin.h, IMO anyway.
regards, tom lane
Hi,
On 2020-06-18 12:29:40 -0400, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jun 18, 2020 at 11:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Sure, but wouldn't making the SpinLockAcquire layer into static inlines be
sufficient to address that point, with no need to touch s_lock.h at all?
I mean, wouldn't you then end up with a bunch of 1-line functions
where you can step into the function but not through whatever
individual things it does?
Not following your point. The s_lock.h implementations tend to be either
simple C statements ("*lock = 0") or asm blocks; if you feel a need to
step through them you're going to be resorting to "si" anyway.
I agree on that.
I do think it'd be better to not have the S_LOCK macro though (but have
TAS/TAS_SPIN as we do now). And instead have SpinLockAcquire() call
s_lock() (best renamed to something a bit more meaningful). Makes the
code a bit easier to understand (no S_LOCK vs s_lock) and yields simpler
macros.
There's currently no meaningful ifdefs for S_LOCK (in contrast to
e.g. S_UNLOCK), so I don't see it making s_lock.h more complicated.
I think part of the issue here is that the naming of the s_lock exposed
macros is confusing. We have S_INIT_LOCK, TAS, SPIN_DELAY,
TAS_SPIN, S_UNLOCK, S_LOCK_FREE that are essentially hardware
dependent. But then there's S_LOCK which basically isn't.
It may have made some sense when the file was originally written, if one
imagines that S_ is the only external API, and the rest is
implementation. But given that s_lock() uses TAS() directly (and says
"platform-independent portion of waiting for a spinlock"), and that we
only have one definition of S_UNLOCK that doesn't seem quite right.
I think the main usefulness of doing anything here would be (a) separating
the spinlock infrastructure from callers and (b) ensuring that we have a
declared argument type, and single-evaluation semantics, for the spinlock
function parameters. Both of those are adequately addressed by fixing
spin.h, IMO anyway.
The abstraction point made me grep for includes of s_lock.h. Seems we
have some unnecessary includes of s_lock.h.
lwlock.h doesn't need to include spinlock-related things anymore, and
hasn't for years; the spinlocks were replaced with atomics. That seems
pretty obvious. I'm gonna fix that in master, unless somebody thinks we
should do that more widely?
But we also have s_lock.h includes in condition_variable.h and
main.c. Seems the former should instead include spin.h and the latter
include should just be removed?
All of this however makes me wonder whether it's worth polishing
spinlocks instead of just ripping them out. Obviously we'd still need a
fallback for atomics, but it should not be hard to just replace the
spinlock use with either "smaller" atomics or semaphores.
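As a sketch of the "smaller atomics" direction (pg_atomic_flag and the spin-delay helpers exist today; the lock functions themselves are hypothetical, and the no-atomics fallback problem obviously remains):

#include "postgres.h"
#include "port/atomics.h"
#include "storage/s_lock.h"

static inline void
flag_lock_acquire(volatile pg_atomic_flag *lock)
{
	SpinDelayStatus delayStatus;

	/* uncontended fast path */
	if (pg_atomic_test_set_flag(lock))
		return;

	init_local_spin_delay(&delayStatus);
	while (!pg_atomic_test_set_flag(lock))
		perform_spin_delay(&delayStatus);
	finish_spin_delay(&delayStatus);
}

static inline void
flag_lock_release(volatile pg_atomic_flag *lock)
{
	pg_atomic_clear_flag(lock);
}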
Greetings,
Andres Freund
Hi,
On 2020-06-16 15:20:11 -0400, Alvaro Herrera wrote:
On 2020-Jun-15, Andres Freund wrote:
Also, I wonder if someone would be willing to set up a BF animal for this.
FWIW, I've requested a buildfarm animal id for this a few days ago, but
haven't received a response yet...
I did send it out, with name rorqual -- didn't you get that? Will send
the secret separately.
That animal is now live. Will take a bit for all branches to report in
though.
Need to get faster storage for my buildfarm animal host...
Greetings,
Andres Freund