Issue with the PRNG used by Postgres

Started by Parag Paul · almost 2 years ago · 59 messages
#1 Parag Paul
parag.paul@gmail.com

hi all,
We have an interesting problem: PG went to PANIC with a stuck-spinlock
condition.
After careful analysis and hours of trying to reproduce this (it showed
up in production only after almost 2 weeks of a stress run), I did some
statistical analysis on the RNG that PG uses to compute the backoff for
the spin locks.

My expectation is that this RNG will be fair to all backends.
After studying the underlying generator that PG uses, "xoroshiro", which
is from the LFSR family, it appears that it fails a number of
statistical tests, as discussed in various blogs.
The one that caught my attention was:
https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html
It describes the zeroland problem and the big "escape belt" around
low-Hamming-weight states. What that means is that when the state
reaches a low Hamming weight (mostly 0 bits, few 1 bits), the generator
takes a while to get back to normal entropy.

Here is the code from s_lock.c that computes the backoff delay before
the TAS is tried again.
Also, there is in all likelihood a bug here: the author wanted to round
off the value coming out of the equation by adding 0.5, but integer
casting in C/C++ drops the fractional part ((int) 12.9999 == 12).
If the RNG keeps producing values close to 0, the increment keeps
casting to 0, the delay never increases, and the backend can end up
spinning 1000 times within 1 second. That is one of the reasons why we
could reproduce this by forcing low RNG values in our test cases.
So, the steps that I would recommend:
1. Use ceiling for the round-off.
2. Increase cur_delay by max(1000, RNG-generated value).
3. Change the generator.
4. Detect the near-zero state, and then reseed.

    /* increase delay by a random fraction between 1X and 2X */
    status->cur_delay += (int) (status->cur_delay *
                                pg_prng_double(&pg_global_prng_state) + 0.5);

#2 Andres Freund
andres@anarazel.de
In reply to: Parag Paul (#1)
Re: Issue with the PRNG used by Postgres

Hi,

On 2024-04-08 22:52:09 -0700, Parag Paul wrote:

We have an interesting problem: PG went to PANIC with a stuck-spinlock
condition.
After careful analysis and hours of trying to reproduce this (it showed
up in production only after almost 2 weeks of a stress run), I did some
statistical analysis on the RNG that PG uses to compute the backoff for
the spin locks.

ISTM that the fix here is to not use a spinlock for whatever the contention is
on, rather than improve the RNG.

Greetings,

Andres Freund

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#2)
Re: Issue with the PRNG used by Postgres

On Tue, Apr 9, 2024 at 5:05 PM Andres Freund <andres@anarazel.de> wrote:

ISTM that the fix here is to not use a spinlock for whatever the contention is
on, rather than improve the RNG.

I'm not convinced that we should try to improve the RNG, but surely we
need to put parentheses around pg_prng_double(&pg_global_prng_state) +
0.5. IIUC, the current logic is making us multiply the spin delay by a
value between 0 and 1 when what was intended was that it should be
multiplied by a value between 0.5 and 1.5.

If I'm reading this correctly, this was introduced here:

commit 59bb147353ba274e0836d06f429176d4be47452c
Author: Bruce Momjian <bruce@momjian.us>
Date: Fri Feb 3 12:45:47 2006 +0000

Update random() usage so ranges are inclusive/exclusive as required.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4 Parag Paul
parag.paul@gmail.com
In reply to: Andres Freund (#2)
Re: Issue with the PRNG used by Postgres

Hi Andres,
This is a little more complex than that. The spinlocks are taken in the
LWLock (mutex) code when the lock is not available right away.
The spinlock is taken to attach the current backend to the wait list of
the LWLock, which means this path cannot be avoided.
When the repro happens, it affects any mutex or LWLock code path, since
a low-Hamming-weight state can remove fairness from the system.

Also, I believe the rounding-off error in the delay computation still
remains. I will send a patch today.

Thanks for the response.
-Parag


#5 Parag Paul
parag.paul@gmail.com
In reply to: Robert Haas (#3)
Re: Issue with the PRNG used by Postgres

Thank you Robert.
I am in the process of patching this.
-Parag


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#3)
Re: Issue with the PRNG used by Postgres

Robert Haas <robertmhaas@gmail.com> writes:

I'm not convinced that we should try to improve the RNG, but surely we
need to put parentheses around pg_prng_double(&pg_global_prng_state) +
0.5. IIUC, the current logic is making us multiply the spin delay by a
value between 0 and 1 when what was intended was that it should be
multiplied by a value between 0.5 and 1.5.

No, I think you are misreading it, because the assignment is += not =.
The present coding is

    /* increase delay by a random fraction between 1X and 2X */
    status->cur_delay += (int) (status->cur_delay *
                                pg_prng_double(&pg_global_prng_state) + 0.5);

which looks fine to me. The +0.5 is so that the conversion to integer
rounds rather than truncating.

In any case, I concur with Andres: if this behavior is anywhere near
critical then the right fix is to not be using spinlocks.

regards, tom lane

#7 Parag Paul
parag.paul@gmail.com
In reply to: Tom Lane (#6)
Re: Issue with the PRNG used by Postgres

hi Tom,
First of all, thanks for your response. I did not misread it. The 0.5
is added to the result of the multiplication, which then goes through C
integer casting, and the cast itself does not round; it just drops the
fractional part.

    status->cur_delay += (int) (status->cur_delay *
                                pg_prng_double(&pg_global_prng_state) + 0.5);

So, if the RNG generated 0.0000001 and cur_delay = 1000, the result is
1000 + (int) (1000 * 0.0000001 + 0.5) = 1000 + (int) 0.5001 = 1000 + 0 = 1000
<-- back to the same value
and that keeps happening while the RNG stays in the low-Hamming-weight
state. If the avalanche happens soon, it corrects itself, but in the
meantime we have a stuck-spinlock PANIC that could bring down a
production server.
-Parag


#8 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: Issue with the PRNG used by Postgres

On Wed, Apr 10, 2024 at 11:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

No, I think you are misreading it, because the assignment is += not =.
The present coding is

    /* increase delay by a random fraction between 1X and 2X */
    status->cur_delay += (int) (status->cur_delay *
                                pg_prng_double(&pg_global_prng_state) + 0.5);

which looks fine to me. The +0.5 is so that the conversion to integer
rounds rather than truncating.

Oh, yeah ... right. But then why does the comment say that it's
increasing the delay by a random fraction between 1X and 2X? Isn't
this a random fraction between 0X and 1X? Or am I still full of
garbage?

In any case, I concur with Andres: if this behavior is anywhere near
critical then the right fix is to not be using spinlocks.

It would certainly be interesting to know which spinlocks were at
issue, here. But it also seems to me that it's reasonable to be
unhappy about the possibility of this increasing cur_delay by exactly
0. Storms of spinlock contention can and do happen on real-world
production servers, and trying to guess which spinlocks might be
affected before it happens is difficult. Until we fully eliminate all
spinlocks from our code base, the theoretical possibility of such
storms remains, and if making sure that the increment here is >0
mitigates that, then I think we should. Mind you, I don't think that
the original poster has necessarily provided convincing or complete
evidence to justify a change here, but I find it odd that you and
Andres seem to want to dismiss it out of hand.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Parag Paul (#7)
Re: Issue with the PRNG used by Postgres

Parag Paul <parag.paul@gmail.com> writes:

So, if the RNG generated 0.0000001 and cur_delay = 1000, the result is
1000 + (int) (1000 * 0.0000001 + 0.5) = 1000 + (int) 0.5001 = 1000 + 0 = 1000
<-- back to the same value

Yes, with a sufficiently small RNG result, the sleep delay will not
increase that time through the loop. So what? It's supposed to be
a random amount of backoff, and I don't see why "random" shouldn't
occasionally include "zero". Why would requiring the delay to
increase improve matters at all?

Now, if the xoroshiro RNG is so bad that there's a strong likelihood
that multiple backends will sit on cur_delay = MIN_DELAY_USEC for
a long time, then yeah we could have a thundering-herd problem with
contention for the spinlock. You've provided no reason to think
that the probability of that is more than microscopic, though.
(If it is more than microscopic, then we are going to have
nonrandomness problems in a lot of other places, so I'd lean
towards revising the RNG not band-aiding s_lock. We've seen
nothing to suggest such problems, though.)

regards, tom lane

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
Re: Issue with the PRNG used by Postgres

Robert Haas <robertmhaas@gmail.com> writes:

Oh, yeah ... right. But then why does the comment say that it's
increasing the delay between a random fraction between 1X and 2X?

I think the comment is meant to say that the new delay value will be
1X to 2X the old value. If you want to suggest different phrasing,
feel free.

It would certainly be interesting to know which spinlocks were at
issue, here. But it also seems to me that it's reasonable to be
unhappy about the possibility of this increasing cur_delay by exactly
0.

As I said to Parag, I see exactly no reason to believe that that's a
problem, unless it happens *a lot*, like hundreds of times in a row.
If it does, that's an RNG problem not s_lock's fault. Now, I'm not
going to say that xoroshiro can't potentially do that, because I've
not studied it. But if it is likely to do that, I'd think we'd
have noticed the nonrandomness in other ways.

regards, tom lane

#11 Andres Freund
andres@anarazel.de
In reply to: Parag Paul (#4)
Re: Issue with the PRNG used by Postgres

Hi,

On 2024-04-10 07:55:16 -0700, Parag Paul wrote:

This is a little more complex than that. The spinlocks are taken in the
LWLock (mutex) code when the lock is not available right away.
The spinlock is taken to attach the current backend to the wait list of
the LWLock, which means this path cannot be avoided.
When the repro happens, it affects any mutex or LWLock code path, since
a low-Hamming-weight state can remove fairness from the system.

Please provide a profile of a problematic workload. I'm fairly suspicious
of the claim that RNG unfairness is a real part of the problem here.

Greetings,

Andres Freund

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: Issue with the PRNG used by Postgres

Actually ... Parag mentioned that this was specifically about
lwlock.c's usage of spinlocks. It doesn't really use a spinlock,
but it does use s_lock.c's delay logic, and I think it's got the
usage pattern wrong:

    while (true)
    {
        /* always try once to acquire lock directly */
        old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
        if (!(old_state & LW_FLAG_LOCKED))
            break;              /* got lock */

        /* and then spin without atomic operations until lock is released */
        {
            SpinDelayStatus delayStatus;

            init_local_spin_delay(&delayStatus);

            while (old_state & LW_FLAG_LOCKED)
            {
                perform_spin_delay(&delayStatus);
                old_state = pg_atomic_read_u32(&lock->state);
            }
#ifdef LWLOCK_STATS
            delays += delayStatus.delays;
#endif
            finish_spin_delay(&delayStatus);
        }

        /*
         * Retry. The lock might obviously already be re-acquired by the time
         * we're attempting to get it again.
         */
    }

I don't think it's correct to re-initialize the SpinDelayStatus each
time around the outer loop. That state should persist through the
entire acquire operation, as it does in a regular spinlock acquire.
As this stands, it resets the delay to minimum each time around the
outer loop, and I bet it is that behavior not the RNG that's to blame
for what he's seeing.

(One should still wonder what is the LWLock usage pattern that is
causing this spot to become so heavily contended.)

regards, tom lane

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: Issue with the PRNG used by Postgres

On Wed, Apr 10, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

As I said to Parag, I see exactly no reason to believe that that's a
problem, unless it happens *a lot*, like hundreds of times in a row.
If it does, that's an RNG problem not s_lock's fault. Now, I'm not
going to say that xoroshiro can't potentially do that, because I've
not studied it. But if it is likely to do that, I'd think we'd
have noticed the nonrandomness in other ways.

The blog post to which Parag linked includes this histogram as an
example of a low-Hamming-weight situation:

Reps | Value
-----+--------
84 | 0x00
10 | 0x2d
6 | 0xa0
6 | 0x68
6 | 0x40
6 | 0x02
5 | 0x0b
4 | 0x05
4 | 0x01
3 | 0xe1
3 | 0x5a
3 | 0x41
3 | 0x20
3 | 0x16

That's certainly a lot more zeroes than you'd like to see in 192 bytes
of output, but it's not hundreds in a row either.

Also, the blog post says "On the one hand, from a practical
perspective, having vastly, vastly more close repeats than it should
isn't likely to be an issue that users will ever detect in practice.
Xoshiro256's large state space means it is too big to fail any
statistical tests that look for close repeats." So your theory that we
have a bug elsewhere in the code seems like it might be the real
answer.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14 Parag Paul
parag.paul@gmail.com
In reply to: Tom Lane (#12)
Re: Issue with the PRNG used by Postgres

hi Tom,
Sorry for the delayed response. I was collecting the data from my
production servers.

The reason why this could be a problem is a flaw in the RNG around the
enlarged low-Hamming-weight belt.
I attached an image here with the RNG outputs from 2 backends. I ran our
code for weeks and collected the values generated by the RNG over many
backends. The one in green (say backend 600) stopped fluctuating and
produced only low (near 0) values for half an hour, whereas the blue one
(say backend 700) kept generating good values across the whole [0, 1)
range.
During this period, backend 600 suffered and ended up in the
stuck-spinlock condition.

Extensive analysis here:
https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html
-Parag


Attachments:

image001.png (image/png)
#15 Parag Paul
parag.paul@gmail.com
In reply to: Robert Haas (#13)
Re: Issue with the PRNG used by Postgres

hi Robert,
We are using xoroshiro128 and have not moved to the newer state of the
art. We did see a lot of low values, as shown in my last message.
-Parag


#16 Robert Haas
robertmhaas@gmail.com
In reply to: Parag Paul (#14)
Re: Issue with the PRNG used by Postgres

On Wed, Apr 10, 2024 at 12:40 PM Parag Paul <parag.paul@gmail.com> wrote:

The reason why this could be a problem is a flaw in the RNG around the enlarged low-Hamming-weight belt.
I attached an image here with the RNG outputs from 2 backends. I ran our code for weeks and collected the values generated by the RNG over many backends. The one in green (say backend 600) stopped fluctuating and produced only low (near 0) values for half an hour, whereas the blue one (say backend 700) kept generating good values across the whole [0, 1) range.
During this period, backend 600 suffered and ended up in the stuck-spinlock condition.

This is a very vague description of a test procedure. If you provide a
reproducible series of steps that causes a stuck spinlock, I imagine
everyone will be on board with doing something about it. But this
graph is not going to convince anyone of anything.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17 Parag Paul
parag.paul@gmail.com
In reply to: Robert Haas (#16)
Re: Issue with the PRNG used by Postgres

Yes, the probability of this happening is astronomical, but in
production with 128-core servers, 7000 max_connections, and
petabyte-scale data, this did repro 2 times in the last month. We had to
move to a local approach to manage our rate-limiting counters.
This is not easily reproducible. I feel that we should at least shield
ourselves with the following change, so that the delay increases by at
least 1000us every time. That degrades to a linear backoff in the worst
case, but it is better than no backoff at all.

    status->cur_delay += Max(1000, (int) (status->cur_delay *
                                          pg_prng_double(&pg_global_prng_state) + 0.5));


#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#12)
Re: Issue with the PRNG used by Postgres

I wrote:

I don't think it's correct to re-initialize the SpinDelayStatus each
time around the outer loop. That state should persist through the
entire acquire operation, as it does in a regular spinlock acquire.
As this stands, it resets the delay to minimum each time around the
outer loop, and I bet it is that behavior not the RNG that's to blame
for what he's seeing.

After thinking about this some more, it is fairly clear that that *is*
a mistake that can cause a thundering-herd problem. Assume we have
two or more backends waiting in perform_spin_delay, and for whatever
reason the scheduler wakes them up simultaneously. They see the
LW_FLAG_LOCKED bit clear (because whoever had the lock when they went
to sleep is long gone) and iterate the outer loop. One gets the lock;
the rest go back to sleep. And they will all sleep exactly
MIN_DELAY_USEC, because they all reset their SpinDelayStatus.
Lather, rinse, repeat. If the s_lock code were being used as
intended, they would soon acquire different cur_delay settings;
but that never happens. That is not the RNG's fault.

So I think we need something like the attached.

regards, tom lane

Attachments:

fix-improper-spin-delay-logic.patch (text/x-diff, +27/-24)
#19 Andres Freund
andres@anarazel.de
In reply to: Parag Paul (#17)
Re: Issue with the PRNG used by Postgres

Hi,

On 2024-04-10 09:48:42 -0700, Parag Paul wrote:

Yes, the probability of this happening is astronomical, but in production
with 128-core servers, 7000 max_connections, and petabyte-scale data,
this did repro 2 times in the last month. We had to move to a local
approach to manage our rate-limiting counters.

What version of PG was this? I think it's much more likely that you're
hitting a bug that caused a lot more contention inside lwlocks. That was fixed
for 16+ in a4adc31f690 on 2022-11-20, but only backpatched to 12-15 on
2024-01-18.

Greetings,

Andres Freund

#20 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#12)
Re: Issue with the PRNG used by Postgres

Hi,

On 2024-04-10 12:28:10 -0400, Tom Lane wrote:

Actually ... Parag mentioned that this was specifically about
lwlock.c's usage of spinlocks. It doesn't really use a spinlock,
but it does use s_lock.c's delay logic, and I think it's got the
usage pattern wrong:

    while (true)
    {
        /* always try once to acquire lock directly */
        old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
        if (!(old_state & LW_FLAG_LOCKED))
            break;              /* got lock */

        /* and then spin without atomic operations until lock is released */
        {
            SpinDelayStatus delayStatus;

            init_local_spin_delay(&delayStatus);

            while (old_state & LW_FLAG_LOCKED)
            {
                perform_spin_delay(&delayStatus);
                old_state = pg_atomic_read_u32(&lock->state);
            }
#ifdef LWLOCK_STATS
            delays += delayStatus.delays;
#endif
            finish_spin_delay(&delayStatus);
        }

        /*
         * Retry. The lock might obviously already be re-acquired by the time
         * we're attempting to get it again.
         */
    }

I don't think it's correct to re-initialize the SpinDelayStatus each
time around the outer loop. That state should persist through the
entire acquire operation, as it does in a regular spinlock acquire.
As this stands, it resets the delay to minimum each time around the
outer loop, and I bet it is that behavior not the RNG that's to blame
for what he's seeing.

Hm, yea, that's not right. Normally this shouldn't be heavily contended
enough to matter. I don't think just pulling the spin-delay setup out of
the loop would be the right thing though, because that'd mean we'd
initialize it even in the much more common case of there being no
contention. I think we'll have to add a separate fetch_or before the
outer loop.

(One should still wonder what is the LWLock usage pattern that is
causing this spot to become so heavily contended.)

My suspicion is that it's a4adc31f690 which we only recently backpatched to
< 16.

Greetings,

Andres Freund

#21 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Parag Paul (#17)
#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#20)
#23 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#18)
#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#23)
#25 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
#26 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#24)
#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#26)
#28 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#27)
#29 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#28)
#30 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#29)
#31 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#30)
#32 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#29)
#33 Andrey Borodin
amborodin@acm.org
In reply to: Parag Paul (#17)
#34 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#32)
#35 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#34)
#36 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#34)
#37 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#32)
#38 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#35)
#39 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#38)
#40 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#39)
#41 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#35)
#42 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#40)
#43 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#41)
#44 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#43)
#45 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#44)
#46 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#45)
#47 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#46)
#48 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#47)
#49 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#47)
#50 Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#49)
#51 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#49)
#52 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#51)
#53 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#52)
#54 Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#50)
#55 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#49)
#56 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#55)
#57 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#56)
#58 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#57)
#59 Andrey Borodin
amborodin@acm.org
In reply to: Andres Freund (#58)