Issue with the PRNG used by Postgres
hi all,
We have an interesting problem, where PG went to PANIC due to a
stuck-spinlock case.
After careful analysis and hours of trying to reproduce this (something
that showed up in production after almost two weeks of a stress run), I
did some statistical analysis on the RNG that PG uses to compute the
backoff for spinlocks.
My expectation is that this RNG should be fair to all backends.
After studying the underlying generator that PG uses, "xoroshiro", which
is from the LFSR family, it seems that it has failed a number of
statistical tests, and this has been discussed in various blogs.
The one that caught my attention was:
https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html
This mentions the "zeroland" problem and the long escape belt at low
Hamming weight. What that means is that when the state reaches a
low-Hamming-weight condition (a state that is mostly 0 bits with very
few 1 bits), it takes a while to get back to normal entropy.
Here is the code from s_lock.c that shows how we add delay for the
backoff before the TAS is retried.
Also, there is in all likelihood a bug here, as the author wanted to
round off the value coming out of the expression by adding 0.5, but
integer casting in C/C++ drops the fractional part ((int) 12.9999 == 12).
And if the RNG keeps producing values close to 0, the delay will never
increase (the increment keeps truncating to 0), and a backend can burn
through its 1000 allowed delays within a single second, at which point
the stuck-spinlock check PANICs. That is one of the reasons why we could
reproduce this by forcing low RNG values in our test cases.
So, steps that I would recommend:
1. First, use ceiling for the round-off.
2. Increase cur_delay by max(1000, RNG-generated value).
3. Change the hashing function.
4. Detect the near-zero state, and then try to reseed.
/* increase delay by a random fraction between 1X and 2X */
status->cur_delay += (int) (status->cur_delay *
                            pg_prng_double(&pg_global_prng_state) +
                            0.5);
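To make the failure mode concrete, here is a minimal standalone C
sketch (not PostgreSQL code; the hard-coded samples stand in for
pg_prng_double() output, and 1000 usec matches s_lock.c's minimum
delay):

#include <stdio.h>

int
main(void)
{
    int         cur_delay = 1000;   /* usec, s_lock.c's minimum delay */
    double      samples[] = {0.0001, 0.0004, 0.5, 0.9999};

    for (int i = 0; i < 4; i++)
    {
        /* the s_lock.c expression; the (int) cast truncates toward zero */
        int         inc = (int) (cur_delay * samples[i] + 0.5);

        printf("r = %.4f -> increment = %d\n", samples[i], inc);
    }
    return 0;
}

This prints increments of 0, 0, 500 and 1000: any draw below
0.5/cur_delay leaves cur_delay unchanged, which is the stall described
above.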
Hi,
On 2024-04-08 22:52:09 -0700, Parag Paul wrote:
We have an interesting problem, where PG went to PANIC due to a
stuck-spinlock case.
After careful analysis and hours of trying to reproduce this (something
that showed up in production after almost two weeks of a stress run), I
did some statistical analysis on the RNG that PG uses to compute the
backoff for spinlocks.
ISTM that the fix here is to not use a spinlock for whatever the contention is
on, rather than improve the RNG.
Greetings,
Andres Freund
On Tue, Apr 9, 2024 at 5:05 PM Andres Freund <andres@anarazel.de> wrote:
ISTM that the fix here is to not use a spinlock for whatever the contention is
on, rather than improve the RNG.
I'm not convinced that we should try to improve the RNG, but surely we
need to put parentheses around pg_prng_double(&pg_global_prng_state) +
0.5. IIUC, the current logic is making us multiply the spin delay by a
value between 0 and 1 when what was intended was that it should be
multiplied by a value between 0.5 and 1.5.
If I'm reading this correctly, this was introduced here:
commit 59bb147353ba274e0836d06f429176d4be47452c
Author: Bruce Momjian <bruce@momjian.us>
Date: Fri Feb 3 12:45:47 2006 +0000
Update random() usage so ranges are inclusive/exclusive as required.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Andres,
This is a little bit more complex than that. The spinlocks are taken in
the LWLock (mutex) code, when the lock is not available right away.
The spinlock is taken to attach the current backend to the wait list of
the LWLock. This means that this cannot be controlled.
When the repro happens, it affects any mutex or LWLock code path, since
the low Hamming weight can cause problems by removing fairness from the
system.
Also, I believe the rounding-off error still remains. I will send a
patch today.
Thanks for the response.
-Parag
Thank you Robert.
I am in the process of patching this.
-Parag
Robert Haas <robertmhaas@gmail.com> writes:
I'm not convinced that we should try to improve the RNG, but surely we
need to put parentheses around pg_prng_double(&pg_global_prng_state) +
0.5. IIUC, the current logic is making us multiply the spin delay by a
value between 0 and 1 when what was intended was that it should be
multiplied by a value between 0.5 and 1.5.
No, I think you are misreading it, because the assignment is += not =.
The present coding is
/* increase delay by a random fraction between 1X and 2X */
status->cur_delay += (int) (status->cur_delay *
                            pg_prng_double(&pg_global_prng_state) + 0.5);
which looks fine to me. The +0.5 is so that the conversion to integer
rounds rather than truncating.
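For illustration, a self-contained sketch of how cur_delay grows under
that expression, with rand() standing in for pg_prng_double() and the
1000/1000000 usec bounds taken from s_lock.c:

#include <stdio.h>
#include <stdlib.h>

#define MIN_DELAY_USEC  1000
#define MAX_DELAY_USEC  1000000

int
main(void)
{
    int         cur_delay = MIN_DELAY_USEC;
    int         sleeps = 0;

    srand(12345);
    while (cur_delay <= MAX_DELAY_USEC)
    {
        /* uniform stand-in for pg_prng_double(): a value in [0, 1) */
        double      r = rand() / ((double) RAND_MAX + 1);

        /* the s_lock.c expression: new delay is 1X to 2X the old one */
        cur_delay += (int) (cur_delay * r + 0.5);
        printf("sleep %2d: cur_delay = %d usec\n", ++sleeps, cur_delay);
    }
    return 0;
}

With a fair generator the delay multiplies by roughly 1.5 per sleep on
average, so it climbs from 1 ms to the 1 s ceiling within about twenty
sleeps.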
In any case, I concur with Andres: if this behavior is anywhere near
critical then the right fix is to not be using spinlocks.
regards, tom lane
hi Tom,
First of all, thanks for your response. I did not misread it. The 0.5
is added to the result of the multiplication, which then goes through C
integer casting; the cast does not round up, it just drops the decimal
portion, so a small product still truncates to a zero increment.

status->cur_delay += (int) (status->cur_delay *
                            pg_prng_double(&pg_global_prng_state) +
                            0.5);

So, if the RNG generated 0.0001 and cur_delay = 1000, the result will be
1000 + (int) (1000 * 0.0001 + 0.5) = 1000 + (int) 0.6 = 1000 + 0 = 1000
<-- back to the same value
and there is no change after that starts happening, if the RNG is in the
low-Hamming-weight state. If the avalanche happens soon, it will correct
itself, but in the meantime we have a stuck-spinlock PANIC that could
bring down a production server.
-Parag
On Wed, Apr 10, 2024 at 11:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
No, I think you are misreading it, because the assignment is += not =.
The present coding is

/* increase delay by a random fraction between 1X and 2X */
status->cur_delay += (int) (status->cur_delay *
                            pg_prng_double(&pg_global_prng_state) + 0.5);

which looks fine to me. The +0.5 is so that the conversion to integer
rounds rather than truncating.
Oh, yeah ... right. But then why does the comment say that it's
increasing the delay by a random fraction between 1X and 2X? Isn't this
a random fraction between 0X and 1X? Or am I still full of garbage?
In any case, I concur with Andres: if this behavior is anywhere near
critical then the right fix is to not be using spinlocks.
It would certainly be interesting to know which spinlocks were at
issue, here. But it also seems to me that it's reasonable to be
unhappy about the possibility of this increasing cur_delay by exactly
0. Storms of spinlock contention can and do happen on real-world
production servers, and trying to guess which spinlocks might be
affected before it happens is difficult. Until we fully eliminate all
spinlocks from our code base, the theoretical possibility of such
storms remains, and if making sure that the increment here is >0
mitigates that, then I think we should. Mind you, I don't think that
the original poster has necessarily provided convincing or complete
evidence to justify a change here, but I find it odd that you and
Andres seem to want to dismiss it out of hand.
--
Robert Haas
EDB: http://www.enterprisedb.com
Parag Paul <parag.paul@gmail.com> writes:
So, if the RNG generated 0.0001 and cur_delay = 1000, the result will be
1000 + (int) (1000 * 0.0001 + 0.5) = 1000 + (int) 0.6 = 1000 + 0 = 1000
<-- back to the same value
Yes, with a sufficiently small RNG result, the sleep delay will not
increase that time through the loop. So what? It's supposed to be
a random amount of backoff, and I don't see why "random" shouldn't
occasionally include "zero". Why would requiring the delay to
increase improve matters at all?
Now, if the xoroshiro RNG is so bad that there's a strong likelihood
that multiple backends will sit on cur_delay = MIN_DELAY_USEC for
a long time, then yeah we could have a thundering-herd problem with
contention for the spinlock. You've provided no reason to think
that the probability of that is more than microscopic, though.
(If it is more than microscopic, then we are going to have
nonrandomness problems in a lot of other places, so I'd lean
towards revising the RNG not band-aiding s_lock. We've seen
nothing to suggest such problems, though.)
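For scale: with a fair generator, the increment rounds to zero only
when the draw falls below 0.5/cur_delay, which at cur_delay =
MIN_DELAY_USEC is one draw in 2000. A quick sketch of the odds of that
repeating:

#include <stdio.h>
#include <math.h>

int
main(void)
{
    /* P(increment == 0) = P(r < 0.5/cur_delay) with cur_delay = 1000 usec */
    double      p = 0.5 / 1000.0;

    for (int k = 1; k <= 4; k++)
        printf("P(%d zero increments in a row) = %g\n", k, pow(p, k));
    return 0;
}

Even four consecutive zero increments is roughly a one-in-1.6e13 event
for a fair generator, so the complaint really reduces to a claim about
the generator itself.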
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
Oh, yeah ... right. But then why does the comment say that it's
increasing the delay by a random fraction between 1X and 2X?
I think the comment is meant to say that the new delay value will be
1X to 2X the old value. If you want to suggest different phrasing,
feel free.
It would certainly be interesting to know which spinlocks were at
issue, here. But it also seems to me that it's reasonable to be
unhappy about the possibility of this increasing cur_delay by exactly
0.
As I said to Parag, I see exactly no reason to believe that that's a
problem, unless it happens *a lot*, like hundreds of times in a row.
If it does, that's an RNG problem not s_lock's fault. Now, I'm not
going to say that xoroshiro can't potentially do that, because I've
not studied it. But if it is likely to do that, I'd think we'd
have noticed the nonrandomness in other ways.
regards, tom lane
Hi,
On 2024-04-10 07:55:16 -0700, Parag Paul wrote:
This is a little bit more complex than that. The spinlocks are taken in
the LWLock (mutex) code, when the lock is not available right away.
The spinlock is taken to attach the current backend to the wait list of
the LWLock. This means that this cannot be controlled.
When the repro happens, it affects any mutex or LWLock code path, since
the low Hamming weight can cause problems by removing fairness from the
system.
Please provide a profile of a problematic workload. I'm fairly suspicious
of the claim that RNG unfairness is a real part of the problem here.
Greetings,
Andres Freund
Actually ... Parag mentioned that this was specifically about
lwlock.c's usage of spinlocks. It doesn't really use a spinlock,
but it does use s_lock.c's delay logic, and I think it's got the
usage pattern wrong:
while (true)
{
    /* always try once to acquire lock directly */
    old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
    if (!(old_state & LW_FLAG_LOCKED))
        break;              /* got lock */

    /* and then spin without atomic operations until lock is released */
    {
        SpinDelayStatus delayStatus;

        init_local_spin_delay(&delayStatus);

        while (old_state & LW_FLAG_LOCKED)
        {
            perform_spin_delay(&delayStatus);
            old_state = pg_atomic_read_u32(&lock->state);
        }
#ifdef LWLOCK_STATS
        delays += delayStatus.delays;
#endif
        finish_spin_delay(&delayStatus);
    }

    /*
     * Retry. The lock might obviously already be re-acquired by the time
     * we're attempting to get it again.
     */
}
I don't think it's correct to re-initialize the SpinDelayStatus each
time around the outer loop. That state should persist through the
entire acquire operation, as it does in a regular spinlock acquire.
As this stands, it resets the delay to minimum each time around the
outer loop, and I bet it is that behavior not the RNG that's to blame
for what he's seeing.
(One should still wonder what is the LWLock usage pattern that is
causing this spot to become so heavily contended.)
regards, tom lane
On Wed, Apr 10, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
As I said to Parag, I see exactly no reason to believe that that's a
problem, unless it happens *a lot*, like hundreds of times in a row.
If it does, that's an RNG problem not s_lock's fault. Now, I'm not
going to say that xoroshiro can't potentially do that, because I've
not studied it. But if it is likely to do that, I'd think we'd
have noticed the nonrandomness in other ways.
The blog post to which Parag linked includes this histogram as an
example of a low-Hamming-weight situation:
Reps | Value
-----+--------
84 | 0x00
10 | 0x2d
6 | 0xa0
6 | 0x68
6 | 0x40
6 | 0x02
5 | 0x0b
4 | 0x05
4 | 0x01
3 | 0xe1
3 | 0x5a
3 | 0x41
3 | 0x20
3 | 0x16
That's certainly a lot more zeroes than you'd like to see in 192 bytes
of output, but it's not hundreds in a row either.
Also, the blog post says "On the one hand, from a practical
perspective, having vastly, vastly more close repeats than it should
isn't likely to be an issue that users will ever detect in practice.
Xoshiro256's large state space means it is too big to fail any
statistical tests that look for close repeats." So your theory that we
have a bug elsewhere in the code seems like it might be the real
answer.
--
Robert Haas
EDB: http://www.enterprisedb.com
hi Tom,
Sorry for the delayed response. I was collecting the data from my
production servers.
The reason why this could be a problem is a flaw in the RNG when it is
in the enlarged low-Hamming-weight belt.
I attached an image here with the RNG outputs from 2 backends. I ran
our code for weeks and collected the values generated by the RNG over
many backends. The one in green (say backend id 600) stopped
fluctuating and produced only low (near 0) values for half an hour,
whereas the blue one (say backend 700) kept generating good values
across the range [0, 1).
During this period, backend 600 suffered and ended up in the
stuck-spinlock condition.
Extensive analysis here -
https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html
-Parag
hi Robert,
We are using xoroshiro128 and have not moved to the newer state of the
art.
We did see a lot of low values, as shown in my last message.
-Parag
On Wed, Apr 10, 2024 at 12:40 PM Parag Paul <parag.paul@gmail.com> wrote:
The reason why this could be a problem is a flaw in the RNG when it is in the enlarged low-Hamming-weight belt.
I attached an image here with the RNG outputs from 2 backends. I ran our code for weeks and collected the values generated by the RNG over many backends. The one in green (say backend id 600) stopped fluctuating and produced only low (near 0) values for half an hour, whereas the blue one (say backend 700) kept generating good values across the range [0, 1).
During this period, backend 600 suffered and ended up in the stuck-spinlock condition.
This is a very vague description of a test procedure. If you provide a
reproducible series of steps that causes a stuck spinlock, I imagine
everyone will be on board with doing something about it. But this
graph is not going to convince anyone of anything.
--
Robert Haas
EDB: http://www.enterprisedb.com
Yes, the probability of this happening is astronomical, but in
production with 128-core servers with 7000 max_connections, with
petabyte-scale data, this did repro 2 times in the last month. We had
to move to a local approach to manage our rate-limiting counters.
This is not reproducible very easily. I feel that we should at least
shield ourselves with the following change, so that we increase the
delay by at least 1000us every time. That degrades to a linear backoff
in the worst case, but it is better than no backoff.

status->cur_delay += Max(1000, (int) (status->cur_delay *
                                      pg_prng_double(&pg_global_prng_state) +
                                      0.5));
I wrote:
I don't think it's correct to re-initialize the SpinDelayStatus each
time around the outer loop. That state should persist through the
entire acquire operation, as it does in a regular spinlock acquire.
As this stands, it resets the delay to minimum each time around the
outer loop, and I bet it is that behavior not the RNG that's to blame
for what he's seeing.
After thinking about this some more, it is fairly clear that that *is*
a mistake that can cause a thundering-herd problem. Assume we have
two or more backends waiting in perform_spin_delay, and for whatever
reason the scheduler wakes them up simultaneously. They see the
LW_FLAG_LOCKED bit clear (because whoever had the lock when they went
to sleep is long gone) and iterate the outer loop. One gets the lock;
the rest go back to sleep. And they will all sleep exactly
MIN_DELAY_USEC, because they all reset their SpinDelayStatus.
Lather, rinse, repeat. If the s_lock code were being used as
intended, they would soon acquire different cur_delay settings;
but that never happens. That is not the RNG's fault.
So I think we need something like the attached.
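A rough sketch of the shape of the fix, assuming the delay state is
simply hoisted out of the retry loop (the actual attachment may differ):

SpinDelayStatus delayStatus;

init_local_spin_delay(&delayStatus);

while (true)
{
    /* always try once to acquire lock directly */
    old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
    if (!(old_state & LW_FLAG_LOCKED))
        break;                  /* got lock */

    /* spin without atomic operations until lock is released */
    while (old_state & LW_FLAG_LOCKED)
    {
        perform_spin_delay(&delayStatus);
        old_state = pg_atomic_read_u32(&lock->state);
    }
    /* retry; the lock might already have been re-acquired */
}
#ifdef LWLOCK_STATS
delays += delayStatus.delays;
#endif
finish_spin_delay(&delayStatus);

That way a backend that loops keeps its accumulated cur_delay instead
of restarting from MIN_DELAY_USEC on every pass.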
regards, tom lane
Attachments: fix-improper-spin-delay-logic.patch (text/x-diff, +27/-24)
Hi,
On 2024-04-10 09:48:42 -0700, Parag Paul wrote:
Yes, the probability of this happening is astronomical, but in
production with 128-core servers with 7000 max_connections, with
petabyte-scale data, this did repro 2 times in the last month. We had
to move to a local approach to manage our rate-limiting counters.
What version of PG was this? I think it's much more likely that you're
hitting a bug that caused a lot more contention inside lwlocks. That was fixed
for 16+ in a4adc31f690 on 2022-11-20, but only backpatched to 12-15 on
2024-01-18.
Greetings,
Andres Freund
Hi,
On 2024-04-10 12:28:10 -0400, Tom Lane wrote:
Actually ... Parag mentioned that this was specifically about
lwlock.c's usage of spinlocks. It doesn't really use a spinlock,
but it does use s_lock.c's delay logic, and I think it's got the
usage pattern wrong:

while (true)
{
    /* always try once to acquire lock directly */
    old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
    if (!(old_state & LW_FLAG_LOCKED))
        break;              /* got lock */

    /* and then spin without atomic operations until lock is released */
    {
        SpinDelayStatus delayStatus;

        init_local_spin_delay(&delayStatus);

        while (old_state & LW_FLAG_LOCKED)
        {
            perform_spin_delay(&delayStatus);
            old_state = pg_atomic_read_u32(&lock->state);
        }
#ifdef LWLOCK_STATS
        delays += delayStatus.delays;
#endif
        finish_spin_delay(&delayStatus);
    }

    /*
     * Retry. The lock might obviously already be re-acquired by the time
     * we're attempting to get it again.
     */
}

I don't think it's correct to re-initialize the SpinDelayStatus each
time around the outer loop. That state should persist through the
entire acquire operation, as it does in a regular spinlock acquire.
As this stands, it resets the delay to minimum each time around the
outer loop, and I bet it is that behavior not the RNG that's to blame
for what he's seeing.
Hm, yea, that's not right. Normally this shouldn't be heavily contended enough
to matter. I don't think just pulling out the spin delay would be the right
thing though, because that'd mean we'd initialize it even in the much,
much
more common case of there not being any contention. I think we'll have to add
a separate fetch_or before the outer loop.
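A hypothetical sketch of that shape (not a committed patch): do one
fetch_or up front, and set up the delay state only once contention is
actually observed:

old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
if (old_state & LW_FLAG_LOCKED)
{
    /* contended path only: initialize the delay state exactly once */
    SpinDelayStatus delayStatus;

    init_local_spin_delay(&delayStatus);
    while (true)
    {
        /* spin without atomic operations until the lock looks free */
        while (old_state & LW_FLAG_LOCKED)
        {
            perform_spin_delay(&delayStatus);
            old_state = pg_atomic_read_u32(&lock->state);
        }
        /* try again; someone else may have grabbed it first */
        old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
        if (!(old_state & LW_FLAG_LOCKED))
            break;              /* got lock */
    }
#ifdef LWLOCK_STATS
    delays += delayStatus.delays;
#endif
    finish_spin_delay(&delayStatus);
}

This keeps the uncontended fast path as cheap as it is today while
still letting cur_delay persist across retries.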
(One should still wonder what is the LWLock usage pattern that is
causing this spot to become so heavily contended.)
My suspicion is that it's a4adc31f690 which we only recently backpatched to
< 16.
Greetings,
Andres Freund