Condition variable live lock

Started by Thomas Munro, over 8 years ago · 31 messages · pgsql-hackers
#1 Thomas Munro
thomas.munro@gmail.com

Hi hackers,

While debugging a build farm assertion failure after commit 18042840,
and with the assumption that the problem is timing/scheduling
sensitive, I tried hammering the problem workload on a few different
machines and noticed that my slow 2-core test machine fairly regularly
got into a live lock state for tens to millions of milliseconds at a
time when there were 3+ active processes, in here:

int
ConditionVariableBroadcast(ConditionVariable *cv)
{
    int         nwoken = 0;

    /*
     * Let's just do this the dumbest way possible. We could try to dequeue
     * all the sleepers at once to save spinlock cycles, but it's a bit hard
     * to get that right in the face of possible sleep cancelations, and we
     * don't want to loop holding the mutex.
     */
    while (ConditionVariableSignal(cv))
        ++nwoken;

    return nwoken;
}

The problem is that another backend can be woken up, determine that it
would like to wait for the condition variable again, and then get
itself added to the back of the wait queue *before the above loop has
finished*, so this interprocess ping-pong isn't guaranteed to
terminate. It seems that we'll need something slightly smarter than
the above to avoid that.

I don't currently suspect this phenomenon of being responsible for the
problem I'm hunting, even though it occurs on the only machine I've
been able to reproduce my real problem on. AFAICT the problem
described in this email should deliver arbitrary numbers of spurious
wake-ups wasting arbitrary CPU time but cause no harm that would
affect program correctness. So I didn't try to write a patch to fix
that just yet. I think we should probably back patch a fix when we
have one though, because it could bite Parallel Index Scan in
REL_10_STABLE.

--
Thomas Munro
http://www.enterprisedb.com

#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#1)
Re: Condition variable live lock

On Fri, Dec 22, 2017 at 4:46 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

> while (ConditionVariableSignal(cv))
>     ++nwoken;
>
> The problem is that another backend can be woken up, determine that it
> would like to wait for the condition variable again, and then get
> itself added to the back of the wait queue *before the above loop has
> finished*, so this interprocess ping-pong isn't guaranteed to
> terminate. It seems that we'll need something slightly smarter than
> the above to avoid that.

Here is one way to fix it: track the wait queue size and use that
number to limit the wakeup loop. See attached.

That's unbackpatchable though, because it changes the size of struct
ConditionVariable, potentially breaking extensions compiled against an
earlier point release. Maybe this problem won't really cause problems
in v10 anyway? It requires a particular interaction pattern that
barrier.c produces but more typical client code might not: the awoken
backends keep re-adding themselves because they're waiting for
everyone (including the waker) to do something, but the waker is stuck
in that broadcast loop.

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

fix-cv-livelock.patch (application/octet-stream) +22/-3
#3 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#2)
Re: Condition variable live lock

On 2017-12-29 12:16:20 +1300, Thomas Munro wrote:

> Here is one way to fix it: track the wait queue size and use that
> number to limit the wakeup loop. See attached.
>
> That's unbackpatchable though, because it changes the size of struct
> ConditionVariable, potentially breaking extensions compiled against an
> earlier point release. Maybe this problem won't really cause problems
> in v10 anyway? It requires a particular interaction pattern that
> barrier.c produces but more typical client code might not: the awoken
> backends keep re-adding themselves because they're waiting for
> everyone (including the waker) to do something, but the waker is stuck
> in that broadcast loop.

Hm, I'm not quite convinced by this approach. Partially because of the
backpatch issue you mention, partially because using the list length as
a limit doesn't seem quite nice.

Given that the proclist_contains() checks in condition_variable.c are
already racy, I think it might be feasible to collect all procnos to
signal while holding the spinlock, and then signal all of them in one
go.

Obviously it'd be nicer to not hold a spinlock while looping, but that
seems like something we can't fix in the back branches. [insert rant
about never using spinlocks unless there's very very clear convincing
reasons].

- Andres

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#3)
Re: Condition variable live lock

On Fri, Dec 29, 2017 at 2:38 PM, Andres Freund <andres@anarazel.de> wrote:

> Hm, I'm not quite convinced by this approach. Partially because of the
> backpatch issue you mention, partially because using the list length as
> a limit doesn't seem quite nice.

Seems OK to me. Certainly better than your competing proposal.

> Given that the proclist_contains() checks in condition_variable.c are
> already racy, I think it might be feasible to collect all procnos to
> signal while holding the spinlock, and then signal all of them in one
> go.

That doesn't seem very nice at all. Not only does it violate the
coding rule against looping while holding a spinlock, but it seems
that it would require allocating memory while holding one, which is a
non-starter.

> Obviously it'd be nicer to not hold a spinlock while looping, but that
> seems like something we can't fix in the back branches. [insert rant
> about never using spinlocks unless there's very very clear convincing
> reasons].

I don't think that's a coding rule that I'd be prepared to endorse.
We've routinely used spinlocks for years in cases where the critical
section was very short, just to keep the overhead down. I think it
works fine in that case, although I admit that I failed to appreciate
how unpleasant the livelock possibilities were in this case.

It's not clear to me that we entirely need a back-patchable fix for
this. It could be that parallel index scan can have the same issue,
but I'm not aware of any user complaints. Parallel bitmap heap scan
only ever waits once, so it's probably fine. If we do need a
back-patchable fix, I suppose the slock_t mutex could be replaced by a
pg_atomic_uint32 state. I think that would avoid changing the size of
the structure on common platforms, though obscure systems with
spinlocks > 4 bytes might be affected.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#4)
Re: Condition variable live lock

On 2018-01-04 12:39:47 -0500, Robert Haas wrote:

>> Given that the proclist_contains() checks in condition_variable.c are
>> already racy, I think it might be feasible to collect all procnos to
>> signal while holding the spinlock, and then signal all of them in one
>> go.
>
> That doesn't seem very nice at all. Not only does it violate the
> coding rule against looping while holding a spinlock, but it seems
> that it would require allocating memory while holding one, which is a
> non-starter.

We could just use a sufficiently sized buffer beforehand. There's an
obvious upper boundary, so that shouldn't be a big issue.

>> Obviously it'd be nicer to not hold a spinlock while looping, but that
>> seems like something we can't fix in the back branches. [insert rant
>> about never using spinlocks unless there's very very clear convincing
>> reasons].
>
> I don't think that's a coding rule that I'd be prepared to endorse.
> We've routinely used spinlocks for years in cases where the critical
> section was very short, just to keep the overhead down.

The problem is that due to the contention handling they really don't
keep the overhead that low unless you're absolutely maximizing for a
low number of cycles and have very little contention, which isn't
actually common. I think part of the conventional wisdom about when
to use spinlocks vs. lwlocks went out of the window once we got
better-scaling lwlocks.

> It's not clear to me that we entirely need a back-patchable fix for
> this. It could be that parallel index scan can have the same issue,
> but I'm not aware of any user complaints.

I don't think many users are going to be able to diagnose this one, and
it's probably not easily diagnosable even if they complain about
performance.

Greetings,

Andres Freund

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
Re: Condition variable live lock

Andres Freund <andres@anarazel.de> writes:

> On 2018-01-04 12:39:47 -0500, Robert Haas wrote:
>>> Given that the proclist_contains() checks in condition_variable.c are
>>> already racy, I think it might be feasible to collect all procnos to
>>> signal while holding the spinlock, and then signal all of them in one
>>> go.
>>
>> That doesn't seem very nice at all. Not only does it violate the
>> coding rule against looping while holding a spinlock, but it seems
>> that it would require allocating memory while holding one, which is a
>> non-starter.
>
> We could just use a sufficiently sized buffer beforehand. There's an
> obvious upper boundary, so that shouldn't be a big issue.

I share Robert's discomfort with that solution, but it seems to me there
might be a better way. The attached patch uses our own cvWaitLink as a
sentinel to detect when we've woken everybody who was on the wait list
before we arrived. That gives exactly the desired semantics, not just an
approximation to them.

Now, the limitation with this is that we can't be waiting for any *other*
condition variable, because then we'd be trashing our state about that
variable. As coded, we can't be waiting for the target CV either, but
that case could actually be handled if we needed to, as per the comment.
I do not know if this is likely to be a problematic limitation
... discuss. (The patch does survive check-world, FWIW.)

regards, tom lane

Attachments:

remove-live-lock-in-CV-broadcast.patch (text/x-diff) +55/-53
#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#6)
Re: Condition variable live lock

On Fri, Jan 5, 2018 at 5:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I share Robert's discomfort with that solution, but it seems to me there
> might be a better way. The attached patch uses our own cvWaitLink as a
> sentinel to detect when we've woken everybody who was on the wait list
> before we arrived. That gives exactly the desired semantics, not just an
> approximation to them.

Very clever. It works correctly for my test case.

> Now, the limitation with this is that we can't be waiting for any *other*
> condition variable, because then we'd be trashing our state about that
> variable. As coded, we can't be waiting for the target CV either, but
> that case could actually be handled if we needed to, as per the comment.
> I do not know if this is likely to be a problematic limitation
> ... discuss. (The patch does survive check-world, FWIW.)

I think that restriction is probably OK. Even if you have some kind
of chain of CVs where you get woken up, check your interesting
condition and discover that it's now true, so you exit your loop and
immediately want to broadcast a signal to some other CV, you'd simply
have to make sure that you put ConditionVariableCancelSleep() before
ConditionVariableBroadcast():

    ConditionVariablePrepareToSleep(cv1);
    while (condition for which we are waiting is not true)
        ConditionVariableSleep(cv1, wait_event_info);
    ConditionVariableCancelSleep();
    ConditionVariableBroadcast(cv2);

It would only be a problem if you are interested in broadcasting to
cv2 when you've been woken up and the condition is *still not true*,
that is, when you've been spuriously woken. But why would anyone want
to forward spurious wakeups to another CV?

But if that seems too arbitrary, one way to lift the restriction would
be to teach ConditionVariableBroadcast() to call
ConditionVariableCancelSleep() if cv_sleep_target is non-NULL where
you have the current assertion. Code that is still waiting for a CV
must be in a loop that will eventually re-add it in
ConditionVariableSleep(), and it won't miss any signals that it can't
afford to miss because the first call to ConditionVariableSleep() will
return immediately so the caller will recheck its condition.

--
Thomas Munro
http://www.enterprisedb.com

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#7)
Re: Condition variable live lock

Thomas Munro <thomas.munro@enterprisedb.com> writes:

> On Fri, Jan 5, 2018 at 5:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, the limitation with this is that we can't be waiting for any *other*
>> condition variable, because then we'd be trashing our state about that
>> variable. As coded, we can't be waiting for the target CV either, but
>> that case could actually be handled if we needed to, as per the comment.
>> I do not know if this is likely to be a problematic limitation
>> ... discuss. (The patch does survive check-world, FWIW.)
>
> ... one way to lift the restriction would
> be to teach ConditionVariableBroadcast() to call
> ConditionVariableCancelSleep() if cv_sleep_target is non-NULL where
> you have the current assertion. Code that is still waiting for a CV
> must be in a loop that will eventually re-add it in
> ConditionVariableSleep(), and it won't miss any signals that it can't
> afford to miss because the first call to ConditionVariableSleep() will
> return immediately so the caller will recheck its condition.

Oh, of course, very simple.

I thought of another possible issue, though. In the situation where
someone else has removed our sentinel (presumably, by issuing
ConditionVariableSignal just before we were about to remove the
sentinel), my patch assumes we can just do nothing. But it seems
like that amounts to losing one signal. Whoever the someone else
was probably expected to awaken a waiter, and now that won't happen.
Should we rejigger the logic so that it awakens one additional waiter
(if there is one) after detecting that someone else has removed the
sentinel? Obviously, this trades a risk of loss of wakeup for a risk
of spurious wakeup, but presumably the latter is something we can
cope with.

regards, tom lane

#9 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#8)
Re: Condition variable live lock

On Fri, Jan 5, 2018 at 7:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I thought of another possible issue, though. In the situation where
> someone else has removed our sentinel (presumably, by issuing
> ConditionVariableSignal just before we were about to remove the
> sentinel), my patch assumes we can just do nothing. But it seems
> like that amounts to losing one signal. Whoever the someone else
> was probably expected to awaken a waiter, and now that won't happen.

Yeah, that's bad.

> Should we rejigger the logic so that it awakens one additional waiter
> (if there is one) after detecting that someone else has removed the
> sentinel? Obviously, this trades a risk of loss of wakeup for a risk
> of spurious wakeup, but presumably the latter is something we can
> cope with.

One detail is that the caller of ConditionVariableSignal() got a true
return value when it took out the sentinel (indicating that someone
received the signal), and now when you call ConditionVariableSignal()
because !aminlist there may be no one there. I'm not sure if that's a
problem. For comparison, pthread_cond_signal() doesn't tell you if
you actually signalled anyone. Maybe the only reason we have that
return code is so that ConditionVariableBroadcast() can use it the way
it does in master...

An alternative would be to mark sentinel entries somehow so that
signallers can detect them and signal again, but that's not
backpatchable.

--
Thomas Munro
http://www.enterprisedb.com

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#9)
Re: Condition variable live lock

Thomas Munro <thomas.munro@enterprisedb.com> writes:

> On Fri, Jan 5, 2018 at 7:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Should we rejigger the logic so that it awakens one additional waiter
>> (if there is one) after detecting that someone else has removed the
>> sentinel? Obviously, this trades a risk of loss of wakeup for a risk
>> of spurious wakeup, but presumably the latter is something we can
>> cope with.
>
> One detail is that the caller of ConditionVariableSignal() got a true
> return value when it took out the sentinel (indicating that someone
> received the signal), and now when you call ConditionVariableSignal()
> because !aminlist there may be no one there. I'm not sure if that's a
> problem. For comparison, pthread_cond_signal() doesn't tell you if
> you actually signalled anyone. Maybe the only reason we have that
> return code is so that ConditionVariableBroadcast() can use it the way
> it does in master...

Indeed, it looks like no other caller is paying attention to the result.
We could live with the uncertainty in the back branches, and redefine
ConditionVariableSignal as returning void in master.

regards, tom lane

#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#10)
Re: Condition variable live lock

On Fri, Jan 5, 2018 at 7:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Indeed, it looks like no other caller is paying attention to the result.
> We could live with the uncertainty in the back branches, and redefine
> ConditionVariableSignal as returning void in master.

+1

Could we install the sentinel and pop the first entry at the same
time, so that we're not adding an extra spinlock acquire/release?

--
Thomas Munro
http://www.enterprisedb.com

#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#8)
Re: Condition variable live lock

Tom Lane wrote:

> Obviously, this trades a risk of loss of wakeup for a risk
> of spurious wakeup, but presumably the latter is something we can
> cope with.

I wonder if it'd be useful to have a test mode for condition variables
in which spurious wakeups happen randomly, to verify that every user
is prepared to cope correctly.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#11)
Re: Condition variable live lock

Thomas Munro <thomas.munro@enterprisedb.com> writes:

> Could we install the sentinel and pop the first entry at the same
> time, so that we're not adding an extra spinlock acquire/release?

Hm, maybe. Other ideas in that space:

* if queue is empty when we first acquire the spinlock, we don't
have to do anything at all.

* if queue is empty after we pop the first entry, we needn't bother
installing our sentinel, just signal that proc and we're done.

It's a question of how complicated you're willing to make this
logic, and whether you trust that we'll be able to test all the
code paths.

regards, tom lane

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#13)
Re: Condition variable live lock

I wrote:

> It's a question of how complicated you're willing to make this
> logic, and whether you trust that we'll be able to test all the
> code paths.

Attached is a patch incorporating all the ideas mentioned in this thread,
except that I think in HEAD we should change both ConditionVariableSignal
and ConditionVariableBroadcast to return void rather than a possibly
misleading wakeup count. This could be back-patched as is, though.

As I feared, the existing regression tests are not really adequate for
this: gcov testing shows that the sentinel-inserting code path is
never entered, meaning ConditionVariableBroadcast never sees more
than one waiter. What's more, it's now also apparent that no outside
caller of ConditionVariableSignal ever actually awakens anything.
So I think it'd be a good idea to expand the regression tests if we
can do so cheaply. Anybody have ideas about that? Perhaps a new
module under src/test/modules would be the best way? Alternatively,
we could drop some of the optimization ideas.

BTW, at least on gaur, this does nothing for the runtime of the join
test, meaning I'd still like to see some effort put into reducing that.

regards, tom lane

Attachments:

remove-live-lock-in-CV-broadcast-2.patch (text/x-diff) +82/-50
#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#14)
Re: Condition variable live lock

I wrote:

> As I feared, the existing regression tests are not really adequate for
> this: gcov testing shows that the sentinel-inserting code path is
> never entered, meaning ConditionVariableBroadcast never sees more
> than one waiter.

Hmm ... adding tracing log printouts in BarrierArriveAndWait disproves
that: the regression tests do, repeatedly, call ConditionVariableBroadcast
when there are two waiters to be woken (and you can get it to be more if
you raise the number of parallel workers specified in the PHJ tests).
It just doesn't happen with --enable-coverage. Heisenberg wins again!

I am not sure why coverage tracking changes the behavior so much, but
it's clear from my results that if you change all the PHJ test cases in
join.sql to use 4 parallel workers, then you get plenty of barrier release
events with 4 or 5 barrier participants --- but running the identical test
under --enable-coverage results in only a very small number of releases
with even as many as 2 participants, let alone more. Perhaps the PHJ test
cases don't run long enough to let slow-starting workers join in?

Anyway, that may or may not indicate something we should tune at a higher
level, but I'm now satisfied that the patch as presented works and does
get tested by our existing tests. So barring objection I'll commit that
shortly.

There are some other things I don't like about condition_variable.c:

* I think the Asserts in ConditionVariablePrepareToSleep and
ConditionVariableSleep ought to be replaced by full-fledged test and
elog(ERROR), so that they are enforced even in non-assert builds.
I don't have a lot of confidence that corner cases that could violate
those usage restrictions would get caught during developer testing.
Nor do I see an argument that we can't afford the cycles to check.

* ConditionVariablePrepareToSleep needs to be rearranged so that failure
to create the WaitEventSet doesn't leave us in an invalid state.

* A lot of the comments could be improved, IMHO.

Barring objection I'll go deal with those things, too.

regards, tom lane

#16 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: Condition variable live lock

On Fri, Jan 5, 2018 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> * I think the Asserts in ConditionVariablePrepareToSleep and
> ConditionVariableSleep ought to be replaced by full-fledged test and
> elog(ERROR), so that they are enforced even in non-assert builds.
> I don't have a lot of confidence that corner cases that could violate
> those usage restrictions would get caught during developer testing.
> Nor do I see an argument that we can't afford the cycles to check.

I think those usage restrictions are pretty basic, and I think this
might be used in some places where performance does matter. So -1
from me for this change.

> * ConditionVariablePrepareToSleep needs to be rearranged so that failure
> to create the WaitEventSet doesn't leave us in an invalid state.

+1.

> * A lot of the comments could be improved, IMHO.

No opinion without seeing what you propose to change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#16)
Re: Condition variable live lock

Robert Haas <robertmhaas@gmail.com> writes:

> On Fri, Jan 5, 2018 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * I think the Asserts in ConditionVariablePrepareToSleep and
>> ConditionVariableSleep ought to be replaced by full-fledged test and
>> elog(ERROR), so that they are enforced even in non-assert builds.
>> I don't have a lot of confidence that corner cases that could violate
>> those usage restrictions would get caught during developer testing.
>> Nor do I see an argument that we can't afford the cycles to check.
>
> I think those usage restrictions are pretty basic, and I think this
> might be used in some places where performance does matter. So -1
> from me for this change.

Really? We're about to do a process sleep, and we can't afford a single
test and branch to make sure we're doing it sanely?

>> * A lot of the comments could be improved, IMHO.
>
> No opinion without seeing what you propose to change.

OK, will put out a proposal.

regards, tom lane

#18 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#14)
Re: Condition variable live lock

On Sat, Jan 6, 2018 at 6:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> As I feared, the existing regression tests are not really adequate for
> this: gcov testing shows that the sentinel-inserting code path is
> never entered, meaning ConditionVariableBroadcast never sees more
> than one waiter. What's more, it's now also apparent that no outside
> caller of ConditionVariableSignal ever actually awakens anything.
> So I think it'd be a good idea to expand the regression tests if we
> can do so cheaply. Anybody have ideas about that? Perhaps a new
> module under src/test/modules would be the best way? Alternatively,
> we could drop some of the optimization ideas.

I think I might have a suitable test module already. I'll tidy it up
and propose it in a few days.

> BTW, at least on gaur, this does nothing for the runtime of the join
> test, meaning I'd still like to see some effort put into reducing that.

Will do.

--
Thomas Munro
http://www.enterprisedb.com

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
Re: Condition variable live lock

I wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>> No opinion without seeing what you propose to change.
>
> OK, will put out a proposal.

I began with the intention of making no non-cosmetic changes, but then
I started to wonder why ConditionVariablePrepareToSleep bothers with a
!proclist_contains test, when the calling process surely ought not be
in the list -- or any other list -- since it wasn't prepared to sleep.
And that led me down a different rabbit hole ending in the conclusion
that proclist.h could stand some improvements too. I do not like the
fact that it's impossible to tell whether a proclist_node is in any
proclist or not. Initially, a proclist_node contains zeroes which is
a distinguishable state, but proclist_delete_offset resets it to
next = prev = INVALID_PGPROCNO which looks the same as a node that's in a
singleton list. We should have it reset to the initial state of zeroes
instead, and then we can add assertions to proclist_push_xxx that the
supplied node is not already in a list. Hence, I propose the first
attached patch which tightens things up in proclist.h and then removes
the !proclist_contains test in ConditionVariablePrepareToSleep; the
assertion in proclist_push_tail supersedes that.

The second attached patch is the cosmetic changes I want to make in
condition_variable.c/.h.

I still think that we ought to change the Asserts on cv_sleep_target in
ConditionVariablePrepareToSleep and ConditionVariableSleep to be full
test-and-elog tests. Those are checking a global correctness property
("global" meaning "interactions between completely unrelated modules can
break this"), and they'd be extremely cheap compared to the rest of what
those functions are doing, so I think insisting that they be Asserts is
penny wise and pound foolish. Anybody besides Robert want to vote on
that?

Another loose end that I'm seeing here is that while a process waiting on
a condition variable will respond to a cancel or die interrupt, it will
not notice postmaster death. This seems unwise to me. I think we should
adjust the WaitLatch call to include WL_POSTMASTER_DEATH as a wake
condition and just do a summary proc_exit(1) if it sees that. I'd even
argue that that is a back-patchable bug fix.

regards, tom lane

Attachments:

tighten-proclist-assertions.patch (text/x-diff) +43/-27
cv-cosmetic-changes.patch (text/x-diff) +65/-45
#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#19)
Re: Condition variable live lock

I wrote:

> I still think that we ought to change the Asserts on cv_sleep_target in
> ConditionVariablePrepareToSleep and ConditionVariableSleep to be full
> test-and-elog tests. Those are checking a global correctness property
> ("global" meaning "interactions between completely unrelated modules can
> break this"), and they'd be extremely cheap compared to the rest of what
> those functions are doing, so I think insisting that they be Asserts is
> penny wise and pound foolish.

Actually ... perhaps a better design would be to have
ConditionVariable[PrepareTo]Sleep auto-cancel any prepared sleep for
a different condition variable, analogously to what we just did in
ConditionVariableBroadcast, on the same theory that whenever control
returns to the other CV wait loop it can re-establish the relevant
state easily enough. I have to think that if the use of CVs grows
much, the existing restriction is going to become untenable anyway,
so why not just get rid of it?

regards, tom lane

#21 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#20)

#22 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#21)

#23 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#22)

#24 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#23)

#25 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#24)

#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#25)

#27 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#21)

#28 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)

#29 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#24)

#30 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#24)

#31 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#30)