Race condition in InvalidateObsoleteReplicationSlots()
Hi,
I was looking at InvalidateObsoleteReplicationSlots() while reviewing /
polishing the logical decoding on standby patch, which led me to notice
what I think is a race in InvalidateObsoleteReplicationSlots()
(ResolveRecoveryConflictWithLogicalSlots has a closely related one).
void
InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
{
...
LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
for (int i = 0; i < max_replication_slots; i++)
{
...
if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
continue;
LWLockRelease(ReplicationSlotControlLock);
...
for (;;)
{
...
wspid = ReplicationSlotAcquireInternal(s, NULL, SAB_Inquire);
...
SpinLockAcquire(&s->mutex);
s->data.invalidated_at = s->data.restart_lsn;
s->data.restart_lsn = InvalidXLogRecPtr;
SpinLockRelease(&s->mutex);
...
As far as I can tell there's no guarantee that the slot wasn't concurrently
dropped and another replication slot created at the same offset in
ReplicationSlotCtl->replication_slots, which we would then promptly
invalidate, even though the new slot doesn't actually need to be invalidated.
Note that this is different from the race mentioned in a comment:
/*
* Signal to terminate the process that owns the slot.
*
* There is the race condition where other process may own
* the slot after the process using it was terminated and before
* this process owns it. To handle this case, we signal again
* if the PID of the owning process is changed than the last.
*
* XXX This logic assumes that the same PID is not reused
* very quickly.
*/
It's one thing to terminate a connection erroneously; permanently breaking a
replica by invalidating the wrong slot is, in my opinion, something else entirely.
Interestingly this problem seems to have been present both in
commit c6550776394e25c1620bc8258427c8f1d448080d
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: 2020-04-07 18:35:00 -0400
Allow users to limit storage reserved by replication slots
commit f9e9704f09daf882f5a1cf1fbe3f5a3150ae2bb9
Author: Fujii Masao <fujii@postgresql.org>
Date: 2020-06-19 17:15:52 +0900
Fix issues in invalidation of obsolete replication slots.
I think this can be solved in two different ways:
1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new
slot in the to-be-obsoleted-slot's place.
2) Atomically check whether the slot needs to be invalidated and try to
acquire if needed. Don't release ReplicationSlotControlLock between those
two steps. Signal the owner to release the slot iff we couldn't acquire the
slot. In the latter case wait and then recheck if the slot still needs to
be dropped.
To me 2) seems better, because we then can also be sure that the slot still
needs to be obsoleted, rather than potentially doing so unnecessarily.
It looks to me like several of the problems here stem from trying to reuse
code from ReplicationSlotAcquireInternal() (which before this was just named
ReplicationSlotAcquire()). I don't think that makes sense, because cases like
this want to check if a condition is true, and acquire it only if so.
IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
except that a different condition is checked, and the if (active_pid) case
needs to prepare a condition variable, signal the owner and then wait on the
condition variable, to restart after.
Greetings,
Andres Freund
Hi,
On 2021-04-07 17:10:37 -0700, Andres Freund wrote:
I think this can be solved in two different ways:
1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new
slot in the to-be-obsoleted-slot's place.
2) Atomically check whether the slot needs to be invalidated and try to
acquire if needed. Don't release ReplicationSlotControlLock between those
two steps. Signal the owner to release the slot iff we couldn't acquire the
slot. In the latter case wait and then recheck if the slot still needs to
be dropped.
To me 2) seems better, because we then can also be sure that the slot still
needs to be obsoleted, rather than potentially doing so unnecessarily.
It looks to me like several of the problems here stem from trying to reuse
code from ReplicationSlotAcquireInternal() (which before this was just named
ReplicationSlotAcquire()). I don't think that makes sense, because cases like
this want to check if a condition is true, and acquire it only if so.
IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
except that a different condition is checked, and the if (active_pid) case
needs to prepare a condition variable, signal the owner and then wait on the
condition variable, to restart after.
I'm also confused by the use of ConditionVariableTimedSleep(timeout =
10). Why do we need a timed sleep here in the first place? And why with
such a short sleep?
I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); -
but is aware it's running in checkpointer. I don't think CFI does much
there? If we are worried about needing to check for interrupts, more
work is needed.
Sketch for a fix attached. I did leave the odd
ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's
there...
After this I don't see a reason to have SAB_Inquire - as far as I can
tell it's practically impossible to use without race conditions? Except
for raising an error - which is "builtin"...
Greetings,
Andres Freund
Attachments:
On Thu, Apr 8, 2021 at 7:39 AM Andres Freund <andres@anarazel.de> wrote:
On 2021-04-07 17:10:37 -0700, Andres Freund wrote:
I think this can be solved in two different ways:
1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new
slot in the to-be-obsoleted-slot's place.
2) Atomically check whether the slot needs to be invalidated and try to
acquire if needed. Don't release ReplicationSlotControlLock between those
two steps. Signal the owner to release the slot iff we couldn't acquire the
slot. In the latter case wait and then recheck if the slot still needs to
be dropped.
To me 2) seems better, because we then can also be sure that the slot still
needs to be obsoleted, rather than potentially doing so unnecessarily.
+1.
It looks to me like several of the problems here stem from trying to reuse
code from ReplicationSlotAcquireInternal() (which before this was just named
ReplicationSlotAcquire()). I don't think that makes sense, because cases like
this want to check if a condition is true, and acquire it only if so.
IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
except that a different condition is checked, and the if (active_pid) case
needs to prepare a condition variable, signal the owner and then wait on the
condition variable, to restart after.
I'm also confused by the use of ConditionVariableTimedSleep(timeout =
10). Why do we need a timed sleep here in the first place? And why with
such a short sleep?
I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); -
but is aware it's running in checkpointer. I don't think CFI does much
there? If we are worried about needing to check for interrupts, more
work is needed.
Sketch for a fix attached. I did leave the odd
ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's
there...
I haven't tested the patch but I couldn't spot any problems while
reading it. A minor point, don't we need to use
ConditionVariableCancelSleep() at some point after doing
ConditionVariableTimedSleep?
--
With Regards,
Amit Kapila.
On 2021-Apr-07, Andres Freund wrote:
I'm also confused by the use of ConditionVariableTimedSleep(timeout =
10). Why do we need a timed sleep here in the first place? And why with
such a short sleep?
I was scared of the possibility that a process would not set the CV for
whatever reason, causing checkpointing to become stuck. Maybe that's
misguided thinking if CVs are reliable enough.
I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); -
but is aware it's running in checkpointer. I don't think CFI does much
there? If we are worried about needing to check for interrupts, more
work is needed.
Hmm .. yeah, doing CFI seems pretty useless. I think that should just
be removed. If checkpointer gets USR2 (request for shutdown) it's not
going to affect the behavior of CFI anyway.
I attach a couple of changes to your 0001. It's all cosmetic; what
looks not so cosmetic is the change of "continue" to "break" in the helper
routine; with "continue", if !s->in_use, we'd loop infinitely. The other routine
already checks that before calling the helper; since you hold
ReplicationSlotControlLock at that point, it should not be possible to
drop it in between. Anyway, it's a trivial change to make, so it should
be correct.
I also added a "continue" at the bottom of one block; currently that
doesn't change any behavior, but if we add code at the other block, it
might not be what's intended.
After this I don't see a reason to have SAB_Inquire - as far as I can
tell it's practically impossible to use without race conditions? Except
for raising an error - which is "builtin"...
Hmm, interesting ... If not needed, yeah let's get rid of that.
Are you getting this set pushed, or would you like me to handle it?
(There seems to be some minor conflict in 13)
--
Álvaro Herrera               Valdivia, Chile
"I ask that they give me the Nobel for humanitarian reasons" (Nicanor Parra)
Attachments:
Hi,
On 2021-04-08 17:03:41 +0530, Amit Kapila wrote:
I haven't tested the patch but I couldn't spot any problems while
reading it. A minor point, don't we need to use
ConditionVariableCancelSleep() at some point after doing
ConditionVariableTimedSleep?
It's not really necessary - unless the CV could get deallocated as part
of dynamic shared memory or such.
Greetings,
Andres Freund
Hi,
On 2021-04-29 13:28:20 -0400, Álvaro Herrera wrote:
On 2021-Apr-07, Andres Freund wrote:
I'm also confused by the use of ConditionVariableTimedSleep(timeout =
10). Why do we need a timed sleep here in the first place? And why with
such a short sleep?
I was scared of the possibility that a process would not set the CV for
whatever reason, causing checkpointing to become stuck. Maybe that's
misguided thinking if CVs are reliable enough.
They better be, or we have bigger problems. And if it's an escape hatch
we surely ought not to use 10ms as the timeout. That's an appropriate
time for something *not* using condition variables...
I attach a couple of changes to your 0001. It's all cosmetic; what
looks not so cosmetic is the change of "continue" to "break" in the helper
routine; with "continue", if !s->in_use, we'd loop infinitely. The other routine
already checks that before calling the helper; since you hold
ReplicationSlotControlLock at that point, it should not be possible to
drop it in between. Anyway, it's a trivial change to make, so it should
be correct.
I also added a "continue" at the bottom of one block; currently that
doesn't change any behavior, but if we add code at the other block, it
might not be what's intended.
Seems sane.
Are you getting this set pushed, or would you like me to handle it?
(There seems to be some minor conflict in 13)
I'd be quite happy for you to handle it...
Greetings,
Andres Freund
Here's a version that I feel is committable (0001). There was an issue
when returning from the inner loop, if in a previous iteration we had
released the lock. In that case we need to return with the lock not
held, so that the caller can acquire it again, but we weren't. This seems
pretty hard to hit in practice (I suppose somebody needs to destroy the
slot just as checkpointer killed the walsender, but before checkpointer
marks it as its own) ... but if it did happen, I think checkpointer
would block with no recourse. Also added some comments and slightly
restructured the code.
There are plenty of conflicts in pg13, but it's all easy to handle.
I wrote a test (0002) to cover the case of signalling a walsender, which
is currently not covered (we only deal with the case of a standby that's
not running). There are some sharp edges in this code -- I had to make
it use background_psql() to send a CHECKPOINT, which hangs, because I
previously sent a SIGSTOP to the walreceiver. Maybe there's a better
way to achieve a walreceiver that remains connected but doesn't consume
input from the primary, but I don't know what it is. Anyway, the code
becomes covered with this. I would like to at least see it in master,
to gather some reactions from buildfarm.
--
Álvaro Herrera               Valdivia, Chile
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.
Attachments:
On 2021-Jun-10, Álvaro Herrera wrote:
I wrote a test (0002) to cover the case of signalling a walsender, which
is currently not covered (we only deal with the case of a standby that's
not running). There are some sharp edges in this code -- I had to make
it use background_psql() to send a CHECKPOINT, which hangs, because I
previously sent a SIGSTOP to the walreceiver. Maybe there's a better
way to achieve a walreceiver that remains connected but doesn't consume
input from the primary, but I don't know what it is. Anyway, the code
becomes covered with this. I would like to at least see it in master,
to gather some reactions from buildfarm.
Small fixup to the test one, so that skipping it on Windows works
correctly.
--
Álvaro Herrera    39°49'30"S 73°17'W
"I'm going to wipe out all the humans / the humans I shall wipe out
I'm going to wipe them all out (repeat) / all the humans I shall wipe out, I shall!" (Bender)
Attachments:
On 2021-Jun-10, Álvaro Herrera wrote:
Here's a version that I feel is committable (0001). There was an issue
when returning from the inner loop, if in a previous iteration we had
released the lock. In that case we need to return with the lock not
held, so that the caller can acquire it again, but we weren't. This seems
pretty hard to hit in practice (I suppose somebody needs to destroy the
slot just as checkpointer killed the walsender, but before checkpointer
marks it as its own) ... but if it did happen, I think checkpointer
would block with no recourse. Also added some comments and slightly
restructured the code.
There are plenty of conflicts in pg13, but it's all easy to handle.
Pushed, with additional minor changes.
I wrote a test (0002) to cover the case of signalling a walsender, which
is currently not covered (we only deal with the case of a standby that's
not running). There are some sharp edges in this code -- I had to make
it use background_psql() to send a CHECKPOINT, which hangs, because I
previously sent a SIGSTOP to the walreceiver. Maybe there's a better
way to achieve a walreceiver that remains connected but doesn't consume
input from the primary, but I don't know what it is. Anyway, the code
becomes covered with this. I would like to at least see it in master,
to gather some reactions from buildfarm.
I tried hard to make this stable, but it just isn't (it works fine for one
thousand runs, then I grab some coffee, run it once more, and that one
fails; why is not clear to me). Attached is the last one I have,
in case somebody wants to make it better. Maybe there's some completely
different approach that works better, but I'm out of ideas for now.
--
Álvaro Herrera               Valdivia, Chile
"Experience tells us that man has peeled potatoes millions of times,
but one had to admit the possibility that in one case in millions,
the potatoes would peel the man" (Ijon Tichy)
Attachments:
On 2021-Apr-07, Andres Freund wrote:
After this I don't see a reason to have SAB_Inquire - as far as I can
tell it's practically impossible to use without race conditions? Except
for raising an error - which is "builtin"...
Pushed 0002.
Thanks!
--
Álvaro Herrera    39°49'30"S 73°17'W
"The person who did not want to sin / was obliged to sit
on hard, upright chairs / devoid, to be sure,
of soft mitigations" (Patricio Vogel)
On 2021-06-11 15:52:21 -0400, Álvaro Herrera wrote:
On 2021-Apr-07, Andres Freund wrote:
After this I don't see a reason to have SAB_Inquire - as far as I can
tell it's practically impossible to use without race conditions? Except
for raising an error - which is "builtin"...
Pushed 0002.
Thanks!
Thank you for your work on this!
On 2021-Jun-11, Álvaro Herrera wrote:
I tried hard to make this stable, but it just isn't (it works fine for one
thousand runs, then I grab some coffee, run it once more, and that one
fails; why is not clear to me). Attached is the last one I have,
in case somebody wants to make it better. Maybe there's some completely
different approach that works better, but I'm out of ideas for now.
It occurred to me that this could be made better by sigstopping both
walreceiver and walsender, then letting only the latter run; AFAICS this
makes the test stable. I'll register this on the upcoming commitfest to
let cfbot run it, and if it looks good there I'll get it pushed to
master. If there's any problem I'll just remove it before beta2 is
stamped.
--
Álvaro Herrera               Valdivia, Chile
Attachments:
Apologies, I inadvertently sent the version before I added a maximum
number of iterations in the final loop.
--
Álvaro Herrera               Valdivia, Chile
"Strength does not lie in physical means
but resides in an indomitable will" (Gandhi)
Attachments:
Álvaro Herrera <alvherre@alvh.no-ip.org> writes:
It occurred to me that this could be made better by sigstopping both
walreceiver and walsender, then letting only the latter run; AFAICS this
makes the test stable. I'll register this on the upcoming commitfest to
let cfbot run it, and if it looks good there I'll get it pushed to
master. If there's any problem I'll just remove it before beta2 is
stamped.
Hmm ... desmoxytes has failed this test once, out of four runs since
it went in:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04
None of the other animals that have reported in so far are unhappy.
Still, maybe that's not a track record we want to have for beta2?
I've just launched a run on gaur, which given its dinosaur status
might be the most likely animal to have an actual portability problem
with this test technique. If you want to wait a few hours to see what
it says, that'd be fine with me.
regards, tom lane
Hah, desmoxytes failed once:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04
I'll revert it and investigate later. There have been no other
failures.
--
Álvaro Herrera    39°49'30"S 73°17'W
"One must remember that existence in the cosmos, and particularly the
development of civilizations within it, are not, unfortunately,
idyllic at all" (Ijon Tichy)
I wrote:
Hmm ... desmoxytes has failed this test once, out of four runs since
it went in:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04
I studied this failure a bit more, and I think the test itself has
a race condition. It's doing
# freeze walsender and walreceiver. Slot will still be active, but walreceiver
# won't get anything anymore.
kill 'STOP', $senderpid, $receiverpid;
$logstart = get_log_size($node_primary3);
advance_wal($node_primary3, 4);
ok(find_in_log($node_primary3, "to release replication slot", $logstart),
"walreceiver termination logged");
The string it's looking for does show up in node_primary3's log, but
not for another second or so; we can see instances of the following
poll_query_until query before that happens. So the problem is that
there is no interlock to ensure that the walreceiver terminates
before this find_in_log check looks for it.
You should be able to fix this by adding a retry loop around the
find_in_log check (which would likely mean that you don't need
to do multiple advance_wal iterations here).
However, I agree with reverting the test for now and then trying
again after beta2.
regards, tom lane
I wrote:
I studied this failure a bit more, and I think the test itself has
a race condition. It's doing
# freeze walsender and walreceiver. Slot will still be active, but walreceiver
# won't get anything anymore.
kill 'STOP', $senderpid, $receiverpid;
$logstart = get_log_size($node_primary3);
advance_wal($node_primary3, 4);
ok(find_in_log($node_primary3, "to release replication slot", $logstart),
"walreceiver termination logged");
Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond. But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart. I think you
need to capture that value before the kill not after.
I also suggest that it wouldn't be a bad idea to make the
find_in_log check more specific, by including the expected PID
and perhaps the expected slot name in the string. The full
message in primary3's log looks like
2021-06-19 05:24:36.221 CEST [60cd636f.362648:12] LOG: terminating process 3548959 to release replication slot "rep3"
and I don't understand why we wouldn't match on the whole
message text. (I think doing so will also reveal that what
we are looking for here is the walsender pid, not the walreceiver
pid, and thus that the description in the ok() call is backwards.
Or maybe we do want to check the walreceiver side, in which case
we are searching the wrong postmaster's log?)
regards, tom lane
On 2021-Jun-20, Tom Lane wrote:
Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond. But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart. I think you
need to capture that value before the kill not after.
I accounted for all those things and pushed again.
--
Álvaro Herrera               Valdivia, Chile
"I can see support will not be a problem. 10 out of 10." (Simon Wittber)
(http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Jun-20, Tom Lane wrote:
Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond. But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart. I think you
need to capture that value before the kill not after.
I accounted for all those things and pushed again.
I saw that this patch is pushed. If there is no pending work left for
this, can we change the commitfest entry to Committed?
Regards,
Vignesh
On 2021-Jul-05, vignesh C wrote:
On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Jun-20, Tom Lane wrote:
Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond. But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart. I think you
need to capture that value before the kill not after.
I accounted for all those things and pushed again.
I saw that this patch is pushed. If there is no pending work left for
this, can we change the commitfest entry to Committed?
There is none that I'm aware of; please mark it committed. Thanks!
--
Álvaro Herrera         PostgreSQL Developer — https://www.EnterpriseDB.com/
"The gods must always be fed, even when the earth is dry" (Orual)
On Mon, Jul 5, 2021 at 10:30 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Jul-05, vignesh C wrote:
On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Jun-20, Tom Lane wrote:
Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond. But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart. I think you
need to capture that value before the kill not after.
I accounted for all those things and pushed again.
I saw that this patch is pushed. If there is no pending work left for
this, can we change the commitfest entry to Committed?
There is none that I'm aware of; please mark it committed. Thanks!
Thanks for confirming, I have marked it as committed.
Regards,
Vignesh
Hi,
On 2021-06-11 12:27:57 -0400, Álvaro Herrera wrote:
On 2021-Jun-10, Álvaro Herrera wrote:
Here's a version that I feel is committable (0001). There was an issue
when returning from the inner loop, if in a previous iteration we had
released the lock. In that case we need to return with the lock not
held, so that the caller can acquire it again, but we weren't. This seems
pretty hard to hit in practice (I suppose somebody needs to destroy the
slot just as checkpointer killed the walsender, but before checkpointer
marks it as its own) ... but if it did happen, I think checkpointer
would block with no recourse. Also added some comments and slightly
restructured the code.
There are plenty of conflicts in pg13, but it's all easy to handle.
Pushed, with additional minor changes.
I stared at this code, due to [1], and I think I found a bug. I think it's not
the cause of the failures in that thread, but we probably should still do
something about it.
I think the minor changes might unfortunately have introduced a race? Before
the patch just used ConditionVariableSleep(), but now it also has a
ConditionVariablePrepareToSleep(), without re-checking the sleep condition
until it reaches
/* Wait until the slot is released. */
ConditionVariableSleep(&s->active_cv,
WAIT_EVENT_REPLICATION_SLOT_DROP);
which directly violates what ConditionVariablePrepareToSleep() documents:
* This can optionally be called before entering a test/sleep loop.
* Doing so is more efficient if we'll need to sleep at least once.
* However, if the first test of the exit condition is likely to succeed,
* it's more efficient to omit the ConditionVariablePrepareToSleep call.
* See comments in ConditionVariableSleep for more detail.
*
* Caution: "before entering the loop" means you *must* test the exit
* condition between calling ConditionVariablePrepareToSleep and calling
* ConditionVariableSleep. If that is inconvenient, omit calling
* ConditionVariablePrepareToSleep.
Afaics this means we can potentially sleep forever if the prior owner of the
slot releases it before the ConditionVariablePrepareToSleep().
There's a comment that's mentioning this danger:
/*
* Prepare the sleep on the slot's condition variable before
* releasing the lock, to close a possible race condition if the
* slot is released before the sleep below.
*/
ConditionVariablePrepareToSleep(&s->active_cv);
LWLockRelease(ReplicationSlotControlLock);
but afaics that is bogus, because releasing a slot doesn't take
ReplicationSlotControlLock. That just protects against the slot being dropped,
not against it being released.
We can ConditionVariablePrepareToSleep() here, but we'd have to do it
earlier, before the checks at the start of the while loop.
Greetings,
Andres Freund
[1]: /messages/by-id/20220218231415.c4plkp4i3reqcwip@alap3.anarazel.de
Hi,
On 2022-02-22 17:48:55 -0800, Andres Freund wrote:
I think the minor changes might unfortunately have introduced a race? Before
the patch just used ConditionVariableSleep(), but now it also has a
ConditionVariablePrepareToSleep(), without re-checking the sleep condition
until it reaches
/* Wait until the slot is released. */
ConditionVariableSleep(&s->active_cv,
WAIT_EVENT_REPLICATION_SLOT_DROP);
which directly violates what ConditionVariablePrepareToSleep() documents:
* This can optionally be called before entering a test/sleep loop.
* Doing so is more efficient if we'll need to sleep at least once.
* However, if the first test of the exit condition is likely to succeed,
* it's more efficient to omit the ConditionVariablePrepareToSleep call.
* See comments in ConditionVariableSleep for more detail.
*
* Caution: "before entering the loop" means you *must* test the exit
* condition between calling ConditionVariablePrepareToSleep and calling
* ConditionVariableSleep. If that is inconvenient, omit calling
* ConditionVariablePrepareToSleep.
Afaics this means we can potentially sleep forever if the prior owner of the
slot releases it before the ConditionVariablePrepareToSleep().
There's a comment that's mentioning this danger:
/*
* Prepare the sleep on the slot's condition variable before
* releasing the lock, to close a possible race condition if the
* slot is released before the sleep below.
*/
ConditionVariablePrepareToSleep(&s->active_cv);
LWLockRelease(ReplicationSlotControlLock);
but afaics that is bogus, because releasing a slot doesn't take
ReplicationSlotControlLock. That just protects against the slot being dropped,
not against it being released.
We can ConditionVariablePrepareToSleep() here, but we'd have to do it earlier,
before the checks at the start of the while loop.
Not at the start of the while loop, outside of the while loop. Doing it in the
loop body doesn't make sense, even if it's at the top. Each
ConditionVariablePrepareToSleep() will unregister itself:
/*
* If some other sleep is already prepared, cancel it; this is necessary
* because we have just one static variable tracking the prepared sleep,
* and also only one cvWaitLink in our PGPROC. It's okay to do this
* because whenever control does return to the other test-and-sleep loop,
* its ConditionVariableSleep call will just re-establish that sleep as
* the prepared one.
*/
if (cv_sleep_target != NULL)
ConditionVariableCancelSleep();
The intended use is documented in this comment:
* This should be called in a predicate loop that tests for a specific exit
* condition and otherwise sleeps, like so:
*
* ConditionVariablePrepareToSleep(cv); // optional
* while (condition for which we are waiting is not true)
* ConditionVariableSleep(cv, wait_event_info);
* ConditionVariableCancelSleep();
Greetings,
Andres Freund