More race conditions in logical replication
I noticed a recent failure that looked suspiciously like a race condition:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
The critical bit in the log file is
error running SQL: 'psql:<stdin>:1: ERROR: could not drop the replication slot "tap_sub" on publisher
DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID 3866790'
while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION tap_sub' at /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
After poking at it a bit, I found that I can cause several different
failures of this ilk in the subscription tests by injecting delays at
the points where a slot's active_pid is about to be cleared, as in the
attached patch (which also adds some extra printouts for debugging
purposes; none of that is meant for commit). It seems clear that there
is inadequate interlocking going on when we kill and restart a logical
rep worker: we're trying to start a new one before the old one has
gotten out of the slot.
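For illustration, the kind of delay injection meant here is roughly the following (a minimal sketch, not the attached patch; the exact spot in ReplicationSlotRelease() and the sleep length are assumptions):

/*
 * Widen the race window just before the slot is marked inactive, so that a
 * concurrent DROP SUBSCRIPTION reliably sees the slot as still active.
 * (Illustrative only.)
 */
pg_usleep(100 * 1000);      /* 100 ms, purely to provoke the failure */

SpinLockAcquire(&slot->mutex);
slot->active_pid = 0;       /* the slot only becomes available here */
SpinLockRelease(&slot->mutex);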
I'm not particularly interested in fixing this myself, so I'm just
going to add it to the open items list.
regards, tom lane
Attachments:
break_repl_slot_management.patch (text/x-diff)
On Sun, Jul 02, 2017 at 07:54:48PM -0400, Tom Lane wrote:
I noticed a recent failure that looked suspiciously like a race condition:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
The critical bit in the log file is
error running SQL: 'psql:<stdin>:1: ERROR: could not drop the replication slot "tap_sub" on publisher
DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID 3866790'
while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION tap_sub' at /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
After poking at it a bit, I found that I can cause several different
failures of this ilk in the subscription tests by injecting delays at
the points where a slot's active_pid is about to be cleared, as in the
attached patch (which also adds some extra printouts for debugging
purposes; none of that is meant for commit). It seems clear that there
is inadequate interlocking going on when we kill and restart a logical
rep worker: we're trying to start a new one before the old one has
gotten out of the slot.
I'm not particularly interested in fixing this myself, so I'm just
going to add it to the open items list.
[Action required within three days. This is a generic notification.]
The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.
[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
Noah Misch wrote:
The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item.
I volunteer to own this item. My next update is going to be on or
before Friday 7th at 19:00 Chilean time, though I don't think I can have
this ready before then. I expect a fix for this to miss next week's
beta release, but I'll try to get it done sooner.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/07/17 01:54, Tom Lane wrote:
I noticed a recent failure that looked suspiciously like a race condition:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
The critical bit in the log file is
error running SQL: 'psql:<stdin>:1: ERROR: could not drop the replication slot "tap_sub" on publisher
DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID 3866790'
while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION tap_sub' at /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
After poking at it a bit, I found that I can cause several different
failures of this ilk in the subscription tests by injecting delays at
the points where a slot's active_pid is about to be cleared, as in the
attached patch (which also adds some extra printouts for debugging
purposes; none of that is meant for commit). It seems clear that there
is inadequate interlocking going on when we kill and restart a logical
rep worker: we're trying to start a new one before the old one has
gotten out of the slot.
Thanks for the test case.
It's not actually that we start the new worker too fast. It's that we try to
drop the slot right after the worker process was killed, but if the code that
clears the slot's active_pid takes too long, the slot still looks like it's
in use. I am quite sure it's possible to make this happen for physical
replication as well when using slots.
This is not something that can be solved by locking on the subscriber. ISTM
we need to make pg_drop_replication_slot behave more nicely, like making
it wait for the slot to become available (either by default or as an
option).
I'll have to think about how to do that without rewriting half of
replication slots or reimplementing a lock queue, though, because
replication slots don't use normal catalog access, so there is no object
locking with a wait queue. We could use a latch wait with a small timeout,
but that seems ugly, since the function can be called by a user on a slot
that is still actively in use, so the wait could be quite long (as in
"forever").
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06/07/17 17:33, Petr Jelinek wrote:
On 03/07/17 01:54, Tom Lane wrote:
I noticed a recent failure that looked suspiciously like a race condition:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
The critical bit in the log file is
error running SQL: 'psql:<stdin>:1: ERROR: could not drop the replication slot "tap_sub" on publisher
DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID 3866790'
while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION tap_sub' at /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
After poking at it a bit, I found that I can cause several different
failures of this ilk in the subscription tests by injecting delays at
the points where a slot's active_pid is about to be cleared, as in the
attached patch (which also adds some extra printouts for debugging
purposes; none of that is meant for commit). It seems clear that there
is inadequate interlocking going on when we kill and restart a logical
rep worker: we're trying to start a new one before the old one has
gotten out of the slot.
Thanks for the test case.
It's not actually that we start new worker fast. It's that we try to
drop the slot right after worker process was killed but if the code that
clears slot's active_pid takes too long, it still looks like it's being
used. I am quite sure it's possible to make this happen for physical
replication as well when using slots.
This is not something that can be solved by locking on subscriber. ISTM
we need to make pg_drop_replication_slot behave more nicely, like making
it wait for the slot to become available (either by default or as an
option).
I'll have to think about how to do it without rewriting half of
replication slots or reimplementing lock queue though because the
replication slots don't use normal catalog access so there is no object
locking with wait queue. We could use some latch wait with small timeout
but that seems ugly as that function can be called by user without
having dropped the slot before so the wait can be quite long (as in
"forever").
A naive fix would be something like the attached. But as I said, it's not
exactly pretty.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Wait-for-slot-to-become-free-in-when-dropping-it.patch (text/x-patch)
On 06/07/17 18:20, Petr Jelinek wrote:
On 06/07/17 17:33, Petr Jelinek wrote:
On 03/07/17 01:54, Tom Lane wrote:
I noticed a recent failure that looked suspiciously like a race condition:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
The critical bit in the log file is
error running SQL: 'psql:<stdin>:1: ERROR: could not drop the replication slot "tap_sub" on publisher
DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID 3866790'
while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION tap_sub' at /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
After poking at it a bit, I found that I can cause several different
failures of this ilk in the subscription tests by injecting delays at
the points where a slot's active_pid is about to be cleared, as in the
attached patch (which also adds some extra printouts for debugging
purposes; none of that is meant for commit). It seems clear that there
is inadequate interlocking going on when we kill and restart a logical
rep worker: we're trying to start a new one before the old one has
gotten out of the slot.
Thanks for the test case.
It's not actually that we start new worker fast. It's that we try to
drop the slot right after worker process was killed but if the code that
clears slot's active_pid takes too long, it still looks like it's being
used. I am quite sure it's possible to make this happen for physical
replication as well when using slots.
This is not something that can be solved by locking on subscriber. ISTM
we need to make pg_drop_replication_slot behave more nicely, like making
it wait for the slot to become available (either by default or as an
option).
I'll have to think about how to do it without rewriting half of
replication slots or reimplementing lock queue though because the
replication slots don't use normal catalog access so there is no object
locking with wait queue. We could use some latch wait with small timeout
but that seems ugly as that function can be called by user without
having dropped the slot before so the wait can be quite long (as in
"forever").Naive fix would be something like attached. But as I said, it's not
exactly pretty.
So the best idea I could come up with is to make use of the new condition
variable API. That lets us wait on a variable which can be set per slot.
It's not backportable, however; I am not sure how big a problem that is
considering the lack of complaints until now (maybe we could backport
using the ugly timeout version?).
The attached patch is a prototype of such a solution, and there are still
some race conditions (the variable can get signaled before the waiting
process starts sleeping on it). I am mainly sending it to get feedback on
the approach.
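To give an idea of the shape of it (a rough sketch only, assuming a per-slot condition variable field here called active_cv; the attached prototype may differ in detail):

/* In ReplicationSlot (shared memory): a per-slot condition variable. */
ConditionVariable   active_cv;              /* assumed field name */

/* When releasing the slot: clear active_pid, then wake any waiters. */
SpinLockAcquire(&slot->mutex);
slot->active_pid = 0;
SpinLockRelease(&slot->mutex);
ConditionVariableBroadcast(&slot->active_cv);

/* When dropping (or acquiring) the slot: wait until nobody holds it. */
for (;;)
{
    pid_t       active_pid;

    SpinLockAcquire(&slot->mutex);
    active_pid = slot->active_pid;
    SpinLockRelease(&slot->mutex);

    if (active_pid == 0)
        break;

    /* wait event chosen arbitrarily for the sketch */
    ConditionVariableSleep(&slot->active_cv, PG_WAIT_LOCK);
}
ConditionVariableCancelSleep();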
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v2-0001-Wait-for-slot-to-become-free-in-when-dropping-it.patch (text/x-patch)
Petr Jelinek wrote:
So best idea I could come up with is to make use of the new condition
variable API. That lets us wait for variable which can be set per slot.
It's not backportable however, I am not sure how big problem that is
considering the lack of complaints until now (maybe we could backport
using the ugly timeout version?).
The attached patch is a prototype of such solution and there are some
race conditions (variable can get signaled before the waiting process
starts sleeping for it). I am mainly sending it to get feedback on the
approach.
This one looks a lot better than the original one.
I wonder if it's possible to do this using LWLocks instead of condition
variables (similar to how we do the "wait for I/O" thing for both slots
and buffers). We could backport that easily ...
I'll next update this on or before Monday 10th at 19:00 CLT.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
I'll next update this on or before Monday 10th at 19:00 CLT.
Schedule note --- we'll be wrapping beta2 on Monday, a couple hours
before that. Although it'd be great to have this issue fixed before
beta2, jamming in a patch just a few hours before wrap is probably
not prudent. If you can't get it done over the weekend, I'd counsel
holding off till after beta2 is tagged.
regards, tom lane
Alvaro Herrera wrote:
I'll next update this on or before Monday 10th at 19:00 CLT.
I couldn't get to this today as I wanted. Next update on Wednesday
12th, same time.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Petr Jelinek wrote:
So best idea I could come up with is to make use of the new condition
variable API. That lets us wait for variable which can be set per slot.
Here's a cleaned up version of that patch, which I intend to get in the
tree tomorrow. I verified that this allows the subscription tests to
pass with Tom's sleep additions.
I noticed one point where we're reading the active_pid without locking;
marked it with an XXX comment. Not yet sure whether this is a bug or
not.
I noticed something funny in CREATE_REPLICATION_SLOT; apparently we first
create a slot and set it active, then we activate it by name. With the
current coding it seems to work fine, because we're careful to make
activation idempotent, but it looks convoluted. I don't think this is
new, though.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v3-0001-Wait-for-slot-to-become-free-in-when-dropping-it.patch (text/plain)
On Wed, Jul 12, 2017 at 06:48:28PM -0400, Alvaro Herrera wrote:
Petr Jelinek wrote:
So best idea I could come up with is to make use of the new condition
variable API. That lets us wait for variable which can be set per slot.
Here's a cleaned up version of that patch, which I intend to get in the
tree tomorrow. I verified that this allows the subscription tests to
pass with Tom's sleep additions.
This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
After going over this a bunch more times, I found other problems. For
example, I noticed that if I create a temporary slot in one session,
then another session would rightly go to sleep if it tries to drop that
slot. But the slot goes away when the first session disconnects, so I'd
expect the sleeping session to get a signal and wake up, but that
doesn't happen.
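Presumably the at-disconnect cleanup of temporary slots needs to broadcast on the slot's condition variable too, something like this (a sketch only; function and field names are the ones assumed elsewhere in this thread, not necessarily what the final patch will do):

/*
 * Sketch for the temporary-slot cleanup path run at backend exit
 * (ReplicationSlotCleanup()): after this backend's temporary slot is
 * dropped, wake anyone sleeping on it; otherwise the sleeper never
 * notices that the slot is gone.
 */
if (s->active_pid == MyProcPid && s->data.persistency == RS_TEMPORARY)
{
    ReplicationSlotDropPtr(s);                  /* slot no longer exists */
    ConditionVariableBroadcast(&s->active_cv);  /* wake up any waiters   */
}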
I'm going to continue to look at this and report back Tuesday 18th.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
After going over this a bunch more times, I found other problems. For
example, I noticed that if I create a temporary slot in one session,
then another session would rightly go to sleep if it tries to drop that
slot. But the slot goes away when the first session disconnects, so I'd
expect the sleeping session to get a signal and wake up, but that
doesn't happen.
I'm going to continue to look at this and report back Tuesday 18th.
I got tangled up in a really tough problem this week, so I won't have
time to work on this for a couple of days. I can next update tomorrow
19:00 CLT (probably not with a final fix yet, though).
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I'm back at looking into this again, after a rather exhausting week. I
think I have a better grasp of what is going on in this code now, and it
appears to me that we should change the locking so that active_pid is
protected by ReplicationSlotControlLock instead of the slot's spinlock;
but I haven't analyzed the idea fully yet and I don't have the patch
done yet either. I'm going to report, hopefully with a fix committed
this time, on Monday at 19:00 CLT.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
I'm back at looking into this again, after a rather exhausting week. I
think I have a better grasp of what is going on in this code now,
Here's an updated patch, which I intend to push tomorrow.
and it
appears to me that we should change the locking so that active_pid is
protected by ReplicationSlotControlLock instead of the slot's spinlock;
but I haven't analyzed the idea fully yet and I don't have the patch
done yet either.
This doesn't seem to be a good idea actually.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v4-0001-Wait-for-slot-to-become-free-in-when-dropping-it.patch (text/plain)
On 25/07/17 01:33, Alvaro Herrera wrote:
Alvaro Herrera wrote:
I'm back at looking into this again, after a rather exhausting week. I
think I have a better grasp of what is going on in this code now,
Here's an updated patch, which I intend to push tomorrow.
I think the ConditionVariablePrepareToSleep() call in
ReplicationSlotAcquire() needs to be done while holding the slot's mutex;
otherwise there is a tiny race condition where the process which has the
slot acquired might release it between the mutex unlock and the
ConditionVariablePrepareToSleep() call, which means we'll never get
signaled and ConditionVariableSleep() will wait forever.
Otherwise the patch looks good to me.
As a side note, ConditionVariablePrepareToSleep()'s comment could be
improved, because currently it says the only advantage is that we skip the
double test at the beginning of ConditionVariableSleep(). But that's not
the whole story: it's essential for preventing race conditions like the
one above, because it puts the current process onto the wait list, so we
can be sure it will be signaled on broadcast once
ConditionVariablePrepareToSleep() has been called.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek wrote:
On 25/07/17 01:33, Alvaro Herrera wrote:
Alvaro Herrera wrote:
I'm back at looking into this again, after a rather exhausting week. I
think I have a better grasp of what is going on in this code now,
Here's an updated patch, which I intend to push tomorrow.
I think the ConditionVariablePrepareToSleep() call in
ReplicationSlotAcquire() needs to be done inside the mutex lock
otherwise there is tiny race condition where the process which has slot
acquired might release the slot between the mutex unlock and the
ConditionVariablePrepareToSleep() call which means we'll never get
signaled and ConditionVariableSleep() will wait forever.
Hmm, yeah, that's not good. However, I didn't like the idea of putting
it inside the locked area, as that's too much code. Instead I added it
just before acquiring the spinlock. We cancel the sleep unconditionally
later on if we didn't need to sleep after all.
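In other words, roughly this ordering (a sketch, with the per-slot condition variable field assumed to be called active_cv):

/*
 * Register as a waiter before checking active_pid, so a broadcast that
 * fires between the check and the sleep cannot be missed.
 */
ConditionVariablePrepareToSleep(&slot->active_cv);

SpinLockAcquire(&slot->mutex);
active_pid = slot->active_pid;
SpinLockRelease(&slot->mutex);

if (active_pid != 0)
    ConditionVariableSleep(&slot->active_cv, PG_WAIT_LOCK);

/* Cancel unconditionally; harmless if we never actually slept. */
ConditionVariableCancelSleep();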
As I see it, we need to backpatch at least parts of this patch. I've
received reports that in earlier branches pglogical and BDR can
sometimes leave slots behind when removing nodes, and I have a hunch
that it can be explained by the bugs being fixed here. Now we cannot
use condition variables in back-branches, so we'll need to figure out
how to actually do it ...
As a side note, the ConditionVariablePrepareToSleep()'s comment could be
improved because currently it says the only advantage is that we skip
double-test in the beginning of ConditionVariableSleep(). But that's not
true, it's essential for preventing race conditions like the one above
because it puts the current process into waiting list so we can be sure
it will be signaled on broadcast once ConditionVariablePrepareToSleep()
has been called.
Hmm.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 25, 2017 at 1:42 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
As I see it, we need to backpatch at least parts of this patch. I've
received reports that in earlier branches pglogical and BDR can
sometimes leave slots behind when removing nodes, and I have a hunch
that it can be explained by the bugs being fixed here. Now we cannot
use condition variables in back-branches, so we'll need to figure out
how to actually do it ...
If all you had to back-patch was the condition variable code itself,
that might not really be all that bad, but it depends on the
WaitEventSet stuff, which I think is far too dangerous to back-patch.
However, you might be able to create a dumber, slower version that
only uses WaitLatch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 25, 2017 at 5:47 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
As a side note, the ConditionVariablePrepareToSleep()'s comment could be
improved because currently it says the only advantage is that we skip
double-test in the beginning of ConditionVariableSleep(). But that's not
true, it's essential for preventing race conditions like the one above
because it puts the current process into waiting list so we can be sure
it will be signaled on broadcast once ConditionVariablePrepareToSleep()
has been called.
But if you don't call ConditionVariablePrepareToSleep() before calling
ConditionVariableSleep(), then the first call to the latter will call
the former and return without doing anything else. So I don't see how
this can ever go wrong if you're using these primitives as documented.
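For reference, the documented pattern is just this (illustrative sketch; condition_is_satisfied() and the wait event stand in for whatever the caller needs):

/*
 * The first ConditionVariableSleep() call only registers us as a waiter
 * and returns immediately; we then re-test the condition, so a wakeup
 * between the test and the sleep cannot be lost.
 */
for (;;)
{
    if (condition_is_satisfied())
        break;
    ConditionVariableSleep(&cv, PG_WAIT_EXTENSION);
}
ConditionVariableCancelSleep();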
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Alvaro Herrera wrote:
Hmm, yeah, that's not good. However, I didn't like the idea of putting
it inside the locked area, as it's too much code. Instead I added just
before acquiring the spinlock. We cancel the sleep unconditionally
later on if we didn't need to sleep after all.
I just noticed that Jacana failed again in the subscription test, and it
looks like it's related to this. I'll take a look tomorrow if no one
beats me to it.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-07-26%2014%3A39%3A54
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services