An example of bugs for Hot Standby

Started by Hiroyuki Yamadaover 16 years ago29 messageshackers
Jump to latest
#1Hiroyuki Yamada
yamada@kokolink.net

Hot Standby node can freeze when startup process calls LockBufferForCleanup().
This bug can be reproduced by the following procedure.

0. start Hot Standby, with one active node(node A) and one standby node(node B)
1. create table X and table Y in node A
2. insert several rows in table X in node A
3. delete one row from table X in node A
4. begin xact 1 in node A, execute following commands, and leave xact 1 open
4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
5. wait until WAL's for above actions are applied in node B
6. begin xact 2 in node B, and execute following commands
6.1 DECLARE CURSOR test_cursor FOR SELECT * FROM table X;
6.2 FETCH test_cursor;
6.3 SELECT * FROM table Y;
7. execute VACUUM FREEZE table A in node A
8. commit xact 1 in node A

...then in node B occurs following "deadlock" situation, which is not detected by deadlock check.
* startup process waits for xact 2 to release buffers in table X (in LockBufferForCleanup())
* xact 2 waits for startup process to release ACCESS EXCLUSIVE lock in table Y

This situation can occur when
a) a transaction in the standby node tries to acquire ACCESS SHARE lock while holding some buffers
b) startup process calls LockBufferForCleanup() for any of the buffers

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yamada@kokolink.net

#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Hiroyuki Yamada (#1)
Re: An example of bugs for Hot Standby

On Tue, 2009-12-15 at 20:11 +0900, Hiroyuki Yamada wrote:

Hot Standby node can freeze when startup process calls LockBufferForCleanup().
This bug can be reproduced by the following procedure.

0. start Hot Standby, with one active node(node A) and one standby node(node B)
1. create table X and table Y in node A
2. insert several rows in table X in node A
3. delete one row from table X in node A
4. begin xact 1 in node A, execute following commands, and leave xact 1 open
4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
5. wait until WAL's for above actions are applied in node B
6. begin xact 2 in node B, and execute following commands
6.1 DECLARE CURSOR test_cursor FOR SELECT * FROM table X;
6.2 FETCH test_cursor;
6.3 SELECT * FROM table Y;
7. execute VACUUM FREEZE table A in node A
8. commit xact 1 in node A

...then in node B occurs following "deadlock" situation, which is not detected by deadlock check.
* startup process waits for xact 2 to release buffers in table X (in LockBufferForCleanup())
* xact 2 waits for startup process to release ACCESS EXCLUSIVE lock in table Y

Deadlock bug was prevented by stop-gap measure in December commit.

Full resolution patch attached for Startup process waits on buffer pins.

Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.

Reviews welcome, otherwise commit at end of week.

--
Simon Riggs www.2ndQuadrant.com

Attachments:

startup_sigalrm.patchtext/x-patch; charset=UTF-8; name=startup_sigalrm.patchDownload+390-115
#3Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#2)
Re: An example of bugs for Hot Standby

On Tuesday 19 January 2010 11:46:24 Simon Riggs wrote:

On Tue, 2009-12-15 at 20:11 +0900, Hiroyuki Yamada wrote:

Hot Standby node can freeze when startup process calls
LockBufferForCleanup(). This bug can be reproduced by the following
procedure.

0. start Hot Standby, with one active node(node A) and one standby
node(node B) 1. create table X and table Y in node A
2. insert several rows in table X in node A
3. delete one row from table X in node A
4. begin xact 1 in node A, execute following commands, and leave xact 1
open 4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
5. wait until WAL's for above actions are applied in node B
6. begin xact 2 in node B, and execute following commands
6.1 DECLARE CURSOR test_cursor FOR SELECT * FROM table X;
6.2 FETCH test_cursor;
6.3 SELECT * FROM table Y;
7. execute VACUUM FREEZE table A in node A
8. commit xact 1 in node A

...then in node B occurs following "deadlock" situation, which is not
detected by deadlock check.

* startup process waits for xact 2 to release buffers in table X (in
LockBufferForCleanup()) * xact 2 waits for startup process to release
ACCESS EXCLUSIVE lock in table Y

Deadlock bug was prevented by stop-gap measure in December commit.

Full resolution patch attached for Startup process waits on buffer pins.

Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.

Wouldnt it be more foolproof to also loop around sending the FATAL? Not that
its likely but...

From HoldingBufferPinThatDelaysRecovery youre calling
GetStartupBufferPinWaitBufId - that sounds a bit dangerous because that one is
acquiring a spinlock which can also get taken at other places. Its not the
most likely scenario, but it would certainly be annoying to debug.

Is there any supported platform with sizeof(sig_atomic_t) <4 - I would doubt
so? If not the locking in GetStartupBufferPinWaitBufId and
SetStartupBufferPinWaitBufId shouldnt be needed?

Same issue issue (and more likely to trigger) exists with CheckStandbyTimeout-

SendRecoveryConflictWithBufferPin->CancelDBBackends

Too tired to look further.

Greetings,
Andres

PS: Sorry for not doing anything on the idle front - I have been burried alive
the last days.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: An example of bugs for Hot Standby

Andres Freund <andres@anarazel.de> writes:

Is there any supported platform with sizeof(sig_atomic_t) <4 - I would doubt
so?

Er ... what? I believe there are live platforms with sig_atomic_t = char.
If we're assuming more that's a must-fix.

regards, tom lane

#5Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: An example of bugs for Hot Standby

On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Is there any supported platform with sizeof(sig_atomic_t) <4 - I would
doubt so?

Er ... what? I believe there are live platforms with sig_atomic_t = char.
If we're assuming more that's a must-fix.

So were assuming genereally that a integer cannot be read/written
atomatically?
Or was that only relating to the actualy signal.h typedef?

Andres

#6Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#3)
Re: An example of bugs for Hot Standby

On Wed, 2010-01-20 at 06:14 +0100, Andres Freund wrote:

Full resolution patch attached for Startup process waits on buffer pins.

Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.

Wouldnt it be more foolproof to also loop around sending the FATAL? Not that
its likely but...

More foolproof and much less accurate. The Startup process doesn't know
who is holding the buffer pin that blocks it, so it could not target a
FATAL.

From HoldingBufferPinThatDelaysRecovery youre calling
GetStartupBufferPinWaitBufId - that sounds a bit dangerous because that one is
acquiring a spinlock which can also get taken at other places. Its not the
most likely scenario, but it would certainly be annoying to debug.

Spinlock. It isn't held for long in any situation. What problem do you
foresee?

Is there any supported platform with sizeof(sig_atomic_t) <4 - I would doubt
so? If not the locking in GetStartupBufferPinWaitBufId and
SetStartupBufferPinWaitBufId shouldnt be needed?

I prefer spinlocking.

Same issue issue (and more likely to trigger) exists with CheckStandbyTimeout-

SendRecoveryConflictWithBufferPin->CancelDBBackends

I don't see an issue.

--
Simon Riggs www.2ndQuadrant.com

#7Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: An example of bugs for Hot Standby

On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Is there any supported platform with sizeof(sig_atomic_t) <4 - I would
doubt so?

Er ... what? I believe there are live platforms with sig_atomic_t = char.
If we're assuming more that's a must-fix.

The reason I have asked is that the code is doing things like:
/*
* Used by backends when they receive a request to check for buffer pin waits.
*/
int
GetStartupBufferPinWaitBufId(void)
{
int bufid;

/* use volatile pointer to prevent code rearrangement */
volatile PROC_HDR *procglobal = ProcGlobal;

SpinLockAcquire(ProcStructLock);

bufid = procglobal->startupBufferPinWaitBufId;

SpinLockRelease(ProcStructLock);

return bufid;
}

or similar things with LWLockAcquire in a signal handler which strikes me as a
not that good idea. As at least on x86 reading an integer is atomic the above
spinlock is pointless. My cross arch experience is barely existing, so...

Andres

#8Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#6)
Re: An example of bugs for Hot Standby

On Wednesday 20 January 2010 10:40:10 Simon Riggs wrote:

On Wed, 2010-01-20 at 06:14 +0100, Andres Freund wrote:

Full resolution patch attached for Startup process waits on buffer
pins.

Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.

Wouldnt it be more foolproof to also loop around sending the FATAL? Not
that its likely but...

More foolproof and much less accurate. The Startup process doesn't know
who is holding the buffer pin that blocks it, so it could not target a
FATAL.

From HoldingBufferPinThatDelaysRecovery youre calling
GetStartupBufferPinWaitBufId - that sounds a bit dangerous because that
one is acquiring a spinlock which can also get taken at other places.
Its not the most likely scenario, but it would certainly be annoying to
debug.

Spinlock. It isn't held for long in any situation. What problem do you
foresee?

If any backend is signalled while currently holding the ProcStructLock there
is a basically unrecoverable deadlock - its not likely but possible.

Is there any supported platform with sizeof(sig_atomic_t) <4 - I would
doubt so? If not the locking in GetStartupBufferPinWaitBufId and
SetStartupBufferPinWaitBufId shouldnt be needed?

I prefer spinlocking.

Well, its deadlock land taking the same lock inside and outside of a signal
handler...

Andres

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#7)
Re: An example of bugs for Hot Standby

On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:

LWLockAcquire

I'm using spinlocks, not lwlocks.

--
Simon Riggs www.2ndQuadrant.com

#10Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#9)
Re: An example of bugs for Hot Standby

On Wednesday 20 January 2010 10:52:24 Simon Riggs wrote:

On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:

LWLockAcquire

I'm using spinlocks, not lwlocks.

CancelDBBackends which is used in SendRecoveryConflictWithBufferPin which in
turn used by CheckStandbyTimeout triggered by SIGALRM acquires the lwlock.

Now that case is a bit less dangerous because you would have to interrupt
yourself to trigger a deadlock there because the code sleeps soon after
setting up the handler.
If ever two SIGALRM occur consecutive there is a problem.

Andres

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#10)
Re: An example of bugs for Hot Standby

On Wed, 2010-01-20 at 11:04 +0100, Andres Freund wrote:

On Wednesday 20 January 2010 10:52:24 Simon Riggs wrote:

On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:

LWLockAcquire

I'm using spinlocks, not lwlocks.

CancelDBBackends which is used in SendRecoveryConflictWithBufferPin which in
turn used by CheckStandbyTimeout triggered by SIGALRM acquires the lwlock.

Those are used in similar ways to deadlock detection.

Now that case is a bit less dangerous because you would have to interrupt
yourself to trigger a deadlock there because the code sleeps soon after
setting up the handler.
If ever two SIGALRM occur consecutive there is a problem.

I'll protect against subsequent calls.

--
Simon Riggs www.2ndQuadrant.com

#12Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#11)
Re: An example of bugs for Hot Standby

On Wednesday 20 January 2010 11:33:05 Simon Riggs wrote:

On Wed, 2010-01-20 at 11:04 +0100, Andres Freund wrote:

On Wednesday 20 January 2010 10:52:24 Simon Riggs wrote:

On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:

LWLockAcquire

I'm using spinlocks, not lwlocks.

CancelDBBackends which is used in SendRecoveryConflictWithBufferPin which
in turn used by CheckStandbyTimeout triggered by SIGALRM acquires the
lwlock.

Those are used in similar ways to deadlock detection.

But only if
ImmediateInterruptOK && InterruptHoldoffCount == 0 && CritSectionCount == 0 -
which is not the case with HoldingBufferPinThatDelaysRecovery.

Andres

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: An example of bugs for Hot Standby

Andres Freund <andres@anarazel.de> writes:

On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:

Er ... what? I believe there are live platforms with sig_atomic_t = char.
If we're assuming more that's a must-fix.

The reason I have asked is that the code is doing things like:
[ grabbing a spinlock to read a single integer ]

Yes, I think we probably actually need that. The problem is not so
much whether the read is an atomic operation as whether you can rely
on getting an up-to-date value. On multiprocessors with weak memory
ordering you need some type of "sync" instruction to be sure you will
see a value that was recently written by another processor. Currently,
we embed such instructions in the spinlock acquire/release code.
There's been some discussion of exposing memory sync independently
of lock acquisition; perhaps that would be enough here, but I haven't
looked at the surrounding logic enough to say.

My complaint at the top was responding to the idea that someone might
be supposing the specific type sig_atomic_t was at least as wide as
int. That's a different matter altogether. We do assume in some places
that we can read or write the specific type TransactionId indivisibly,
but we don't try to declare it as sig_atomic_t.

or similar things with LWLockAcquire in a signal handler

[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?

regards, tom lane

#14Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: An example of bugs for Hot Standby

On Wednesday 20 January 2010 17:30:04 Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:

Er ... what? I believe there are live platforms with sig_atomic_t =
char. If we're assuming more that's a must-fix.

The reason I have asked is that the code is doing things like:
[ grabbing a spinlock to read a single integer ]

Yes, I think we probably actually need that. The problem is not so
much whether the read is an atomic operation as whether you can rely
on getting an up-to-date value. On multiprocessors with weak memory
ordering you need some type of "sync" instruction to be sure you will
see a value that was recently written by another processor. Currently,
we embed such instructions in the spinlock acquire/release code.
There's been some discussion of exposing memory sync independently
of lock acquisition; perhaps that would be enough here, but I haven't
looked at the surrounding logic enough to say.

I think it should be enough.

I realize its way too late in the cycle for that, but why dont we start using
some library for easy cross platform atomic ops? I think libatomic or such
should support the required platforms.

My complaint at the top was responding to the idea that someone might
be supposing the specific type sig_atomic_t was at least as wide as
int. That's a different matter altogether. We do assume in some places
that we can read or write the specific type TransactionId indivisibly,
but we don't try to declare it as sig_atomic_t.

So we already assume that? Fine.

(yes, the sig_atomic_t was a sidetrack - I had memorized it wrongly as "the
biggest value that can be read/written atomically which is *clearly* wrong)

or similar things with LWLockAcquire in a signal handler

[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?

Yes the patch does that at two places. Thats what I was complaining about and
what triggered my sig_atomic_t question because of the above explained
misunderstanding.

Andres

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: An example of bugs for Hot Standby

Andres Freund <andres@anarazel.de> writes:

I realize its way too late in the cycle for that, but why dont we start using
some library for easy cross platform atomic ops?

(1) there probably isn't one that does exactly what we want, works
everywhere, and has the right license;
(2) what actual gain would we get? We've already done the work.

[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?

Yes the patch does that at two places.

That's a must-fix.

regards, tom lane

#16Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#15)
Re: An example of bugs for Hot Standby

Hi Tom, Hi Simon,

On Wednesday 20 January 2010 17:59:36 Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I realize its way too late in the cycle for that, but why dont we start
using some library for easy cross platform atomic ops?

(1) there probably isn't one that does exactly what we want, works
everywhere, and has the right license;
(2) what actual gain would we get? We've already done the work.

That there might be some other instructions were interested in?
Like really atomic increment?

[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?

Yes the patch does that at two places.

That's a must-fix.

Its code intended to fix a existing problem not already comitted code. But
otherwise I definitely agree.

Andres

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#16)
Synchronization primitives (Was: Re: An example of bugs for Hot Standby)

Andres Freund wrote:

On Wednesday 20 January 2010 17:59:36 Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I realize its way too late in the cycle for that, but why dont we start
using some library for easy cross platform atomic ops?

(1) there probably isn't one that does exactly what we want, works
everywhere, and has the right license;
(2) what actual gain would we get? We've already done the work.

That there might be some other instructions were interested in?
Like really atomic increment?

This reminds me of something I've been pondering for some time:

Streaming Replication introduces a few places with a polling pattern
like this (in pseudocode):

while()
{
/* Check if variable in shared has advanced beoynd X */
SpinLockAcquire()
localvar = sharedvar;
SpinLockRelease()
if (localvar > X)
break;

/* Not yet. Sleep
pg_usleep(100);
}

For example, startup process polls like that to wait for walreceiver to
write & flush new WAL to be replayed. And in master, walsender polls
like that for new WAL to be generated, so that it can be sent to
standby. Hot standby also has a polling loop where it waits for a
transaction a transaction to die, though I'm not sure if that can be fit
into the same model.

That's OK for asynchronous replication, but unacceptable for synchronous
mode.

It would be nice to have a new synchronization primitive for that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#17)
Re: Synchronization primitives (Was: Re: An example of bugs for Hot Standby)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Streaming Replication introduces a few places with a polling pattern
like this (in pseudocode):

while()
{
/* Check if variable in shared has advanced beoynd X */
SpinLockAcquire()
localvar = sharedvar;
SpinLockRelease()
if (localvar > X)
break;

/* Not yet. Sleep
pg_usleep(100);
}

I trust there's a CHECK_FOR_INTERRUPTS in there ...

It would be nice to have a new synchronization primitive for that.

Maybe. The lock, the variable, the comparison operation, and the sleep
time all seem rather specific to each application. Not sure that it'd
really buy much to try to turn it into a generic subroutine.

regards, tom lane

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#17)
Re: Synchronization primitives (Was: Re: An example of bugs for Hot Standby)

On Wed, 2010-01-20 at 20:00 +0200, Heikki Linnakangas wrote:

Hot standby also has a polling loop where it waits for a
transaction a transaction to die, though I'm not sure if that can be
fit into the same model

I prefer that in the context of HS because the Startup process is
waiting for things to die. Given that their death may not be handled
sweetly, I would not wish to rely on that to wake Startup.

In the other two cases you mention all processes are working together
normally and we aren't expecting the other processes to die.

--
Simon Riggs www.2ndQuadrant.com

#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#14)
Re: An example of bugs for Hot Standby

On Wed, 2010-01-20 at 17:40 +0100, Andres Freund wrote:

or similar things with LWLockAcquire in a signal handler

[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?

Yes the patch does that at two places.

I think it would be more sensible to discuss specific code and issues,
rather than have general discussions about various horrors.

You've already pointed out that I need to prevent multiple sigalrm
interrupts using boolean flags; I've already said that I would do that.
The use of locks themselves are clearly not a problem, since the
existing sigalrm handler takes LWlocks for deadlock detection. The
problem is just about being called multiple times.

The code in HoldingBufferPinThatDelaysRecovery() also needs protection
against being interrupted multiple times, but we should note that a
second signal of that type is not going to arrive from anywhere inside
the server and requires an explicit user action. The locking isn't
strictly necessary since the value is only read when the only process
that ever writes that value is sleeping on a semaphore. The single
integer value can always be read atomically anyway.

So I will remove the locking in XXXStartupBufferPinWaitBufId(), add in
the booleans and we're done.

--
Simon Riggs www.2ndQuadrant.com

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#18)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#21)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#22)
#24David Fetter
david@fetter.org
In reply to: Heikki Linnakangas (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#24)
#26Hiroyuki Yamada
yamada@kokolink.net
In reply to: Simon Riggs (#2)
#27Simon Riggs
simon@2ndQuadrant.com
In reply to: Hiroyuki Yamada (#26)
#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Hiroyuki Yamada (#26)
#29Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#15)