An example of bugs for Hot Standby
Hot Standby node can freeze when startup process calls LockBufferForCleanup().
This bug can be reproduced by the following procedure.
0. start Hot Standby, with one active node(node A) and one standby node(node B)
1. create table X and table Y in node A
2. insert several rows in table X in node A
3. delete one row from table X in node A
4. begin xact 1 in node A, execute following commands, and leave xact 1 open
4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
5. wait until WAL's for above actions are applied in node B
6. begin xact 2 in node B, and execute following commands
6.1 DECLARE CURSOR test_cursor FOR SELECT * FROM table X;
6.2 FETCH test_cursor;
6.3 SELECT * FROM table Y;
7. execute VACUUM FREEZE table A in node A
8. commit xact 1 in node A
...then in node B occurs following "deadlock" situation, which is not detected by deadlock check.
* startup process waits for xact 2 to release buffers in table X (in LockBufferForCleanup())
* xact 2 waits for startup process to release ACCESS EXCLUSIVE lock in table Y
This situation can occur when
a) a transaction in the standby node tries to acquire ACCESS SHARE lock while holding some buffers
b) startup process calls LockBufferForCleanup() for any of the buffers
regards,
--
Hiroyuki YAMADA
Kokolink Corporation
yamada@kokolink.net
On Tue, 2009-12-15 at 20:11 +0900, Hiroyuki Yamada wrote:
Hot Standby node can freeze when startup process calls LockBufferForCleanup().
This bug can be reproduced by the following procedure.0. start Hot Standby, with one active node(node A) and one standby node(node B)
1. create table X and table Y in node A
2. insert several rows in table X in node A
3. delete one row from table X in node A
4. begin xact 1 in node A, execute following commands, and leave xact 1 open
4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
5. wait until WAL's for above actions are applied in node B
6. begin xact 2 in node B, and execute following commands
6.1 DECLARE CURSOR test_cursor FOR SELECT * FROM table X;
6.2 FETCH test_cursor;
6.3 SELECT * FROM table Y;
7. execute VACUUM FREEZE table A in node A
8. commit xact 1 in node A...then in node B occurs following "deadlock" situation, which is not detected by deadlock check.
* startup process waits for xact 2 to release buffers in table X (in LockBufferForCleanup())
* xact 2 waits for startup process to release ACCESS EXCLUSIVE lock in table Y
Deadlock bug was prevented by stop-gap measure in December commit.
Full resolution patch attached for Startup process waits on buffer pins.
Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.
Reviews welcome, otherwise commit at end of week.
--
Simon Riggs www.2ndQuadrant.com
Attachments:
startup_sigalrm.patchtext/x-patch; charset=UTF-8; name=startup_sigalrm.patchDownload
*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
***************
*** 2399,2405 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</listitem>
<listitem>
<para>
! Waiting to acquire buffer cleanup locks (for which there is no time out)
</para>
</listitem>
<listitem>
--- 2399,2405 ----
</listitem>
<listitem>
<para>
! Waiting to acquire buffer cleanup locks
</para>
</listitem>
<listitem>
***************
*** 2536,2546 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
Three-way deadlocks are possible between AccessExclusiveLocks arriving from
the primary, cleanup WAL records that require buffer cleanup locks and
user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks
! are currently resolved by the cancellation of user processes that would
! need to wait on a lock. This is heavy-handed and generates more query
! cancellations than we need to, though does remove the possibility of deadlock.
! This behaviour is expected to improve substantially for the main release
! version of 8.5.
</para>
<para>
--- 2536,2542 ----
Three-way deadlocks are possible between AccessExclusiveLocks arriving from
the primary, cleanup WAL records that require buffer cleanup locks and
user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks
! are resolved by time-out when we exceed <varname>max_standby_delay</>.
</para>
<para>
***************
*** 2630,2640 **** LOG: database system is ready to accept read only connections
<varname>max_standby_delay</> or even set it to zero, though that is a
very aggressive setting. If the standby server is tasked as an additional
server for decision support queries then it may be acceptable to set this
! to a value of many hours (in seconds). It is also possible to set
! <varname>max_standby_delay</> to -1 which means wait forever for queries
! to complete, if there are conflicts; this will be useful when performing
! an archive recovery from a backup.
! </para>
<para>
Transaction status "hint bits" written on primary are not WAL-logged,
--- 2626,2632 ----
<varname>max_standby_delay</> or even set it to zero, though that is a
very aggressive setting. If the standby server is tasked as an additional
server for decision support queries then it may be acceptable to set this
! to a value of many hours (in seconds).
<para>
Transaction status "hint bits" written on primary are not WAL-logged,
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1825,1838 **** archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"' # Windows
<listitem>
<para>
When server acts as a standby, this parameter specifies a wait policy
! for queries that conflict with incoming data changes. Valid settings
! are -1, meaning wait forever, or a wait time of 0 or more seconds.
! If a conflict should occur the server will delay up to this
! amount before it begins trying to resolve things less amicably, as
described in <xref linkend="hot-standby-conflict">. Typically,
this parameter makes sense only during replication, so when
! performing an archive recovery to recover from data loss a
! parameter setting of 0 is recommended. The default is 30 seconds.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
--- 1825,1839 ----
<listitem>
<para>
When server acts as a standby, this parameter specifies a wait policy
! for queries that conflict with data changes being replayed by recovery.
! If a conflict should occur the server will delay up to this number
! of seconds before it begins trying to resolve things less amicably, as
described in <xref linkend="hot-standby-conflict">. Typically,
this parameter makes sense only during replication, so when
! performing an archive recovery to recover from data loss a very high
! parameter setting is recommended. The default is 30 seconds.
! There is no wait-forever setting because of the potential for deadlock
! which that setting would introduce.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 8759,8767 **** StartupProcessMain(void)
*/
pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
pqsignal(SIGINT, SIG_IGN); /* ignore query cancel */
! pqsignal(SIGTERM, StartupProcShutdownHandler); /* request shutdown */
! pqsignal(SIGQUIT, startupproc_quickdie); /* hard crash time */
! pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, SIG_IGN);
pqsignal(SIGUSR2, SIG_IGN);
--- 8759,8770 ----
*/
pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
pqsignal(SIGINT, SIG_IGN); /* ignore query cancel */
! pqsignal(SIGTERM, StartupProcShutdownHandler); /* request shutdown */
! pqsignal(SIGQUIT, startupproc_quickdie); /* hard crash time */
! if (XLogRequestRecoveryConnections)
! pqsignal(SIGALRM, handle_standby_sig_alarm); /* ignored unless InHotStandby */
! else
! pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, SIG_IGN);
pqsignal(SIGUSR2, SIG_IGN);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 44,49 ****
--- 44,50 ----
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+ #include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner.h"
***************
*** 2417,2430 **** LockBufferForCleanup(Buffer buffer)
PinCountWaitBuf = bufHdr;
UnlockBufHdr(bufHdr);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/* Wait to be signaled by UnpinBuffer() */
! ProcWaitForSignal();
PinCountWaitBuf = NULL;
/* Loop back and try again */
}
}
/*
* ConditionalLockBufferForCleanup - as above, but don't wait to get the lock
*
* We won't loop, but just check once to see if the pin count is OK. If
--- 2418,2459 ----
PinCountWaitBuf = bufHdr;
UnlockBufHdr(bufHdr);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
/* Wait to be signaled by UnpinBuffer() */
! if (InHotStandby)
! {
! /* Share the bufid that Startup process waits on */
! SetStartupBufferPinWaitBufId(buffer - 1);
! /* Set alarm and then wait to be signaled by UnpinBuffer() */
! ResolveRecoveryConflictWithBufferPin();
! SetStartupBufferPinWaitBufId(-1);
! }
! else
! ProcWaitForSignal();
!
PinCountWaitBuf = NULL;
/* Loop back and try again */
}
}
/*
+ * Check called from RecoveryConflictInterrupt handler when Startup
+ * process requests cancelation of all pin holders that are blocking it.
+ */
+ bool
+ HoldingBufferPinThatDelaysRecovery(void)
+ {
+ int bufid = GetStartupBufferPinWaitBufId();
+
+ Assert(bufid >= 0);
+
+ if (PrivateRefCount[bufid] > 0)
+ return true;
+
+ return false;
+ }
+
+ /*
* ConditionalLockBufferForCleanup - as above, but don't wait to get the lock
*
* We won't loop, but just check once to see if the pin count is OK. If
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 1620,1634 **** GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
* latestCompletedXid since doing so would be a performance issue during
* normal running, so we check it essentially for free on the standby.
*
! * If dbOid is valid we skip backends attached to other databases. Some
! * callers choose to skipExistingConflicts.
*
* Be careful to *not* pfree the result from this function. We reuse
* this array sufficiently often that we use malloc for the result.
*/
VirtualTransactionId *
! GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid,
! bool skipExistingConflicts)
{
static VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
--- 1620,1632 ----
* latestCompletedXid since doing so would be a performance issue during
* normal running, so we check it essentially for free on the standby.
*
! * If dbOid is valid we skip backends attached to other databases.
*
* Be careful to *not* pfree the result from this function. We reuse
* this array sufficiently often that we use malloc for the result.
*/
VirtualTransactionId *
! GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
{
static VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
***************
*** 1667,1675 **** GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid,
if (proc->pid == 0)
continue;
- if (skipExistingConflicts && proc->recoveryConflictPending)
- continue;
-
if (!OidIsValid(dbOid) ||
proc->databaseId == dbOid)
{
--- 1665,1670 ----
***************
*** 1826,1832 **** CountDBBackends(Oid databaseid)
* CancelDBBackends --- cancel backends that are using specified database
*/
void
! CancelDBBackends(Oid databaseid)
{
ProcArrayStruct *arrayP = procArray;
int index;
--- 1821,1827 ----
* CancelDBBackends --- cancel backends that are using specified database
*/
void
! CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
{
ProcArrayStruct *arrayP = procArray;
int index;
***************
*** 1839,1851 **** CancelDBBackends(Oid databaseid)
{
volatile PGPROC *proc = arrayP->procs[index];
! if (proc->databaseId == databaseid)
{
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
! proc->recoveryConflictPending = true;
pid = proc->pid;
if (pid != 0)
{
--- 1834,1846 ----
{
volatile PGPROC *proc = arrayP->procs[index];
! if (databaseid == InvalidOid || proc->databaseId == databaseid)
{
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
! proc->recoveryConflictPending = conflictPending;
pid = proc->pid;
if (pid != 0)
{
***************
*** 1853,1860 **** CancelDBBackends(Oid databaseid)
* Kill the pid if it's still here. If not, that's what we wanted
* so ignore any errors.
*/
! (void) SendProcSignal(pid, PROCSIG_RECOVERY_CONFLICT_DATABASE,
! procvxid.backendId);
}
}
}
--- 1848,1854 ----
* Kill the pid if it's still here. If not, that's what we wanted
* so ignore any errors.
*/
! (void) SendProcSignal(pid, sigmode, procvxid.backendId);
}
}
}
*** a/src/backend/storage/ipc/procsignal.c
--- b/src/backend/storage/ipc/procsignal.c
***************
*** 272,276 **** procsignal_sigusr1_handler(SIGNAL_ARGS)
--- 272,279 ----
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+ if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
+ RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+
errno = save_errno;
}
*** a/src/backend/storage/ipc/standby.c
--- b/src/backend/storage/ipc/standby.c
***************
*** 126,135 **** WaitExceedsMaxStandbyDelay(void)
long delay_secs;
int delay_usecs;
- /* max_standby_delay = -1 means wait forever, if necessary */
- if (MaxStandbyDelay < 0)
- return false;
-
/* Are we past max_standby_delay? */
TimestampDifference(GetLatestXLogTime(), GetCurrentTimestamp(),
&delay_secs, &delay_usecs);
--- 126,131 ----
***************
*** 241,248 **** ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid)
VirtualTransactionId *backends;
backends = GetConflictingVirtualXIDs(latestRemovedXid,
! InvalidOid,
! true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
--- 237,243 ----
VirtualTransactionId *backends;
backends = GetConflictingVirtualXIDs(latestRemovedXid,
! InvalidOid);
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
***************
*** 273,280 **** ResolveRecoveryConflictWithTablespace(Oid tsid)
* non-transactional.
*/
temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
! InvalidOid,
! false);
ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
}
--- 268,274 ----
* non-transactional.
*/
temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
! InvalidOid);
ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
}
***************
*** 295,301 **** ResolveRecoveryConflictWithDatabase(Oid dbid)
*/
while (CountDBBackends(dbid) > 0)
{
! CancelDBBackends(dbid);
/*
* Wait awhile for them to die so that we avoid flooding an
--- 289,295 ----
*/
while (CountDBBackends(dbid) > 0)
{
! CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, true);
/*
* Wait awhile for them to die so that we avoid flooding an
***************
*** 331,338 **** ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
else
{
backends = GetConflictingVirtualXIDs(InvalidTransactionId,
! InvalidOid,
! true);
report_memory_error = true;
}
--- 325,331 ----
else
{
backends = GetConflictingVirtualXIDs(InvalidTransactionId,
! InvalidOid);
report_memory_error = true;
}
***************
*** 346,351 **** ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
--- 339,451 ----
}
/*
+ * ResolveRecoveryConflictWithBufferPin is called from LockBufferForCleanup()
+ * to resolve conflicts with other backends holding buffer pins.
+ *
+ * We either resolve conflicts immediately or set a SIGALRM to wake us at
+ * the limit of our patience. The sleep in LockBufferForCleanup() is
+ * performed here, for code clarity.
+ *
+ * Resolve conflict by sending a SIGUSR1 reason to all backends to check if
+ * they hold one of the buffer pins that is blocking Startup process. If so,
+ * backends will take an appropriate error action, ERROR or FATAL.
+ *
+ * A secondary purpose of this is to avoid deadlocks that might occur between
+ * the Startup process and lock waiters. Deadlocks occur because if queries
+ * wait on a lock, that must be behind an AccessExclusiveLock, which can only
+ * be clared if the Startup process replays a transaction completion record.
+ * If Startup process is waiting then that is a deadlock. If we allowed a
+ * setting of max_standby_delay that meant "wait forever" we would then need
+ * special code to protect against deadlock. Such deadlocks are rare, so the
+ * code would be almost certainly buggy, so we avoid both long waits and
+ * deadlocks using the same mechanism.
+ */
+ void
+ ResolveRecoveryConflictWithBufferPin(void)
+ {
+ bool sig_alarm_enabled = false;
+
+ Assert(InHotStandby);
+
+ /*
+ * Signal immediately or set alarm for later.
+ */
+ if (MaxStandbyDelay == 0)
+ SendRecoveryConflictWithBufferPin();
+ else
+ {
+ TimestampTz now;
+ long standby_delay_secs; /* How far Startup process is lagging */
+ int standby_delay_usecs;
+
+ now = GetCurrentTimestamp();
+
+ /* Are we past max_standby_delay? */
+ TimestampDifference(GetLatestXLogTime(), now,
+ &standby_delay_secs, &standby_delay_usecs);
+
+ if (standby_delay_secs >= (long) MaxStandbyDelay)
+ SendRecoveryConflictWithBufferPin();
+ else
+ {
+ TimestampTz fin_time; /* Expected wake-up time by timer */
+ long timer_delay_secs; /* Amount of time we set timer for */
+ int timer_delay_usecs = 0;
+
+ /*
+ * How much longer we should wait?
+ */
+ timer_delay_secs = MaxStandbyDelay - standby_delay_secs;
+ if (standby_delay_usecs > 0)
+ {
+ timer_delay_secs -= 1;
+ timer_delay_usecs = 1000000 - standby_delay_usecs;
+ }
+
+ /*
+ * It's possible that the difference is less than a microsecond;
+ * ensure we don't cancel, rather than set, the interrupt.
+ */
+ if (timer_delay_secs == 0 && timer_delay_usecs == 0)
+ timer_delay_usecs = 1;
+
+ /*
+ * When is the finish time? We recheck this if we are woken early.
+ */
+ fin_time = TimestampTzPlusMilliseconds(now,
+ (timer_delay_secs * 1000) +
+ (timer_delay_usecs / 1000));
+
+ if (enable_standby_sig_alarm(timer_delay_secs, timer_delay_usecs, fin_time))
+ sig_alarm_enabled = true;
+ else
+ elog(FATAL, "could not set timer for process wakeup");
+ }
+ }
+
+ /* Wait to be signaled by UnpinBuffer() */
+ ProcWaitForSignal();
+
+ if (sig_alarm_enabled)
+ {
+ if (!disable_standby_sig_alarm())
+ elog(FATAL, "could not disable timer for process wakeup");
+ }
+ }
+
+ void
+ SendRecoveryConflictWithBufferPin(void)
+ {
+ /*
+ * We send signal to all backends to ask them if they are holding
+ * the buffer pin which is delaying the Startup process. We must
+ * not set the conflict flag yet, since most backends will be innocent.
+ * Let the SIGUSR1 handling in each backend decide their own fate.
+ */
+ CancelDBBackends(InvalidOid, PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, false);
+ }
+
+ /*
* -----------------------------------------------------
* Locking in Recovery Mode
* -----------------------------------------------------
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
***************
*** 815,839 **** LockAcquireExtended(const LOCKTAG *locktag,
}
/*
- * In Hot Standby we abort the lock wait if Startup process is waiting
- * since this would result in a deadlock. The deadlock occurs because
- * if we are waiting it must be behind an AccessExclusiveLock, which
- * can only clear when a transaction completion record is replayed.
- * If Startup process is waiting we never will clear that lock, so to
- * wait for it just causes a deadlock.
- */
- if (RecoveryInProgress() && !InRecovery &&
- locktag->locktag_type == LOCKTAG_RELATION)
- {
- LWLockRelease(partitionLock);
- ereport(ERROR,
- (errcode(ERRCODE_T_R_DEADLOCK_DETECTED),
- errmsg("possible deadlock detected"),
- errdetail("process conflicts with recovery - please resubmit query later"),
- errdetail_log("process conflicts with recovery")));
- }
-
- /*
* Set bitmask of locks this process already holds on this object.
*/
MyProc->heldLocks = proclock->holdMask;
--- 815,820 ----
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 73,78 **** NON_EXEC_STATIC PGPROC *AuxiliaryProcs = NULL;
--- 73,79 ----
static LOCALLOCK *lockAwaited = NULL;
/* Mark these volatile because they can be changed by signal handler */
+ static volatile bool standby_timeout_active = false;
static volatile bool statement_timeout_active = false;
static volatile bool deadlock_timeout_active = false;
static volatile DeadLockState deadlock_state = DS_NOT_YET_CHECKED;
***************
*** 89,94 **** static void RemoveProcFromArray(int code, Datum arg);
--- 90,96 ----
static void ProcKill(int code, Datum arg);
static void AuxiliaryProcKill(int code, Datum arg);
static bool CheckStatementTimeout(void);
+ static bool CheckStandbyTimeout(void);
/*
***************
*** 107,112 **** ProcGlobalShmemSize(void)
--- 109,116 ----
size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
/* ProcStructLock */
size = add_size(size, sizeof(slock_t));
+ /* startupBufferPinWaitBufId */
+ size = add_size(size, sizeof(NBuffers));
return size;
}
***************
*** 487,497 **** PublishStartupProcessInformation(void)
--- 491,540 ----
procglobal->startupProc = MyProc;
procglobal->startupProcPid = MyProcPid;
+ procglobal->startupBufferPinWaitBufId = 0;
SpinLockRelease(ProcStructLock);
}
/*
+ * Used from bufgr to share the value of the buffer that Startup waits on,
+ * or to reset the value to "not waiting" (-1). This allows processing
+ * of recovery conflicts for buffer pins.
+ */
+ void
+ SetStartupBufferPinWaitBufId(int bufid)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile PROC_HDR *procglobal = ProcGlobal;
+
+ SpinLockAcquire(ProcStructLock);
+
+ procglobal->startupBufferPinWaitBufId = bufid;
+
+ SpinLockRelease(ProcStructLock);
+ }
+
+ /*
+ * Used by backends when they receive a request to check for buffer pin waits.
+ */
+ int
+ GetStartupBufferPinWaitBufId(void)
+ {
+ int bufid;
+
+ /* use volatile pointer to prevent code rearrangement */
+ volatile PROC_HDR *procglobal = ProcGlobal;
+
+ SpinLockAcquire(ProcStructLock);
+
+ bufid = procglobal->startupBufferPinWaitBufId;
+
+ SpinLockRelease(ProcStructLock);
+
+ return bufid;
+ }
+
+ /*
* Check whether there are at least N free PGPROC objects.
*
* Note: this is designed on the assumption that N will generally be small.
***************
*** 1542,1548 **** CheckStatementTimeout(void)
/*
! * Signal handler for SIGALRM
*
* Process deadlock check and/or statement timeout check, as needed.
* To avoid various edge cases, we must be careful to do nothing
--- 1585,1591 ----
/*
! * Signal handler for SIGALRM for normal user backends
*
* Process deadlock check and/or statement timeout check, as needed.
* To avoid various edge cases, we must be careful to do nothing
***************
*** 1565,1567 **** handle_sig_alarm(SIGNAL_ARGS)
--- 1608,1714 ----
errno = save_errno;
}
+
+ /*
+ * Signal handler for SIGALRM in Startup process
+ *
+ * To avoid various edge cases, we must be careful to do nothing
+ * when there is nothing to be done. We also need to be able to
+ * reschedule the timer interrupt if called before end of statement.
+ */
+ bool
+ enable_standby_sig_alarm(long delay_s, int delay_us, TimestampTz fin_time)
+ {
+ struct itimerval timeval;
+
+ Assert(delay_s >= 0 && delay_us >= 0);
+
+ statement_fin_time = fin_time;
+
+ standby_timeout_active = true;
+
+ MemSet(&timeval, 0, sizeof(struct itimerval));
+ timeval.it_value.tv_sec = delay_s;
+ timeval.it_value.tv_usec = delay_us;
+ if (setitimer(ITIMER_REAL, &timeval, NULL))
+ return false;
+ return true;
+ }
+
+ bool
+ disable_standby_sig_alarm(void)
+ {
+ /*
+ * Always disable the interrupt if it is active; this avoids being
+ * interrupted by the signal handler and thereby possibly getting
+ * confused.
+ *
+ * We will re-enable the interrupt if necessary in CheckStandbyTimeout.
+ */
+ if (standby_timeout_active)
+ {
+ struct itimerval timeval;
+
+ MemSet(&timeval, 0, sizeof(struct itimerval));
+ if (setitimer(ITIMER_REAL, &timeval, NULL))
+ {
+ standby_timeout_active = false;
+ return false;
+ }
+ }
+
+ return true;
+ }
+
+ /*
+ * CheckStandbyTimeout() runs unconditionally in the Startup process
+ * SIGALRM handler. Timers will only be set when InHotStandby.
+ * We simply ignore any signals unless the timer has been set.
+ */
+ static bool
+ CheckStandbyTimeout(void)
+ {
+ TimestampTz now;
+
+ if (!standby_timeout_active)
+ return true; /* do nothing if not active */
+
+ now = GetCurrentTimestamp();
+
+ if (now >= statement_fin_time)
+ SendRecoveryConflictWithBufferPin();
+ else
+ {
+ /* Not time yet, so (re)schedule the interrupt */
+ long secs;
+ int usecs;
+ struct itimerval timeval;
+
+ TimestampDifference(now, statement_fin_time,
+ &secs, &usecs);
+
+ /*
+ * It's possible that the difference is less than a microsecond;
+ * ensure we don't cancel, rather than set, the interrupt.
+ */
+ if (secs == 0 && usecs == 0)
+ usecs = 1;
+ MemSet(&timeval, 0, sizeof(struct itimerval));
+ timeval.it_value.tv_sec = secs;
+ timeval.it_value.tv_usec = usecs;
+ if (setitimer(ITIMER_REAL, &timeval, NULL))
+ return false;
+ }
+
+ return true;
+ }
+
+ void
+ handle_standby_sig_alarm(SIGNAL_ARGS)
+ {
+ int save_errno = errno;
+
+ (void) CheckStandbyTimeout();
+
+ errno = save_errno;
+ }
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 2718,2723 **** RecoveryConflictInterrupt(ProcSignalReason reason)
--- 2718,2735 ----
{
switch (reason)
{
+ case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+ /*
+ * If we aren't blocking the Startup process there is
+ * nothing more to do.
+ */
+ if (!HoldingBufferPinThatDelaysRecovery())
+ return;
+
+ MyProc->recoveryConflictPending = true;
+
+ /* Intentional drop through to error handling */
+
case PROCSIG_RECOVERY_CONFLICT_LOCK:
case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 1383,1389 **** static struct config_int ConfigureNamesInt[] =
NULL
},
&MaxStandbyDelay,
! 30, -1, INT_MAX, NULL, NULL
},
{
--- 1383,1389 ----
NULL
},
&MaxStandbyDelay,
! 30, 0, INT_MAX, NULL, NULL
},
{
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 198,203 **** extern void LockBuffer(Buffer buffer, int mode);
--- 198,204 ----
extern bool ConditionalLockBuffer(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+ extern bool HoldingBufferPinThatDelaysRecovery(void);
extern void AbortBufferIO(void);
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 16,22 ****
#include "storage/lock.h"
#include "storage/pg_sema.h"
!
/*
* Each backend advertises up to PGPROC_MAX_CACHED_SUBXIDS TransactionIds
--- 16,22 ----
#include "storage/lock.h"
#include "storage/pg_sema.h"
! #include "utils/timestamp.h"
/*
* Each backend advertises up to PGPROC_MAX_CACHED_SUBXIDS TransactionIds
***************
*** 145,150 **** typedef struct PROC_HDR
--- 145,152 ----
/* The proc of the Startup process, since not in ProcArray */
PGPROC *startupProc;
int startupProcPid;
+ /* Buffer id of the buffer that Startup process waits for pin on */
+ int startupBufferPinWaitBufId;
} PROC_HDR;
/*
***************
*** 177,182 **** extern void InitProcessPhase2(void);
--- 179,186 ----
extern void InitAuxiliaryProcess(void);
extern void PublishStartupProcessInformation(void);
+ extern void SetStartupBufferPinWaitBufId(int bufid);
+ extern int GetStartupBufferPinWaitBufId(void);
extern bool HaveNFreeProcs(int n);
extern void ProcReleaseLocks(bool isCommit);
***************
*** 194,197 **** extern bool enable_sig_alarm(int delayms, bool is_statement_timeout);
--- 198,205 ----
extern bool disable_sig_alarm(bool is_statement_timeout);
extern void handle_sig_alarm(SIGNAL_ARGS);
+ extern bool enable_standby_sig_alarm(long delay_s, int delay_us, TimestampTz fin_time);
+ extern bool disable_standby_sig_alarm(void);
+ extern void handle_standby_sig_alarm(SIGNAL_ARGS);
+
#endif /* PROC_H */
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
***************
*** 57,69 **** extern bool IsBackendPid(int pid);
extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
bool excludeXmin0, bool allDbs, int excludeVacuum,
int *nvxids);
! extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin,
! Oid dbOid, bool skipExistingConflicts);
extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
extern int CountActiveBackends(void);
extern int CountDBBackends(Oid databaseid);
! extern void CancelDBBackends(Oid databaseid);
extern int CountUserBackends(Oid roleid);
extern bool CountOtherDBBackends(Oid databaseId,
int *nbackends, int *nprepared);
--- 57,68 ----
extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
bool excludeXmin0, bool allDbs, int excludeVacuum,
int *nvxids);
! extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid);
extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
extern int CountActiveBackends(void);
extern int CountDBBackends(Oid databaseid);
! extern void CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending);
extern int CountUserBackends(Oid roleid);
extern bool CountOtherDBBackends(Oid databaseId,
int *nbackends, int *nprepared);
*** a/src/include/storage/procsignal.h
--- b/src/include/storage/procsignal.h
***************
*** 37,42 **** typedef enum
--- 37,43 ----
PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
PROCSIG_RECOVERY_CONFLICT_LOCK,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
+ PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
NUM_PROCSIGNALS /* Must be last! */
} ProcSignalReason;
*** a/src/include/storage/standby.h
--- b/src/include/storage/standby.h
***************
*** 19,30 ****
extern int vacuum_defer_cleanup_age;
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
! extern void InitRecoveryTransactionEnvironment(void);
! extern void ShutdownRecoveryTransactionEnvironment(void);
/*
* Standby Rmgr (RM_STANDBY_ID)
--- 19,33 ----
extern int vacuum_defer_cleanup_age;
+ extern void InitRecoveryTransactionEnvironment(void);
+ extern void ShutdownRecoveryTransactionEnvironment(void);
+
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
! extern void ResolveRecoveryConflictWithBufferPin(void);
! extern void SendRecoveryConflictWithBufferPin(void);
/*
* Standby Rmgr (RM_STANDBY_ID)
On Tuesday 19 January 2010 11:46:24 Simon Riggs wrote:
On Tue, 2009-12-15 at 20:11 +0900, Hiroyuki Yamada wrote:
Hot Standby node can freeze when startup process calls
LockBufferForCleanup(). This bug can be reproduced by the following
procedure.0. start Hot Standby, with one active node(node A) and one standby
node(node B) 1. create table X and table Y in node A
2. insert several rows in table X in node A
3. delete one row from table X in node A
4. begin xact 1 in node A, execute following commands, and leave xact 1
open 4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
5. wait until WAL's for above actions are applied in node B
6. begin xact 2 in node B, and execute following commands
6.1 DECLARE CURSOR test_cursor FOR SELECT * FROM table X;
6.2 FETCH test_cursor;
6.3 SELECT * FROM table Y;
7. execute VACUUM FREEZE table A in node A
8. commit xact 1 in node A...then in node B occurs following "deadlock" situation, which is not
detected by deadlock check.* startup process waits for xact 2 to release buffers in table X (in
LockBufferForCleanup()) * xact 2 waits for startup process to release
ACCESS EXCLUSIVE lock in table YDeadlock bug was prevented by stop-gap measure in December commit.
Full resolution patch attached for Startup process waits on buffer pins.
Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.
Wouldnt it be more foolproof to also loop around sending the FATAL? Not that
its likely but...
From HoldingBufferPinThatDelaysRecovery youre calling
GetStartupBufferPinWaitBufId - that sounds a bit dangerous because that one is
acquiring a spinlock which can also get taken at other places. Its not the
most likely scenario, but it would certainly be annoying to debug.
Is there any supported platform with sizeof(sig_atomic_t) <4 - I would doubt
so? If not the locking in GetStartupBufferPinWaitBufId and
SetStartupBufferPinWaitBufId shouldnt be needed?
Same issue issue (and more likely to trigger) exists with CheckStandbyTimeout-
SendRecoveryConflictWithBufferPin->CancelDBBackends
Too tired to look further.
Greetings,
Andres
PS: Sorry for not doing anything on the idle front - I have been burried alive
the last days.
Andres Freund <andres@anarazel.de> writes:
Is there any supported platform with sizeof(sig_atomic_t) <4 - I would doubt
so?
Er ... what? I believe there are live platforms with sig_atomic_t = char.
If we're assuming more that's a must-fix.
regards, tom lane
On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Is there any supported platform with sizeof(sig_atomic_t) <4 - I would
doubt so?Er ... what? I believe there are live platforms with sig_atomic_t = char.
If we're assuming more that's a must-fix.
So were assuming genereally that a integer cannot be read/written
atomatically?
Or was that only relating to the actualy signal.h typedef?
Andres
On Wed, 2010-01-20 at 06:14 +0100, Andres Freund wrote:
Full resolution patch attached for Startup process waits on buffer pins.
Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.
Wouldnt it be more foolproof to also loop around sending the FATAL? Not that
its likely but...
More foolproof and much less accurate. The Startup process doesn't know
who is holding the buffer pin that blocks it, so it could not target a
FATAL.
From HoldingBufferPinThatDelaysRecovery youre calling
GetStartupBufferPinWaitBufId - that sounds a bit dangerous because that one is
acquiring a spinlock which can also get taken at other places. Its not the
most likely scenario, but it would certainly be annoying to debug.
Spinlock. It isn't held for long in any situation. What problem do you
foresee?
Is there any supported platform with sizeof(sig_atomic_t) <4 - I would doubt
so? If not the locking in GetStartupBufferPinWaitBufId and
SetStartupBufferPinWaitBufId shouldnt be needed?
I prefer spinlocking.
Same issue issue (and more likely to trigger) exists with CheckStandbyTimeout-
SendRecoveryConflictWithBufferPin->CancelDBBackends
I don't see an issue.
--
Simon Riggs www.2ndQuadrant.com
On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Is there any supported platform with sizeof(sig_atomic_t) <4 - I would
doubt so?Er ... what? I believe there are live platforms with sig_atomic_t = char.
If we're assuming more that's a must-fix.
The reason I have asked is that the code is doing things like:
/*
* Used by backends when they receive a request to check for buffer pin waits.
*/
int
GetStartupBufferPinWaitBufId(void)
{
int bufid;
/* use volatile pointer to prevent code rearrangement */
volatile PROC_HDR *procglobal = ProcGlobal;
SpinLockAcquire(ProcStructLock);
bufid = procglobal->startupBufferPinWaitBufId;
SpinLockRelease(ProcStructLock);
return bufid;
}
or similar things with LWLockAcquire in a signal handler which strikes me as a
not that good idea. As at least on x86 reading an integer is atomic the above
spinlock is pointless. My cross arch experience is barely existing, so...
Andres
On Wednesday 20 January 2010 10:40:10 Simon Riggs wrote:
On Wed, 2010-01-20 at 06:14 +0100, Andres Freund wrote:
Full resolution patch attached for Startup process waits on buffer
pins.Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.Wouldnt it be more foolproof to also loop around sending the FATAL? Not
that its likely but...More foolproof and much less accurate. The Startup process doesn't know
who is holding the buffer pin that blocks it, so it could not target a
FATAL.From HoldingBufferPinThatDelaysRecovery youre calling
GetStartupBufferPinWaitBufId - that sounds a bit dangerous because that
one is acquiring a spinlock which can also get taken at other places.
Its not the most likely scenario, but it would certainly be annoying to
debug.Spinlock. It isn't held for long in any situation. What problem do you
foresee?
If any backend is signalled while currently holding the ProcStructLock there
is a basically unrecoverable deadlock - its not likely but possible.
Is there any supported platform with sizeof(sig_atomic_t) <4 - I would
doubt so? If not the locking in GetStartupBufferPinWaitBufId and
SetStartupBufferPinWaitBufId shouldnt be needed?I prefer spinlocking.
Well, its deadlock land taking the same lock inside and outside of a signal
handler...
Andres
On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:
LWLockAcquire
I'm using spinlocks, not lwlocks.
--
Simon Riggs www.2ndQuadrant.com
On Wednesday 20 January 2010 10:52:24 Simon Riggs wrote:
On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:
LWLockAcquire
I'm using spinlocks, not lwlocks.
CancelDBBackends which is used in SendRecoveryConflictWithBufferPin which in
turn used by CheckStandbyTimeout triggered by SIGALRM acquires the lwlock.
Now that case is a bit less dangerous because you would have to interrupt
yourself to trigger a deadlock there because the code sleeps soon after
setting up the handler.
If ever two SIGALRM occur consecutive there is a problem.
Andres
On Wed, 2010-01-20 at 11:04 +0100, Andres Freund wrote:
On Wednesday 20 January 2010 10:52:24 Simon Riggs wrote:
On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:
LWLockAcquire
I'm using spinlocks, not lwlocks.
CancelDBBackends which is used in SendRecoveryConflictWithBufferPin which in
turn used by CheckStandbyTimeout triggered by SIGALRM acquires the lwlock.
Those are used in similar ways to deadlock detection.
Now that case is a bit less dangerous because you would have to interrupt
yourself to trigger a deadlock there because the code sleeps soon after
setting up the handler.
If ever two SIGALRM occur consecutive there is a problem.
I'll protect against subsequent calls.
--
Simon Riggs www.2ndQuadrant.com
On Wednesday 20 January 2010 11:33:05 Simon Riggs wrote:
On Wed, 2010-01-20 at 11:04 +0100, Andres Freund wrote:
On Wednesday 20 January 2010 10:52:24 Simon Riggs wrote:
On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:
LWLockAcquire
I'm using spinlocks, not lwlocks.
CancelDBBackends which is used in SendRecoveryConflictWithBufferPin which
in turn used by CheckStandbyTimeout triggered by SIGALRM acquires the
lwlock.Those are used in similar ways to deadlock detection.
But only if
ImmediateInterruptOK && InterruptHoldoffCount == 0 && CritSectionCount == 0 -
which is not the case with HoldingBufferPinThatDelaysRecovery.
Andres
Andres Freund <andres@anarazel.de> writes:
On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:
Er ... what? I believe there are live platforms with sig_atomic_t = char.
If we're assuming more that's a must-fix.
The reason I have asked is that the code is doing things like:
[ grabbing a spinlock to read a single integer ]
Yes, I think we probably actually need that. The problem is not so
much whether the read is an atomic operation as whether you can rely
on getting an up-to-date value. On multiprocessors with weak memory
ordering you need some type of "sync" instruction to be sure you will
see a value that was recently written by another processor. Currently,
we embed such instructions in the spinlock acquire/release code.
There's been some discussion of exposing memory sync independently
of lock acquisition; perhaps that would be enough here, but I haven't
looked at the surrounding logic enough to say.
My complaint at the top was responding to the idea that someone might
be supposing the specific type sig_atomic_t was at least as wide as
int. That's a different matter altogether. We do assume in some places
that we can read or write the specific type TransactionId indivisibly,
but we don't try to declare it as sig_atomic_t.
or similar things with LWLockAcquire in a signal handler
[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?
regards, tom lane
On Wednesday 20 January 2010 17:30:04 Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:
Er ... what? I believe there are live platforms with sig_atomic_t =
char. If we're assuming more that's a must-fix.The reason I have asked is that the code is doing things like:
[ grabbing a spinlock to read a single integer ]Yes, I think we probably actually need that. The problem is not so
much whether the read is an atomic operation as whether you can rely
on getting an up-to-date value. On multiprocessors with weak memory
ordering you need some type of "sync" instruction to be sure you will
see a value that was recently written by another processor. Currently,
we embed such instructions in the spinlock acquire/release code.
There's been some discussion of exposing memory sync independently
of lock acquisition; perhaps that would be enough here, but I haven't
looked at the surrounding logic enough to say.
I think it should be enough.
I realize its way too late in the cycle for that, but why dont we start using
some library for easy cross platform atomic ops? I think libatomic or such
should support the required platforms.
My complaint at the top was responding to the idea that someone might
be supposing the specific type sig_atomic_t was at least as wide as
int. That's a different matter altogether. We do assume in some places
that we can read or write the specific type TransactionId indivisibly,
but we don't try to declare it as sig_atomic_t.
So we already assume that? Fine.
(yes, the sig_atomic_t was a sidetrack - I had memorized it wrongly as "the
biggest value that can be read/written atomically which is *clearly* wrong)
or similar things with LWLockAcquire in a signal handler
[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?
Yes the patch does that at two places. Thats what I was complaining about and
what triggered my sig_atomic_t question because of the above explained
misunderstanding.
Andres
Andres Freund <andres@anarazel.de> writes:
I realize its way too late in the cycle for that, but why dont we start using
some library for easy cross platform atomic ops?
(1) there probably isn't one that does exactly what we want, works
everywhere, and has the right license;
(2) what actual gain would we get? We've already done the work.
[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?
Yes the patch does that at two places.
That's a must-fix.
regards, tom lane
Hi Tom, Hi Simon,
On Wednesday 20 January 2010 17:59:36 Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I realize its way too late in the cycle for that, but why dont we start
using some library for easy cross platform atomic ops?(1) there probably isn't one that does exactly what we want, works
everywhere, and has the right license;
(2) what actual gain would we get? We've already done the work.
That there might be some other instructions were interested in?
Like really atomic increment?
[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?Yes the patch does that at two places.
That's a must-fix.
Its code intended to fix a existing problem not already comitted code. But
otherwise I definitely agree.
Andres
Andres Freund wrote:
On Wednesday 20 January 2010 17:59:36 Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I realize its way too late in the cycle for that, but why dont we start
using some library for easy cross platform atomic ops?(1) there probably isn't one that does exactly what we want, works
everywhere, and has the right license;
(2) what actual gain would we get? We've already done the work.That there might be some other instructions were interested in?
Like really atomic increment?
This reminds me of something I've been pondering for some time:
Streaming Replication introduces a few places with a polling pattern
like this (in pseudocode):
while()
{
/* Check if variable in shared has advanced beoynd X */
SpinLockAcquire()
localvar = sharedvar;
SpinLockRelease()
if (localvar > X)
break;
/* Not yet. Sleep
pg_usleep(100);
}
For example, startup process polls like that to wait for walreceiver to
write & flush new WAL to be replayed. And in master, walsender polls
like that for new WAL to be generated, so that it can be sent to
standby. Hot standby also has a polling loop where it waits for a
transaction a transaction to die, though I'm not sure if that can be fit
into the same model.
That's OK for asynchronous replication, but unacceptable for synchronous
mode.
It would be nice to have a new synchronization primitive for that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Streaming Replication introduces a few places with a polling pattern
like this (in pseudocode):
while()
{
/* Check if variable in shared has advanced beoynd X */
SpinLockAcquire()
localvar = sharedvar;
SpinLockRelease()
if (localvar > X)
break;
/* Not yet. Sleep
pg_usleep(100);
}
I trust there's a CHECK_FOR_INTERRUPTS in there ...
It would be nice to have a new synchronization primitive for that.
Maybe. The lock, the variable, the comparison operation, and the sleep
time all seem rather specific to each application. Not sure that it'd
really buy much to try to turn it into a generic subroutine.
regards, tom lane
On Wed, 2010-01-20 at 20:00 +0200, Heikki Linnakangas wrote:
Hot standby also has a polling loop where it waits for a
transaction a transaction to die, though I'm not sure if that can be
fit into the same model
I prefer that in the context of HS because the Startup process is
waiting for things to die. Given that their death may not be handled
sweetly, I would not wish to rely on that to wake Startup.
In the other two cases you mention all processes are working together
normally and we aren't expecting the other processes to die.
--
Simon Riggs www.2ndQuadrant.com
On Wed, 2010-01-20 at 17:40 +0100, Andres Freund wrote:
or similar things with LWLockAcquire in a signal handler
[ grows visibly pale ] *Please* tell me we are not trying to take
locks in a signal handler. What happens if it interrupts code that
is already holding that lock?
Yes the patch does that at two places.
I think it would be more sensible to discuss specific code and issues,
rather than have general discussions about various horrors.
You've already pointed out that I need to prevent multiple sigalrm
interrupts using boolean flags; I've already said that I would do that.
The use of locks themselves are clearly not a problem, since the
existing sigalrm handler takes LWlocks for deadlock detection. The
problem is just about being called multiple times.
The code in HoldingBufferPinThatDelaysRecovery() also needs protection
against being interrupted multiple times, but we should note that a
second signal of that type is not going to arrive from anywhere inside
the server and requires an explicit user action. The locking isn't
strictly necessary since the value is only read when the only process
that ever writes that value is sleeping on a semaphore. The single
integer value can always be read atomically anyway.
So I will remove the locking in XXXStartupBufferPinWaitBufId(), add in
the booleans and we're done.
--
Simon Riggs www.2ndQuadrant.com
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Streaming Replication introduces a few places with a polling pattern
like this (in pseudocode):while()
{
/* Check if variable in shared has advanced beoynd X */
SpinLockAcquire()
localvar = sharedvar;
SpinLockRelease()
if (localvar > X)
break;/* Not yet. Sleep
pg_usleep(100);
}I trust there's a CHECK_FOR_INTERRUPTS in there ...
It would be nice to have a new synchronization primitive for that.
Maybe. The lock, the variable, the comparison operation, and the sleep
time all seem rather specific to each application. Not sure that it'd
really buy much to try to turn it into a generic subroutine.
My point is that we should replace such polling loops with something
non-polling, using wait/signal or semaphores or something. That gets
quite a bit more complex. You'd probably still have the loop, but
instead of pg_usleep() you'd call some new primitive function that waits
until the shared variable changes.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
My point is that we should replace such polling loops with something
non-polling, using wait/signal or semaphores or something. That gets
quite a bit more complex. You'd probably still have the loop, but
instead of pg_usleep() you'd call some new primitive function that waits
until the shared variable changes.
Maybe someday --- it's certainly not something we need to mess with for
8.5. As Simon comments, getting it to work nicely in the face of corner
cases (like processes dying unexpectedly) could be a lot of work.
regards, tom lane
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
My point is that we should replace such polling loops with something
non-polling, using wait/signal or semaphores or something. That gets
quite a bit more complex. You'd probably still have the loop, but
instead of pg_usleep() you'd call some new primitive function that waits
until the shared variable changes.Maybe someday --- it's certainly not something we need to mess with for
8.5. As Simon comments, getting it to work nicely in the face of corner
cases (like processes dying unexpectedly) could be a lot of work.
Agreed, polling is good enough for 8.5.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Jan 20, 2010 at 09:22:49PM +0200, Heikki Linnakangas wrote:
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
My point is that we should replace such polling loops with something
non-polling, using wait/signal or semaphores or something. That gets
quite a bit more complex. You'd probably still have the loop, but
instead of pg_usleep() you'd call some new primitive function that waits
until the shared variable changes.Maybe someday --- it's certainly not something we need to mess with for
8.5. As Simon comments, getting it to work nicely in the face of corner
cases (like processes dying unexpectedly) could be a lot of work.Agreed, polling is good enough for 8.5.
Is this a TODO yet?
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes:
On Wed, Jan 20, 2010 at 09:22:49PM +0200, Heikki Linnakangas wrote:
My point is that we should replace such polling loops with something
non-polling, using wait/signal or semaphores or something.
Is this a TODO yet?
It hardly seems concrete enough for a TODO item.
regards, tom lane
Deadlock bug was prevented by stop-gap measure in December commit.
Full resolution patch attached for Startup process waits on buffer pins.
Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.Reviews welcome, otherwise commit at end of week.
I think the patch has two problems.
* disable_standby_sig_alarm() does not clear standby_timeout_active flag
when it succeeds in disabling the alarm.
* Assertion check in HoldingBufferPinThatDelaysRecovery() can fail
with following scenario.
1. Two transactions, xact A and xact B, are running in a HotStandby server.
2. Xact A holds a pin on buffer X.
3. Startup process calls LockBufferForCleanup() for buffer X,
sets ProcGlobal->startupBufferPinWaitBufId = X,
sends PROCSIG_RECOVERY_CONFLICT_BUFFERPIN signal to both transactions,
and sleeps.
4. Xact A handles the signal,
aborts itself,
releases the pin on buffer X,
and awake startup process.
5. Startup process wakes up
and sets ProcGlobal->startupBufferPinWaitBufId = -1.
6. Xact B handles the signal,
checks ProcGlobal->startupBufferPinWaitBufId,
and fails in the assertion check in HoldingBufferPinThatDelaysRecovery().
regards,
--
Hiroyuki YAMADA
Kokolink Corporation
yamada@kokolink.net
On Thu, 2010-01-21 at 18:01 +0900, Hiroyuki Yamada wrote:
I think the patch has two problems.
* disable_standby_sig_alarm() does not clear standby_timeout_active flag
when it succeeds in disabling the alarm.
Ah, thanks.
* Assertion check in HoldingBufferPinThatDelaysRecovery() can fail
with following scenario.
Agreed. Will remove Assert() because of race conditions.
Thanks,
--
Simon Riggs www.2ndQuadrant.com
On Thu, 2010-01-21 at 18:01 +0900, Hiroyuki Yamada wrote:
Deadlock bug was prevented by stop-gap measure in December commit.
Full resolution patch attached for Startup process waits on buffer pins.
Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.Reviews welcome, otherwise commit at end of week.
I think the patch has two problems.
* disable_standby_sig_alarm() does not clear standby_timeout_active flag
when it succeeds in disabling the alarm.* Assertion check in HoldingBufferPinThatDelaysRecovery() can fail
with following scenario.
Updated patch, including changes from Andres and Hiroyuki.
--
Simon Riggs www.2ndQuadrant.com
Attachments:
sigalrm_startupbufferpin.v2.patchtext/x-patch; charset=UTF-8; name=sigalrm_startupbufferpin.v2.patchDownload
*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
***************
*** 2399,2405 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</listitem>
<listitem>
<para>
! Waiting to acquire buffer cleanup locks (for which there is no time out)
</para>
</listitem>
<listitem>
--- 2399,2405 ----
</listitem>
<listitem>
<para>
! Waiting to acquire buffer cleanup locks
</para>
</listitem>
<listitem>
***************
*** 2536,2546 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
Three-way deadlocks are possible between AccessExclusiveLocks arriving from
the primary, cleanup WAL records that require buffer cleanup locks and
user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks
! are currently resolved by the cancellation of user processes that would
! need to wait on a lock. This is heavy-handed and generates more query
! cancellations than we need to, though does remove the possibility of deadlock.
! This behaviour is expected to improve substantially for the main release
! version of 8.5.
</para>
<para>
--- 2536,2542 ----
Three-way deadlocks are possible between AccessExclusiveLocks arriving from
the primary, cleanup WAL records that require buffer cleanup locks and
user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks
! are resolved by time-out when we exceed <varname>max_standby_delay</>.
</para>
<para>
***************
*** 2630,2640 **** LOG: database system is ready to accept read only connections
<varname>max_standby_delay</> or even set it to zero, though that is a
very aggressive setting. If the standby server is tasked as an additional
server for decision support queries then it may be acceptable to set this
! to a value of many hours (in seconds). It is also possible to set
! <varname>max_standby_delay</> to -1 which means wait forever for queries
! to complete, if there are conflicts; this will be useful when performing
! an archive recovery from a backup.
! </para>
<para>
Transaction status "hint bits" written on primary are not WAL-logged,
--- 2626,2632 ----
<varname>max_standby_delay</> or even set it to zero, though that is a
very aggressive setting. If the standby server is tasked as an additional
server for decision support queries then it may be acceptable to set this
! to a value of many hours (in seconds).
<para>
Transaction status "hint bits" written on primary are not WAL-logged,
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1825,1838 **** archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"' # Windows
<listitem>
<para>
When server acts as a standby, this parameter specifies a wait policy
! for queries that conflict with incoming data changes. Valid settings
! are -1, meaning wait forever, or a wait time of 0 or more seconds.
! If a conflict should occur the server will delay up to this
! amount before it begins trying to resolve things less amicably, as
described in <xref linkend="hot-standby-conflict">. Typically,
this parameter makes sense only during replication, so when
! performing an archive recovery to recover from data loss a
! parameter setting of 0 is recommended. The default is 30 seconds.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
--- 1825,1839 ----
<listitem>
<para>
When server acts as a standby, this parameter specifies a wait policy
! for queries that conflict with data changes being replayed by recovery.
! If a conflict should occur the server will delay up to this number
! of seconds before it begins trying to resolve things less amicably, as
described in <xref linkend="hot-standby-conflict">. Typically,
this parameter makes sense only during replication, so when
! performing an archive recovery to recover from data loss a very high
! parameter setting is recommended. The default is 30 seconds.
! There is no wait-forever setting because of the potential for deadlock
! which that setting would introduce.
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 8803,8811 **** StartupProcessMain(void)
*/
pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
pqsignal(SIGINT, SIG_IGN); /* ignore query cancel */
! pqsignal(SIGTERM, StartupProcShutdownHandler); /* request shutdown */
! pqsignal(SIGQUIT, startupproc_quickdie); /* hard crash time */
! pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, SIG_IGN);
pqsignal(SIGUSR2, SIG_IGN);
--- 8803,8814 ----
*/
pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
pqsignal(SIGINT, SIG_IGN); /* ignore query cancel */
! pqsignal(SIGTERM, StartupProcShutdownHandler); /* request shutdown */
! pqsignal(SIGQUIT, startupproc_quickdie); /* hard crash time */
! if (XLogRequestRecoveryConnections)
! pqsignal(SIGALRM, handle_standby_sig_alarm); /* ignored unless InHotStandby */
! else
! pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, SIG_IGN);
pqsignal(SIGUSR2, SIG_IGN);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 44,49 ****
--- 44,50 ----
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+ #include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner.h"
***************
*** 2417,2430 **** LockBufferForCleanup(Buffer buffer)
PinCountWaitBuf = bufHdr;
UnlockBufHdr(bufHdr);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/* Wait to be signaled by UnpinBuffer() */
! ProcWaitForSignal();
PinCountWaitBuf = NULL;
/* Loop back and try again */
}
}
/*
* ConditionalLockBufferForCleanup - as above, but don't wait to get the lock
*
* We won't loop, but just check once to see if the pin count is OK. If
--- 2418,2466 ----
PinCountWaitBuf = bufHdr;
UnlockBufHdr(bufHdr);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
/* Wait to be signaled by UnpinBuffer() */
! if (InHotStandby)
! {
! /* Share the bufid that Startup process waits on */
! SetStartupBufferPinWaitBufId(buffer - 1);
! /* Set alarm and then wait to be signaled by UnpinBuffer() */
! ResolveRecoveryConflictWithBufferPin();
! SetStartupBufferPinWaitBufId(-1);
! }
! else
! ProcWaitForSignal();
!
PinCountWaitBuf = NULL;
/* Loop back and try again */
}
}
/*
+ * Check called from RecoveryConflictInterrupt handler when Startup
+ * process requests cancelation of all pin holders that are blocking it.
+ */
+ bool
+ HoldingBufferPinThatDelaysRecovery(void)
+ {
+ int bufid = GetStartupBufferPinWaitBufId();
+
+ /*
+ * If we get woken slowly then it's possible that the Startup process
+ * was already woken by other backends before we got here. Also possible
+ * that we get here by multiple interrupts or interrupts at inappropriate
+ * times, so make sure we do nothing if the bufid is not set.
+ */
+ if (bufid < 0)
+ return false;
+
+ if (PrivateRefCount[bufid] > 0)
+ return true;
+
+ return false;
+ }
+
+ /*
* ConditionalLockBufferForCleanup - as above, but don't wait to get the lock
*
* We won't loop, but just check once to see if the pin count is OK. If
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 1680,1694 **** GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
* latestCompletedXid since doing so would be a performance issue during
* normal running, so we check it essentially for free on the standby.
*
! * If dbOid is valid we skip backends attached to other databases. Some
! * callers choose to skipExistingConflicts.
*
* Be careful to *not* pfree the result from this function. We reuse
* this array sufficiently often that we use malloc for the result.
*/
VirtualTransactionId *
! GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid,
! bool skipExistingConflicts)
{
static VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
--- 1680,1692 ----
* latestCompletedXid since doing so would be a performance issue during
* normal running, so we check it essentially for free on the standby.
*
! * If dbOid is valid we skip backends attached to other databases.
*
* Be careful to *not* pfree the result from this function. We reuse
* this array sufficiently often that we use malloc for the result.
*/
VirtualTransactionId *
! GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
{
static VirtualTransactionId *vxids;
ProcArrayStruct *arrayP = procArray;
***************
*** 1727,1735 **** GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid,
if (proc->pid == 0)
continue;
- if (skipExistingConflicts && proc->recoveryConflictPending)
- continue;
-
if (!OidIsValid(dbOid) ||
proc->databaseId == dbOid)
{
--- 1725,1730 ----
***************
*** 1886,1892 **** CountDBBackends(Oid databaseid)
* CancelDBBackends --- cancel backends that are using specified database
*/
void
! CancelDBBackends(Oid databaseid)
{
ProcArrayStruct *arrayP = procArray;
int index;
--- 1881,1887 ----
* CancelDBBackends --- cancel backends that are using specified database
*/
void
! CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
{
ProcArrayStruct *arrayP = procArray;
int index;
***************
*** 1899,1911 **** CancelDBBackends(Oid databaseid)
{
volatile PGPROC *proc = arrayP->procs[index];
! if (proc->databaseId == databaseid)
{
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
! proc->recoveryConflictPending = true;
pid = proc->pid;
if (pid != 0)
{
--- 1894,1906 ----
{
volatile PGPROC *proc = arrayP->procs[index];
! if (databaseid == InvalidOid || proc->databaseId == databaseid)
{
VirtualTransactionId procvxid;
GET_VXID_FROM_PGPROC(procvxid, *proc);
! proc->recoveryConflictPending = conflictPending;
pid = proc->pid;
if (pid != 0)
{
***************
*** 1913,1920 **** CancelDBBackends(Oid databaseid)
* Kill the pid if it's still here. If not, that's what we wanted
* so ignore any errors.
*/
! (void) SendProcSignal(pid, PROCSIG_RECOVERY_CONFLICT_DATABASE,
! procvxid.backendId);
}
}
}
--- 1908,1914 ----
* Kill the pid if it's still here. If not, that's what we wanted
* so ignore any errors.
*/
! (void) SendProcSignal(pid, sigmode, procvxid.backendId);
}
}
}
*** a/src/backend/storage/ipc/procsignal.c
--- b/src/backend/storage/ipc/procsignal.c
***************
*** 272,276 **** procsignal_sigusr1_handler(SIGNAL_ARGS)
--- 272,279 ----
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+ if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
+ RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+
errno = save_errno;
}
*** a/src/backend/storage/ipc/standby.c
--- b/src/backend/storage/ipc/standby.c
***************
*** 126,135 **** WaitExceedsMaxStandbyDelay(void)
long delay_secs;
int delay_usecs;
- /* max_standby_delay = -1 means wait forever, if necessary */
- if (MaxStandbyDelay < 0)
- return false;
-
/* Are we past max_standby_delay? */
TimestampDifference(GetLatestXLogTime(), GetCurrentTimestamp(),
&delay_secs, &delay_usecs);
--- 126,131 ----
***************
*** 241,248 **** ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid)
VirtualTransactionId *backends;
backends = GetConflictingVirtualXIDs(latestRemovedXid,
! InvalidOid,
! true);
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
--- 237,243 ----
VirtualTransactionId *backends;
backends = GetConflictingVirtualXIDs(latestRemovedXid,
! InvalidOid);
ResolveRecoveryConflictWithVirtualXIDs(backends,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
***************
*** 273,280 **** ResolveRecoveryConflictWithTablespace(Oid tsid)
* non-transactional.
*/
temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
! InvalidOid,
! false);
ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
}
--- 268,274 ----
* non-transactional.
*/
temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
! InvalidOid);
ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
}
***************
*** 295,301 **** ResolveRecoveryConflictWithDatabase(Oid dbid)
*/
while (CountDBBackends(dbid) > 0)
{
! CancelDBBackends(dbid);
/*
* Wait awhile for them to die so that we avoid flooding an
--- 289,295 ----
*/
while (CountDBBackends(dbid) > 0)
{
! CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, true);
/*
* Wait awhile for them to die so that we avoid flooding an
***************
*** 331,338 **** ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
else
{
backends = GetConflictingVirtualXIDs(InvalidTransactionId,
! InvalidOid,
! true);
report_memory_error = true;
}
--- 325,331 ----
else
{
backends = GetConflictingVirtualXIDs(InvalidTransactionId,
! InvalidOid);
report_memory_error = true;
}
***************
*** 346,351 **** ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
--- 339,451 ----
}
/*
+ * ResolveRecoveryConflictWithBufferPin is called from LockBufferForCleanup()
+ * to resolve conflicts with other backends holding buffer pins.
+ *
+ * We either resolve conflicts immediately or set a SIGALRM to wake us at
+ * the limit of our patience. The sleep in LockBufferForCleanup() is
+ * performed here, for code clarity.
+ *
+ * Resolve conflict by sending a SIGUSR1 reason to all backends to check if
+ * they hold one of the buffer pins that is blocking Startup process. If so,
+ * backends will take an appropriate error action, ERROR or FATAL.
+ *
+ * A secondary purpose of this is to avoid deadlocks that might occur between
+ * the Startup process and lock waiters. Deadlocks occur because if queries
+ * wait on a lock, that must be behind an AccessExclusiveLock, which can only
+ * be clared if the Startup process replays a transaction completion record.
+ * If Startup process is waiting then that is a deadlock. If we allowed a
+ * setting of max_standby_delay that meant "wait forever" we would then need
+ * special code to protect against deadlock. Such deadlocks are rare, so the
+ * code would be almost certainly buggy, so we avoid both long waits and
+ * deadlocks using the same mechanism.
+ */
+ void
+ ResolveRecoveryConflictWithBufferPin(void)
+ {
+ bool sig_alarm_enabled = false;
+
+ Assert(InHotStandby);
+
+ /*
+ * Signal immediately or set alarm for later.
+ */
+ if (MaxStandbyDelay == 0)
+ SendRecoveryConflictWithBufferPin();
+ else
+ {
+ TimestampTz now;
+ long standby_delay_secs; /* How far Startup process is lagging */
+ int standby_delay_usecs;
+
+ now = GetCurrentTimestamp();
+
+ /* Are we past max_standby_delay? */
+ TimestampDifference(GetLatestXLogTime(), now,
+ &standby_delay_secs, &standby_delay_usecs);
+
+ if (standby_delay_secs >= (long) MaxStandbyDelay)
+ SendRecoveryConflictWithBufferPin();
+ else
+ {
+ TimestampTz fin_time; /* Expected wake-up time by timer */
+ long timer_delay_secs; /* Amount of time we set timer for */
+ int timer_delay_usecs = 0;
+
+ /*
+ * How much longer we should wait?
+ */
+ timer_delay_secs = MaxStandbyDelay - standby_delay_secs;
+ if (standby_delay_usecs > 0)
+ {
+ timer_delay_secs -= 1;
+ timer_delay_usecs = 1000000 - standby_delay_usecs;
+ }
+
+ /*
+ * It's possible that the difference is less than a microsecond;
+ * ensure we don't cancel, rather than set, the interrupt.
+ */
+ if (timer_delay_secs == 0 && timer_delay_usecs == 0)
+ timer_delay_usecs = 1;
+
+ /*
+ * When is the finish time? We recheck this if we are woken early.
+ */
+ fin_time = TimestampTzPlusMilliseconds(now,
+ (timer_delay_secs * 1000) +
+ (timer_delay_usecs / 1000));
+
+ if (enable_standby_sig_alarm(timer_delay_secs, timer_delay_usecs, fin_time))
+ sig_alarm_enabled = true;
+ else
+ elog(FATAL, "could not set timer for process wakeup");
+ }
+ }
+
+ /* Wait to be signaled by UnpinBuffer() */
+ ProcWaitForSignal();
+
+ if (sig_alarm_enabled)
+ {
+ if (!disable_standby_sig_alarm())
+ elog(FATAL, "could not disable timer for process wakeup");
+ }
+ }
+
+ void
+ SendRecoveryConflictWithBufferPin(void)
+ {
+ /*
+ * We send signal to all backends to ask them if they are holding
+ * the buffer pin which is delaying the Startup process. We must
+ * not set the conflict flag yet, since most backends will be innocent.
+ * Let the SIGUSR1 handling in each backend decide their own fate.
+ */
+ CancelDBBackends(InvalidOid, PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, false);
+ }
+
+ /*
* -----------------------------------------------------
* Locking in Recovery Mode
* -----------------------------------------------------
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
***************
*** 815,839 **** LockAcquireExtended(const LOCKTAG *locktag,
}
/*
- * In Hot Standby we abort the lock wait if Startup process is waiting
- * since this would result in a deadlock. The deadlock occurs because
- * if we are waiting it must be behind an AccessExclusiveLock, which
- * can only clear when a transaction completion record is replayed.
- * If Startup process is waiting we never will clear that lock, so to
- * wait for it just causes a deadlock.
- */
- if (RecoveryInProgress() && !InRecovery &&
- locktag->locktag_type == LOCKTAG_RELATION)
- {
- LWLockRelease(partitionLock);
- ereport(ERROR,
- (errcode(ERRCODE_T_R_DEADLOCK_DETECTED),
- errmsg("possible deadlock detected"),
- errdetail("process conflicts with recovery - please resubmit query later"),
- errdetail_log("process conflicts with recovery")));
- }
-
- /*
* Set bitmask of locks this process already holds on this object.
*/
MyProc->heldLocks = proclock->holdMask;
--- 815,820 ----
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 73,78 **** NON_EXEC_STATIC PGPROC *AuxiliaryProcs = NULL;
--- 73,79 ----
static LOCALLOCK *lockAwaited = NULL;
/* Mark these volatile because they can be changed by signal handler */
+ static volatile bool standby_timeout_active = false;
static volatile bool statement_timeout_active = false;
static volatile bool deadlock_timeout_active = false;
static volatile DeadLockState deadlock_state = DS_NOT_YET_CHECKED;
***************
*** 89,94 **** static void RemoveProcFromArray(int code, Datum arg);
--- 90,96 ----
static void ProcKill(int code, Datum arg);
static void AuxiliaryProcKill(int code, Datum arg);
static bool CheckStatementTimeout(void);
+ static bool CheckStandbyTimeout(void);
/*
***************
*** 107,112 **** ProcGlobalShmemSize(void)
--- 109,116 ----
size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
/* ProcStructLock */
size = add_size(size, sizeof(slock_t));
+ /* startupBufferPinWaitBufId */
+ size = add_size(size, sizeof(NBuffers));
return size;
}
***************
*** 487,497 **** PublishStartupProcessInformation(void)
--- 491,534 ----
procglobal->startupProc = MyProc;
procglobal->startupProcPid = MyProcPid;
+ procglobal->startupBufferPinWaitBufId = 0;
SpinLockRelease(ProcStructLock);
}
/*
+ * Used from bufgr to share the value of the buffer that Startup waits on,
+ * or to reset the value to "not waiting" (-1). This allows processing
+ * of recovery conflicts for buffer pins. Set is made before backends look
+ * at this value, so locking not required, especially since the set is
+ * an atomic integer set operation.
+ */
+ void
+ SetStartupBufferPinWaitBufId(int bufid)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile PROC_HDR *procglobal = ProcGlobal;
+
+ procglobal->startupBufferPinWaitBufId = bufid;
+ }
+
+ /*
+ * Used by backends when they receive a request to check for buffer pin waits.
+ */
+ int
+ GetStartupBufferPinWaitBufId(void)
+ {
+ int bufid;
+
+ /* use volatile pointer to prevent code rearrangement */
+ volatile PROC_HDR *procglobal = ProcGlobal;
+
+ bufid = procglobal->startupBufferPinWaitBufId;
+
+ return bufid;
+ }
+
+ /*
* Check whether there are at least N free PGPROC objects.
*
* Note: this is designed on the assumption that N will generally be small.
***************
*** 1542,1548 **** CheckStatementTimeout(void)
/*
! * Signal handler for SIGALRM
*
* Process deadlock check and/or statement timeout check, as needed.
* To avoid various edge cases, we must be careful to do nothing
--- 1579,1585 ----
/*
! * Signal handler for SIGALRM for normal user backends
*
* Process deadlock check and/or statement timeout check, as needed.
* To avoid various edge cases, we must be careful to do nothing
***************
*** 1565,1567 **** handle_sig_alarm(SIGNAL_ARGS)
--- 1602,1713 ----
errno = save_errno;
}
+
+ /*
+ * Signal handler for SIGALRM in Startup process
+ *
+ * To avoid various edge cases, we must be careful to do nothing
+ * when there is nothing to be done. We also need to be able to
+ * reschedule the timer interrupt if called before end of statement.
+ */
+ bool
+ enable_standby_sig_alarm(long delay_s, int delay_us, TimestampTz fin_time)
+ {
+ struct itimerval timeval;
+
+ Assert(delay_s >= 0 && delay_us >= 0);
+
+ statement_fin_time = fin_time;
+
+ standby_timeout_active = true;
+
+ MemSet(&timeval, 0, sizeof(struct itimerval));
+ timeval.it_value.tv_sec = delay_s;
+ timeval.it_value.tv_usec = delay_us;
+ if (setitimer(ITIMER_REAL, &timeval, NULL))
+ return false;
+ return true;
+ }
+
+ bool
+ disable_standby_sig_alarm(void)
+ {
+ /*
+ * Always disable the interrupt if it is active; this avoids being
+ * interrupted by the signal handler and thereby possibly getting
+ * confused.
+ *
+ * We will re-enable the interrupt if necessary in CheckStandbyTimeout.
+ */
+ if (standby_timeout_active)
+ {
+ struct itimerval timeval;
+
+ MemSet(&timeval, 0, sizeof(struct itimerval));
+ if (setitimer(ITIMER_REAL, &timeval, NULL))
+ {
+ standby_timeout_active = false;
+ return false;
+ }
+ }
+
+ standby_timeout_active = false;
+
+ return true;
+ }
+
+ /*
+ * CheckStandbyTimeout() runs unconditionally in the Startup process
+ * SIGALRM handler. Timers will only be set when InHotStandby.
+ * We simply ignore any signals unless the timer has been set.
+ */
+ static bool
+ CheckStandbyTimeout(void)
+ {
+ TimestampTz now;
+
+ standby_timeout_active = false;
+
+ now = GetCurrentTimestamp();
+
+ if (now >= statement_fin_time)
+ SendRecoveryConflictWithBufferPin();
+ else
+ {
+ /* Not time yet, so (re)schedule the interrupt */
+ long secs;
+ int usecs;
+ struct itimerval timeval;
+
+ TimestampDifference(now, statement_fin_time,
+ &secs, &usecs);
+
+ /*
+ * It's possible that the difference is less than a microsecond;
+ * ensure we don't cancel, rather than set, the interrupt.
+ */
+ if (secs == 0 && usecs == 0)
+ usecs = 1;
+
+ standby_timeout_active = true;
+
+ MemSet(&timeval, 0, sizeof(struct itimerval));
+ timeval.it_value.tv_sec = secs;
+ timeval.it_value.tv_usec = usecs;
+ if (setitimer(ITIMER_REAL, &timeval, NULL))
+ return false;
+ }
+
+ return true;
+ }
+
+ void
+ handle_standby_sig_alarm(SIGNAL_ARGS)
+ {
+ int save_errno = errno;
+
+ if (standby_timeout_active)
+ (void) CheckStandbyTimeout();
+
+ errno = save_errno;
+ }
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 2718,2723 **** RecoveryConflictInterrupt(ProcSignalReason reason)
--- 2718,2735 ----
{
switch (reason)
{
+ case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+ /*
+ * If we aren't blocking the Startup process there is
+ * nothing more to do.
+ */
+ if (!HoldingBufferPinThatDelaysRecovery())
+ return;
+
+ MyProc->recoveryConflictPending = true;
+
+ /* Intentional drop through to error handling */
+
case PROCSIG_RECOVERY_CONFLICT_LOCK:
case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 1383,1389 **** static struct config_int ConfigureNamesInt[] =
NULL
},
&MaxStandbyDelay,
! 30, -1, INT_MAX, NULL, NULL
},
{
--- 1383,1389 ----
NULL
},
&MaxStandbyDelay,
! 30, 0, INT_MAX, NULL, NULL
},
{
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 198,203 **** extern void LockBuffer(Buffer buffer, int mode);
--- 198,204 ----
extern bool ConditionalLockBuffer(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+ extern bool HoldingBufferPinThatDelaysRecovery(void);
extern void AbortBufferIO(void);
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 16,22 ****
#include "storage/lock.h"
#include "storage/pg_sema.h"
!
/*
* Each backend advertises up to PGPROC_MAX_CACHED_SUBXIDS TransactionIds
--- 16,22 ----
#include "storage/lock.h"
#include "storage/pg_sema.h"
! #include "utils/timestamp.h"
/*
* Each backend advertises up to PGPROC_MAX_CACHED_SUBXIDS TransactionIds
***************
*** 145,150 **** typedef struct PROC_HDR
--- 145,152 ----
/* The proc of the Startup process, since not in ProcArray */
PGPROC *startupProc;
int startupProcPid;
+ /* Buffer id of the buffer that Startup process waits for pin on */
+ int startupBufferPinWaitBufId;
} PROC_HDR;
/*
***************
*** 177,182 **** extern void InitProcessPhase2(void);
--- 179,186 ----
extern void InitAuxiliaryProcess(void);
extern void PublishStartupProcessInformation(void);
+ extern void SetStartupBufferPinWaitBufId(int bufid);
+ extern int GetStartupBufferPinWaitBufId(void);
extern bool HaveNFreeProcs(int n);
extern void ProcReleaseLocks(bool isCommit);
***************
*** 194,197 **** extern bool enable_sig_alarm(int delayms, bool is_statement_timeout);
--- 198,205 ----
extern bool disable_sig_alarm(bool is_statement_timeout);
extern void handle_sig_alarm(SIGNAL_ARGS);
+ extern bool enable_standby_sig_alarm(long delay_s, int delay_us, TimestampTz fin_time);
+ extern bool disable_standby_sig_alarm(void);
+ extern void handle_standby_sig_alarm(SIGNAL_ARGS);
+
#endif /* PROC_H */
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
***************
*** 57,69 **** extern bool IsBackendPid(int pid);
extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
bool excludeXmin0, bool allDbs, int excludeVacuum,
int *nvxids);
! extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin,
! Oid dbOid, bool skipExistingConflicts);
extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
extern int CountActiveBackends(void);
extern int CountDBBackends(Oid databaseid);
! extern void CancelDBBackends(Oid databaseid);
extern int CountUserBackends(Oid roleid);
extern bool CountOtherDBBackends(Oid databaseId,
int *nbackends, int *nprepared);
--- 57,68 ----
extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
bool excludeXmin0, bool allDbs, int excludeVacuum,
int *nvxids);
! extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid);
extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
extern int CountActiveBackends(void);
extern int CountDBBackends(Oid databaseid);
! extern void CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending);
extern int CountUserBackends(Oid roleid);
extern bool CountOtherDBBackends(Oid databaseId,
int *nbackends, int *nprepared);
*** a/src/include/storage/procsignal.h
--- b/src/include/storage/procsignal.h
***************
*** 37,42 **** typedef enum
--- 37,43 ----
PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
PROCSIG_RECOVERY_CONFLICT_LOCK,
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
+ PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
NUM_PROCSIGNALS /* Must be last! */
} ProcSignalReason;
*** a/src/include/storage/standby.h
--- b/src/include/storage/standby.h
***************
*** 19,30 ****
extern int vacuum_defer_cleanup_age;
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
! extern void InitRecoveryTransactionEnvironment(void);
! extern void ShutdownRecoveryTransactionEnvironment(void);
/*
* Standby Rmgr (RM_STANDBY_ID)
--- 19,33 ----
extern int vacuum_defer_cleanup_age;
+ extern void InitRecoveryTransactionEnvironment(void);
+ extern void ShutdownRecoveryTransactionEnvironment(void);
+
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
! extern void ResolveRecoveryConflictWithBufferPin(void);
! extern void SendRecoveryConflictWithBufferPin(void);
/*
* Standby Rmgr (RM_STANDBY_ID)
On Wednesday 20 January 2010 17:59:36 Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I realize its way too late in the cycle for that, but why dont we start
using some library for easy cross platform atomic ops?(1) there probably isn't one that does exactly what we want, works
everywhere, and has the right license;
(2) what actual gain would we get? We've already done the work.
Another thing would be to have a simple rmb() wmb() instruction...
Andres