Race condition in HEAD, possibly due to PGPROC splitup

Started by Tom Laneabout 14 years ago9 messages
#1Tom Lane
tgl@sss.pgh.pa.us

If you add this Assert to lock.c:

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3ba4671..d9c15e0 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
*************** GetRunningTransactionLocks(int *nlocks)
*** 3195,3200 ****
--- 3195,3202 ----
  			accessExclusiveLocks[index].dbOid = lock->tag.locktag_field1;
  			accessExclusiveLocks[index].relOid = lock->tag.locktag_field2;
+ 			Assert(TransactionIdIsNormal(accessExclusiveLocks[index].xid));
+ 
  			index++;
  		}
  	}

then set wal_level = hot_standby, and run the regression tests
repeatedly, the Assert will trigger eventually --- for me, it happens
within a dozen or so parallel iterations, or rather longer if I run
the tests serial style. Stack trace is unsurprising, since AFAIK this
is only called in the checkpointer:

#2 0x000000000073461d in ExceptionalCondition (
conditionName=<value optimized out>, errorType=<value optimized out>,
fileName=<value optimized out>, lineNumber=<value optimized out>)
at assert.c:57
#3 0x000000000065eca1 in GetRunningTransactionLocks (nlocks=0x7fffa997de8c)
at lock.c:3198
#4 0x00000000006582b8 in LogStandbySnapshot (nextXid=0x7fffa997dee0)
at standby.c:835
#5 0x00000000004b0b97 in CreateCheckPoint (flags=32) at xlog.c:7761
#6 0x000000000062bf92 in CheckpointerMain () at checkpointer.c:488
#7 0x00000000004cf465 in AuxiliaryProcessMain (argc=2, argv=0x7fffa997e110)
at bootstrap.c:424
#8 0x00000000006261f5 in StartChildProcess (type=CheckpointerProcess)
at postmaster.c:4487

The actual value of the bogus xid (which was pulled from
allPgXact[proc->pgprocno]->xid just above here) is zero. What I believe
is happening is that somebody is clearing his pgxact->xid entry
asynchronously to GetRunningTransactionLocks, and since that clearly
oughta be impossible, something is broken.

Without the added assert, you'd only notice this if you were running a
standby slave --- the zero xid results in an assert failure in WAL
replay on the slave end, which is how I found out about this to start
with. But since we've not heard reports of such before, I suspect that
this is a recently introduced bug; and personally I'd bet money that it
was the PGXACT patch that broke it.

I have other things to do than look into this right now myself.

regards, tom lane

#2Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#1)
Re: Race condition in HEAD, possibly due to PGPROC splitup

On Wed, Dec 14, 2011 at 4:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Without the added assert, you'd only notice this if you were running a
standby slave --- the zero xid results in an assert failure in WAL
replay on the slave end, which is how I found out about this to start
with.  But since we've not heard reports of such before, I suspect that
this is a recently introduced bug; and personally I'd bet money that it
was the PGXACT patch that broke it.

I can reproduce this and will look at it in detail.

BTW, on an unrelated note, I was looking at the way ShmemInitStruct()
is used. It internally uses ShmemAlloc to allocate from shared memory.
ShmemAlloc always MAXALIGN the requested size. But while calculating
the total shared memory requirement, we don't always MAXALIGN
individual structure sizes. One example is KnownAssignedXidsValid, but
I am sure there are other places where the originally computed and the
requested sizes could be different because of alignment.

I wonder if we are just lucky not to hit shared memory size mismatch,
may be because we round up to the block size at the end.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

#3Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Pavan Deolasee (#2)
Re: Race condition in HEAD, possibly due to PGPROC splitup

On Wed, Dec 14, 2011 at 12:20 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

On Wed, Dec 14, 2011 at 4:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Without the added assert, you'd only notice this if you were running a
standby slave --- the zero xid results in an assert failure in WAL
replay on the slave end, which is how I found out about this to start
with.  But since we've not heard reports of such before, I suspect that
this is a recently introduced bug; and personally I'd bet money that it
was the PGXACT patch that broke it.

I can reproduce this and will look at it in detail.

Looking at CommitTransaction(), it seems quite clear to me that we
call ProcArrayEndTransaction() before releasing the locks held by the
transaction. So its quite possible that when
GetRunningTransactionLocks goes through the list of currently held
locks, the pgxact->xid is already cleared. This seems to a old bug to
me and not related to PGXACT work.

In fact, I can force the assertion failure by braking into gdb and
pausing the process running the following statements, right after it
clears the xid by calling ProcArrayEndTransaction(). At that point, if
I hit CHECKPOINT from another session, the assertion fails.

Session 1:
BEGIN;
LOCK TABLE test IN ACCESS EXCLUSIVE MODE;
COMMIT; ==> break after ProcArrayEndTransaction finishes

Session 2:
CHECKPOINT;

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavan Deolasee (#2)
Re: Race condition in HEAD, possibly due to PGPROC splitup

Pavan Deolasee <pavan.deolasee@gmail.com> writes:

BTW, on an unrelated note, I was looking at the way ShmemInitStruct()
is used. It internally uses ShmemAlloc to allocate from shared memory.
ShmemAlloc always MAXALIGN the requested size. But while calculating
the total shared memory requirement, we don't always MAXALIGN
individual structure sizes. One example is KnownAssignedXidsValid, but
I am sure there are other places where the originally computed and the
requested sizes could be different because of alignment.

I wonder if we are just lucky not to hit shared memory size mismatch,
may be because we round up to the block size at the end.

That sort of thing is down in the noise. One reason we throw in the
100KB slop is so we don't have to sweat details like that.

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavan Deolasee (#3)
Re: Race condition in HEAD, possibly due to PGPROC splitup

Pavan Deolasee <pavan.deolasee@gmail.com> writes:

Looking at CommitTransaction(), it seems quite clear to me that we
call ProcArrayEndTransaction() before releasing the locks held by the
transaction. So its quite possible that when
GetRunningTransactionLocks goes through the list of currently held
locks, the pgxact->xid is already cleared. This seems to a old bug to
me and not related to PGXACT work.

Hm. So maybe the correct fix is to deem the lock already released
if we get zero when we read the xid? It's not clear to me what the
requirements for GetRunningTransactionLocks actually are, but if it's
okay for it to think a lock is released slightly ahead of when the
rest of the system thinks so, that would work.

regards, tom lane

#6Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#5)
Re: Race condition in HEAD, possibly due to PGPROC splitup

On Wed, Dec 14, 2011 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Pavan Deolasee <pavan.deolasee@gmail.com> writes:

Looking at CommitTransaction(), it seems quite clear to me that we
call ProcArrayEndTransaction() before releasing the locks held by the
transaction. So its quite possible that when
GetRunningTransactionLocks goes through the list of currently held
locks, the pgxact->xid is already cleared. This seems to a old bug to
me and not related to PGXACT work.

Hm.  So maybe the correct fix is to deem the lock already released
if we get zero when we read the xid?  It's not clear to me what the
requirements for GetRunningTransactionLocks actually are, but if it's
okay for it to think a lock is released slightly ahead of when the
rest of the system thinks so, that would work.

OK, I'll look at this.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#5)
1 attachment(s)
Re: Race condition in HEAD, possibly due to PGPROC splitup

On Wed, Dec 14, 2011 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Pavan Deolasee <pavan.deolasee@gmail.com> writes:

Looking at CommitTransaction(), it seems quite clear to me that we
call ProcArrayEndTransaction() before releasing the locks held by the
transaction. So its quite possible that when
GetRunningTransactionLocks goes through the list of currently held
locks, the pgxact->xid is already cleared. This seems to a old bug to
me and not related to PGXACT work.

Hm.  So maybe the correct fix is to deem the lock already released
if we get zero when we read the xid?  It's not clear to me what the
requirements for GetRunningTransactionLocks actually are, but if it's
okay for it to think a lock is released slightly ahead of when the
rest of the system thinks so, that would work.

Attached patch closes both race conditions:
* where xid is zero
* where xid is non-zero yet WAL record for the commit of xid wins race
ahead of the WAL record for locks

Patch fixes it in backwards compatible way.

No version increments.

Patch head, 9.1 and 9.0.

Will wait a couple of days for additional testing.

Comments?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

hs_release_old_locks.v2.patchtext/x-patch; charset=US-ASCII; name=hs_release_old_locks.v2.patchDownload
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 499,505 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	 * Remove stale transactions, if any.
  	 */
  	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
! 	StandbyReleaseOldLocks(running->oldestRunningXid);
  
  	/*
  	 * If our snapshot is already valid, nothing else to do...
--- 499,505 ----
  	 * Remove stale transactions, if any.
  	 */
  	ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
! 	StandbyReleaseOldLocks(running->xcnt, running->xids);
  
  	/*
  	 * If our snapshot is already valid, nothing else to do...
***************
*** 554,565 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	 */
  
  	/*
- 	 * Release any locks belonging to old transactions that are not running
- 	 * according to the running-xacts record.
- 	 */
- 	StandbyReleaseOldLocks(running->nextXid);
- 
- 	/*
  	 * Nobody else is running yet, but take locks anyhow
  	 */
  	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
--- 554,559 ----
*** a/src/backend/storage/ipc/standby.c
--- b/src/backend/storage/ipc/standby.c
***************
*** 525,531 **** StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
  	LOCKTAG		locktag;
  
  	/* Already processed? */
! 	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
  		return;
  
  	elog(trace_recovery(DEBUG4),
--- 525,533 ----
  	LOCKTAG		locktag;
  
  	/* Already processed? */
! 	if (!TransactionIdIsValid(xid) ||
! 		TransactionIdDidCommit(xid) ||
! 		TransactionIdDidAbort(xid))
  		return;
  
  	elog(trace_recovery(DEBUG4),
***************
*** 607,640 **** StandbyReleaseLockTree(TransactionId xid, int nsubxids, TransactionId *subxids)
  }
  
  /*
!  * StandbyReleaseLocksMany
!  *		Release standby locks held by XIDs < removeXid
!  *
!  * If keepPreparedXacts is true, keep prepared transactions even if
!  * they're older than removeXid
   */
! static void
! StandbyReleaseLocksMany(TransactionId removeXid, bool keepPreparedXacts)
  {
  	ListCell   *cell,
  			   *prev,
  			   *next;
  	LOCKTAG		locktag;
  
- 	/*
- 	 * Release all matching locks.
- 	 */
  	prev = NULL;
  	for (cell = list_head(RecoveryLockList); cell; cell = next)
  	{
  		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
  
  		next = lnext(cell);
  
! 		if (!TransactionIdIsValid(removeXid) || TransactionIdPrecedes(lock->xid, removeXid))
  		{
- 			if (keepPreparedXacts && StandbyTransactionIdIsPrepared(lock->xid))
- 				continue;
  			elog(trace_recovery(DEBUG4),
  				 "releasing recovery lock: xid %u db %u rel %u",
  				 lock->xid, lock->dbOid, lock->relOid);
--- 609,691 ----
  }
  
  /*
!  * Called at end of recovery and when we see a shutdown checkpoint.
   */
! void
! StandbyReleaseAllLocks(void)
! {
! 	ListCell   *cell,
! 			   *prev,
! 			   *next;
! 	LOCKTAG		locktag;
! 
! 	elog(trace_recovery(DEBUG2), "release all standby locks");
! 
! 	prev = NULL;
! 	for (cell = list_head(RecoveryLockList); cell; cell = next)
! 	{
! 		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
! 
! 		next = lnext(cell);
! 
! 		elog(trace_recovery(DEBUG4),
! 			 "releasing recovery lock: xid %u db %u rel %u",
! 			 lock->xid, lock->dbOid, lock->relOid);
! 		SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
! 		if (!LockRelease(&locktag, AccessExclusiveLock, true))
! 			elog(LOG,
! 				 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
! 				 lock->xid, lock->dbOid, lock->relOid);
! 		RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev);
! 		pfree(lock);
! 	}
! }
! 
! /*
!  * StandbyReleaseOldLocks
!  *		Release standby locks held by XIDs that aren't running, as long
!  *		as they're not prepared transactions.
!  */
! void
! StandbyReleaseOldLocks(int nxids, TransactionId *xids)
  {
  	ListCell   *cell,
  			   *prev,
  			   *next;
  	LOCKTAG		locktag;
  
  	prev = NULL;
  	for (cell = list_head(RecoveryLockList); cell; cell = next)
  	{
  		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+ 		bool	remove = false;
  
  		next = lnext(cell);
  
! 		Assert(TransactionIdIsValid(lock->xid));
! 
! 		if (StandbyTransactionIdIsPrepared(lock->xid))
! 			remove = false;
! 		else
! 		{
! 			int		i;
! 			bool	found = false;
! 
! 			for (i = 0; i < nxids; i++)
! 			{
! 				if (lock->xid == xids[i])
! 					found = true;
! 			}
! 
! 			/*
! 			 * If its not a running transaction, remove it.
! 			 */
! 			if (!found)
! 				remove = true;
! 		}
! 
! 		if (remove)
  		{
  			elog(trace_recovery(DEBUG4),
  				 "releasing recovery lock: xid %u db %u rel %u",
  				 lock->xid, lock->dbOid, lock->relOid);
***************
*** 652,678 **** StandbyReleaseLocksMany(TransactionId removeXid, bool keepPreparedXacts)
  }
  
  /*
-  * Called at end of recovery and when we see a shutdown checkpoint.
-  */
- void
- StandbyReleaseAllLocks(void)
- {
- 	elog(trace_recovery(DEBUG2), "release all standby locks");
- 	StandbyReleaseLocksMany(InvalidTransactionId, false);
- }
- 
- /*
-  * StandbyReleaseOldLocks
-  *		Release standby locks held by XIDs < removeXid, as long
-  *		as they're not prepared transactions.
-  */
- void
- StandbyReleaseOldLocks(TransactionId removeXid)
- {
- 	StandbyReleaseLocksMany(removeXid, true);
- }
- 
- /*
   * --------------------------------------------------------------------
   *		Recovery handling for Rmgr RM_STANDBY_ID
   *
--- 703,708 ----
***************
*** 813,818 **** standby_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 843,855 ----
   * Later, when we apply the running xact data we must be careful to ignore
   * transactions already committed, since those commits raced ahead when
   * making WAL entries.
+  *
+  * The loose timing also means that locks may be recorded that have a
+  * zero xid, since xids are removed from procs before locks are removed.
+  * So we must prune the lock list down to ensure we hold locks only for
+  * currently running xids, performed by StandbyReleaseOldLocks().
+  * Zero xids should no longer be possible, but we may be replaying WAL
+  * from a time when they were possible.
   */
  void
  LogStandbySnapshot(TransactionId *nextXid)
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
***************
*** 3190,3195 **** GetRunningTransactionLocks(int *nlocks)
--- 3190,3205 ----
  			PGPROC	   *proc = proclock->tag.myProc;
  			PGXACT	   *pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
  			LOCK	   *lock = proclock->tag.myLock;
+ 			TransactionId xid = pgxact->xid;
+ 
+ 			/*
+ 			 * Don't record locks for transactions if we know they have already
+ 			 * issued their WAL record for commit but not yet released lock.
+ 			 * It is still possible that we see locks held by already complete
+ 			 * transactions, if they haven't yet zeroed their xids.
+ 			 */
+ 			if (!TransactionIdIsValid(xid))
+ 				continue;
  
  			accessExclusiveLocks[index].xid = pgxact->xid;
  			accessExclusiveLocks[index].dbOid = lock->tag.locktag_field1;
*** a/src/include/storage/standby.h
--- b/src/include/storage/standby.h
***************
*** 48,54 **** extern void StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid
  extern void StandbyReleaseLockTree(TransactionId xid,
  					   int nsubxids, TransactionId *subxids);
  extern void StandbyReleaseAllLocks(void);
! extern void StandbyReleaseOldLocks(TransactionId removeXid);
  
  /*
   * XLOG message types
--- 48,54 ----
  extern void StandbyReleaseLockTree(TransactionId xid,
  					   int nsubxids, TransactionId *subxids);
  extern void StandbyReleaseAllLocks(void);
! extern void StandbyReleaseOldLocks(int nxids, TransactionId *xids);
  
  /*
   * XLOG message types
#8Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#7)
Re: Race condition in HEAD, possibly due to PGPROC splitup

On Thu, Dec 15, 2011 at 8:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Comments?

I think there is a small bug here:

+ 			TransactionId xid = pgxact->xid;
+
+ 			/*
+ 			 * Don't record locks for transactions if we know they have already
+ 			 * issued their WAL record for commit but not yet released lock.
+ 			 * It is still possible that we see locks held by already complete
+ 			 * transactions, if they haven't yet zeroed their xids.
+ 			 */
+ 			if (!TransactionIdIsValid(xid))
+ 				continue;

accessExclusiveLocks[index].xid = pgxact->xid;
accessExclusiveLocks[index].dbOid = lock->tag.locktag_field1;

...because you're fetching pgxact->xid, testing whether it's valid,
and then fetching it again. It could change in the meantime. You
probably want to change the assignment to read:

accessExclusiveLocks[index].xid = xid;

Also, we should probably change this pointer to be declared volatile:

PGXACT *pgxact = &ProcGlobal->allPgXact[proc->pgprocno];

Otherwise, I think the compiler might get cute and decide to fetch the
xid twice anyway, effectively undoing our attempt to pull it into a
local variable.

I think this comment could be clarified in some way, to make it more
clear that we had a bug at one point, and it was fixed - the "from a
time when they were possible" language wouldn't be entirely clear to
me after the fact:

+  * Zero xids should no longer be possible, but we may be replaying WAL
+  * from a time when they were possible.

It would probably make sense to break out of this loop if a match is found:

! for (i = 0; i < nxids; i++)
! {
! if (lock->xid == xids[i])
! found = true;
! }

I'm not sure I fully understand the rest of this, but those are the
only things I noticed on a read-through.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#8)
Re: Race condition in HEAD, possibly due to PGPROC splitup

On Thu, Dec 15, 2011 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 15, 2011 at 8:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Comments?

I think there is a small bug here:

+                       TransactionId xid = pgxact->xid;
+
+                       /*
+                        * Don't record locks for transactions if we know they have already
+                        * issued their WAL record for commit but not yet released lock.
+                        * It is still possible that we see locks held by already complete
+                        * transactions, if they haven't yet zeroed their xids.
+                        */
+                       if (!TransactionIdIsValid(xid))
+                               continue;

                       accessExclusiveLocks[index].xid = pgxact->xid;
                       accessExclusiveLocks[index].dbOid = lock->tag.locktag_field1;

...because you're fetching pgxact->xid, testing whether it's valid,
and then fetching it again.  It could change in the meantime.  You
probably want to change the assignment to read:

                       accessExclusiveLocks[index].xid = xid;

Also, we should probably change this pointer to be declared volatile:

                       PGXACT     *pgxact = &ProcGlobal->allPgXact[proc->pgprocno];

Otherwise, I think the compiler might get cute and decide to fetch the
xid twice anyway, effectively undoing our attempt to pull it into a
local variable.

I think this comment could be clarified in some way, to make it more
clear that we had a bug at one point, and it was fixed - the "from a
time when they were possible" language wouldn't be entirely clear to
me after the fact:

+  * Zero xids should no longer be possible, but we may be replaying WAL
+  * from a time when they were possible.

It would probably make sense to break out of this loop if a match is found:

!                       for (i = 0; i < nxids; i++)
!                       {
!                               if (lock->xid == xids[i])
!                                       found = true;
!                       }

I'm not sure I fully understand the rest of this, but those are the
only things I noticed on a read-through.

Thanks, useful changes.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services