StrategyGetBuffer optimization, take 2

Started by Merlin Moncure over 12 years ago · 21 messages
#1Merlin Moncure
mmoncure@gmail.com
1 attachment(s)

My $company recently acquired another postgres based $company and
migrated all their server operations into our datacenter. Upon
completing the move, the newly migrated database server started
experiencing huge load spikes.

*) Environment description:
Postgres 9.2.4
RHEL 6
32 cores
virtualized (ESX) but with a dedicated host
256GB ram
shared_buffers: 32G
96 application servers configured to max 5 connections each
very fast i/o
database size: ~ 200GB
HS/SR: 3 slaves

*) Problem description:
The server normally hums along nicely with load < 1.0 and no iowait --
in fact the server is massively over-provisioned. However, on
semi-random basis (once every 1-2 days) load absolutely goes through
the roof to 600+, no iowait, 90-100% (70%+ sys) cpu. It hangs around
like that for 5-20 minutes then resolves as suddenly as it started.
There is nothing interesting going on application side (except that the
application servers are all piling on), but pg_locks records lots of
contention on relation extension locks. One interesting point is that
the slaves are also affected, but the onset of high load hits them some
seconds after it hits the master.

*) Initial steps taken:
RhodiumToad (aka Andrew G.) has seen this in the wild several times and
suggested that dropping shared_buffers significantly might resolve the
situation short term. That was done on Friday night, and so far the
problem has not recurred.

*) What I think is happening:
I think we are again getting burned by getting de-scheduled while
holding the free list lock. I've been chasing this problem for a long
time now (for example, see:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
but now I've got a reproducible case. What is happening is this:

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI. this goes down the chain until StrategyGetBuffer()
3. Lock free list, go into clock sweep loop
4. while holding the free list lock in the clock sweep, hit a 'hot' buffer, spin on it
5. get de-scheduled
6. now enter the 'hot buffer spin lock lottery'
7. more and more backends pile on, the linux scheduler goes berserk, reducing
chances of winning #6
8. finally win the lottery. lock released. everything back to normal.
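The lock nesting in steps 1-5 can be condensed into a sketch. The function and lock names below mirror the real PostgreSQL call chain, but the bodies are hypothetical stand-ins that only track nesting depth, so the shape of the problem is visible:

```c
#include <assert.h>

/* Hypothetical stand-in locks: counters instead of real spinlocks, so
 * the nesting depth of the chain in steps 1-5 is visible. */
static int locks_held = 0;
static int max_depth = 0;

static void
take_lock(const char *name)
{
    (void) name;
    if (++locks_held > max_depth)
        max_depth = locks_held;
}

static void
drop_lock(void)
{
    locks_held--;
}

/* steps 2-4: ReadBufferBI bottoms out in StrategyGetBuffer */
static int
StrategyGetBuffer_sketch(void)
{
    take_lock("BufFreelistLock");   /* step 3: lock the free list */
    take_lock("buf_hdr_lock");      /* step 4: spin on a hot buffer's
                                     * header; a deschedule here (step 5)
                                     * stalls every allocating backend */
    drop_lock();
    drop_lock();
    return 0;
}

/* step 1: the extension lock is taken before buffer allocation begins */
static int
RelationGetBufferForTuple_sketch(void)
{
    take_lock("relation extension lock");
    int         buf = StrategyGetBuffer_sketch();

    drop_lock();
    return buf;
}
```

Three locks end up stacked, and losing the CPU at the innermost one is what turns a cheap critical section into the lottery of steps 6-8.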

*) what I would like to do to fix it:
see attached patch. This builds on the work of Jeff Janes to remove
the free list lock and has some extra optimizations in the clock sweep
loop:

optimization 1: usage count is advisory. it is not updated behind the
buffer lock. in the event there is a large sequence of buffers with
0 usage_count, this avoids spamming the cache line lock; you
decrement and hope for the best

optimization 2: refcount is examined during buffer allocation without
a lock. if it's > 0, buffer is assumed pinned (even though it may not
in fact be) and sweep continues

optimization 3: sweep does not wait on buf header lock. instead, it
does 'try lock' and bails if the buffer is determined pinned. I
believe this to be one of the two critical optimizations

optimization 4: remove free list lock (via Jeff Janes). This is the
other optimization: one backend will no longer be able to shut down
buffer allocation
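Condensing the diff below, the optimized sweep loop looks roughly like this single-threaded sketch. FakeBufferDesc, try_lock_hdr, and the fixed pool are simplified stand-ins for the real structures, not the patch itself:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NBUFFERS 4              /* stand-in for NBuffers */

typedef struct
{
    atomic_bool hdr_lock;       /* stand-in for buf_hdr_lock */
    int         refcount;
    int         usage_count;
} FakeBufferDesc;

static FakeBufferDesc buffers[NBUFFERS];
static int  sweepHand;          /* stand-in for nextVictimBuffer */

/* try-lock: true iff the header lock was free and is now held */
static bool
try_lock_hdr(FakeBufferDesc *buf)
{
    return !atomic_exchange(&buf->hdr_lock, true);
}

/* Sketch of the optimized sweep: refcount and usage_count are read and
 * decremented without the header lock (optimizations 1 and 2); the lock
 * is only try-acquired at the moment of victory (optimization 3), and
 * we keep sweeping rather than spin if someone else holds it.
 * Returns the victim index with its header lock held, or -1. */
static int
sweep_sketch(void)
{
    int         trycounter = NBUFFERS;

    for (;;)
    {
        FakeBufferDesc *buf = &buffers[sweepHand];

        if (++sweepHand >= NBUFFERS)
            sweepHand = 0;

        if (buf->refcount == 0)         /* unlocked read: advisory */
        {
            if (buf->usage_count > 0)
            {
                buf->usage_count--;     /* unlocked decrement: advisory */
                trycounter = NBUFFERS;
            }
            else if (try_lock_hdr(buf)) /* no waiting on the header */
                return (int) (buf - buffers);
            /* lock busy: buffer is being touched, just keep sweeping */
        }
        else if (--trycounter == 0)
            return -1;          /* "no unpinned buffers available" */
    }
}
```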

*) what I'm asking for

Are the analysis and the patch to fix the perceived problem plausible
without breaking other stuff? If so, I'm inclined to go further with
this. This is not the only solution on the table for high buffer
contention, but IMNSHO it should get a lot of points for being very
localized. Maybe a reduced version could be tried retaining the
freelist lock but keeping the 'trylock' on the buf header.

*) further reading:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html

/messages/by-id/CAHyXU0x47D4n6EdPyNyadShXQQXKoheLV2cbRgr_2NGrC8KRRQ@mail.gmail.com

http://postgresql.1045698.n5.nabble.com/Page-replacement-algorithm-in-buffer-cache-td5749236.html

merlin

Attachments:

buffer2.patch (application/octet-stream)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
new file mode 100755
index 1c41428..699b0ec
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** SyncOneBuffer(int buf_id, bool skip_rece
*** 1665,1671 ****
  	 */
  	LockBufHdr(bufHdr);
  
! 	if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
  		result |= BUF_REUSABLE;
  	else if (skip_recently_used)
  	{
--- 1665,1671 ----
  	 */
  	LockBufHdr(bufHdr);
  
! 	if (bufHdr->refcount == 0 && bufHdr->usage_count <= 0)
  		result |= BUF_REUSABLE;
  	else if (skip_recently_used)
  	{
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
new file mode 100755
index c76aaf7..a081069
*** a/src/backend/storage/buffer/freelist.c
--- b/src/backend/storage/buffer/freelist.c
*************** typedef struct
*** 28,34 ****
  	int			nextVictimBuffer;
  
  	int			firstFreeBuffer;	/* Head of list of unused buffers */
! 	int			lastFreeBuffer; /* Tail of list of unused buffers */
  
  	/*
  	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
--- 28,35 ----
  	int			nextVictimBuffer;
  
  	int			firstFreeBuffer;	/* Head of list of unused buffers */
! 	int			lastFreeBuffer;		 /* Tail of list of unused buffers */
! 	slock_t			mutex;			/* Protects all of these, except sometimes not nextVictimBuffer */
  
  	/*
  	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
*************** typedef struct
*** 49,55 ****
  } BufferStrategyControl;
  
  /* Pointers to shared state */
! static BufferStrategyControl *StrategyControl = NULL;
  
  /*
   * Private (non-shared) state for managing a ring of shared buffers to re-use.
--- 50,56 ----
  } BufferStrategyControl;
  
  /* Pointers to shared state */
! static volatile BufferStrategyControl *StrategyControl = NULL;
  
  /*
   * Private (non-shared) state for managing a ring of shared buffers to re-use.
*************** static void AddBufferToRing(BufferAccess
*** 101,112 ****
   *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
   *
   *	To ensure that no one else can pin the buffer before we do, we must
!  *	return the buffer with the buffer header spinlock still held.  If
!  *	*lock_held is set on exit, we have returned with the BufFreelistLock
!  *	still held, as well; the caller must release that lock once the spinlock
!  *	is dropped.  We do it that way because releasing the BufFreelistLock
!  *	might awaken other processes, and it would be bad to do the associated
!  *	kernel calls while holding the buffer header spinlock.
   */
  volatile BufferDesc *
  StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
--- 102,113 ----
   *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
   *
   *	To ensure that no one else can pin the buffer before we do, we must
!  *	return the buffer with the buffer header spinlock still held.  To avoid
!  *  excessive spinning during the clock sweep loop on highly contended
!  *  buffers, the buffer header lock is acquired without spinning.  Also, all
!  *  manipulation of usage_count is done without holding the buffer header
!  *  spinlock on the basis that it is advisory; this avoids excessive cache line
!  *  locking while sweeping.
   */
  volatile BufferDesc *
  StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 114,119 ****
--- 115,121 ----
  	volatile BufferDesc *buf;
  	Latch	   *bgwriterLatch;
  	int			trycounter;
+ 	*lock_held = false;
  
  	/*
  	 * If given a strategy object, see whether it can select a buffer. We
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 124,143 ****
  		buf = GetBufferFromRing(strategy);
  		if (buf != NULL)
  		{
- 			*lock_held = false;
  			return buf;
  		}
  	}
  
- 	/* Nope, so lock the freelist */
- 	*lock_held = true;
- 	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
- 
  	/*
  	 * We count buffer allocation requests so that the bgwriter can estimate
  	 * the rate of buffer consumption.	Note that buffers recycled by a
  	 * strategy object are intentionally not counted here.
  	 */
  	StrategyControl->numBufferAllocs++;
  
  	/*
--- 126,142 ----
  		buf = GetBufferFromRing(strategy);
  		if (buf != NULL)
  		{
  			return buf;
  		}
  	}
  
  	/*
  	 * We count buffer allocation requests so that the bgwriter can estimate
  	 * the rate of buffer consumption.	Note that buffers recycled by a
  	 * strategy object are intentionally not counted here.
+ 	 * This is now done without a lock, and so updates can be lost
  	 */
+ 
  	StrategyControl->numBufferAllocs++;
  
  	/*
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 150,158 ****
  	if (bgwriterLatch)
  	{
  		StrategyControl->bgwriterLatch = NULL;
- 		LWLockRelease(BufFreelistLock);
  		SetLatch(bgwriterLatch);
- 		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
  	}
  
  	/*
--- 149,155 ----
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 163,168 ****
--- 160,171 ----
  	 */
  	while (StrategyControl->firstFreeBuffer >= 0)
  	{
+ 		SpinLockAcquire(&StrategyControl->mutex);
+ 		if (StrategyControl->firstFreeBuffer<0)
+ 		{
+ 			SpinLockRelease(&StrategyControl->mutex);
+ 			break;
+ 		}
  		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
  		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
  
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 170,175 ****
--- 173,180 ----
  		StrategyControl->firstFreeBuffer = buf->freeNext;
  		buf->freeNext = FREENEXT_NOT_IN_LIST;
  
+ 		SpinLockRelease(&StrategyControl->mutex);
+ 
  		/*
  		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
  		 * it; discard it and retry.  (This can only happen if VACUUM put a
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 178,184 ****
  		 * we'd better check anyway.)
  		 */
  		LockBufHdr(buf);
! 		if (buf->refcount == 0 && buf->usage_count == 0)
  		{
  			if (strategy != NULL)
  				AddBufferToRing(strategy, buf);
--- 183,189 ----
  		 * we'd better check anyway.)
  		 */
  		LockBufHdr(buf);
! 		if (buf->refcount == 0 && buf->usage_count <= 0)
  		{
  			if (strategy != NULL)
  				AddBufferToRing(strategy, buf);
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 191,209 ****
  	trycounter = NBuffers;
  	for (;;)
  	{
! 		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
  
  		if (++StrategyControl->nextVictimBuffer >= NBuffers)
  		{
! 			StrategyControl->nextVictimBuffer = 0;
! 			StrategyControl->completePasses++;
  		}
  
  		/*
  		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
  		 * it; decrement the usage_count (unless pinned) and keep scanning.
  		 */
- 		LockBufHdr(buf);
  		if (buf->refcount == 0)
  		{
  			if (buf->usage_count > 0)
--- 196,222 ----
  	trycounter = NBuffers;
  	for (;;)
  	{
! 
! 		int localVictim=StrategyControl->nextVictimBuffer;
! 		if (localVictim >= NBuffers) localVictim=0;
! 
! 		buf = &BufferDescriptors[localVictim];
  
  		if (++StrategyControl->nextVictimBuffer >= NBuffers)
  		{
! 			SpinLockAcquire(&StrategyControl->mutex);
! 			if (StrategyControl->nextVictimBuffer >= NBuffers)
! 			{
! 				StrategyControl->nextVictimBuffer = 0;
! 				StrategyControl->completePasses++;
! 			};
! 			SpinLockRelease(&StrategyControl->mutex);
  		}
  
  		/*
  		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
  		 * it; decrement the usage_count (unless pinned) and keep scanning.
  		 */
  		if (buf->refcount == 0)
  		{
  			if (buf->usage_count > 0)
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 213,222 ****
  			}
  			else
  			{
! 				/* Found a usable buffer */
! 				if (strategy != NULL)
! 					AddBufferToRing(strategy, buf);
! 				return buf;
  			}
  		}
  		else if (--trycounter == 0)
--- 226,238 ----
  			}
  			else
  			{
! 				/* Found a usable buffer.  But is it truly unpinned? */
! 				if (TryLockBufHdr(buf))
! 				{
! 					if (strategy != NULL)
! 						AddBufferToRing(strategy, buf);
! 					return buf;
! 				}
  			}
  		}
  		else if (--trycounter == 0)
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 228,237 ****
  			 * probably better to fail than to risk getting stuck in an
  			 * infinite loop.
  			 */
- 			UnlockBufHdr(buf);
  			elog(ERROR, "no unpinned buffers available");
  		}
- 		UnlockBufHdr(buf);
  	}
  }
  
--- 244,251 ----
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 241,247 ****
  void
  StrategyFreeBuffer(volatile BufferDesc *buf)
  {
! 	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
  
  	/*
  	 * It is possible that we are told to put something in the freelist that
--- 255,261 ----
  void
  StrategyFreeBuffer(volatile BufferDesc *buf)
  {
! 	SpinLockAcquire(&StrategyControl->mutex);
  
  	/*
  	 * It is possible that we are told to put something in the freelist that
*************** StrategyFreeBuffer(volatile BufferDesc *
*** 255,261 ****
  		StrategyControl->firstFreeBuffer = buf->buf_id;
  	}
  
! 	LWLockRelease(BufFreelistLock);
  }
  
  /*
--- 269,275 ----
  		StrategyControl->firstFreeBuffer = buf->buf_id;
  	}
  
! 	SpinLockRelease(&StrategyControl->mutex);
  }
  
  /*
*************** StrategySyncStart(uint32 *complete_passe
*** 274,281 ****
  {
  	int			result;
  
! 	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
  	result = StrategyControl->nextVictimBuffer;
  	if (complete_passes)
  		*complete_passes = StrategyControl->completePasses;
  	if (num_buf_alloc)
--- 288,296 ----
  {
  	int			result;
  
! 	SpinLockAcquire(&StrategyControl->mutex);
  	result = StrategyControl->nextVictimBuffer;
+ 	if (result >= NBuffers) result=0;
  	if (complete_passes)
  		*complete_passes = StrategyControl->completePasses;
  	if (num_buf_alloc)
*************** StrategySyncStart(uint32 *complete_passe
*** 283,289 ****
  		*num_buf_alloc = StrategyControl->numBufferAllocs;
  		StrategyControl->numBufferAllocs = 0;
  	}
! 	LWLockRelease(BufFreelistLock);
  	return result;
  }
  
--- 298,304 ----
  		*num_buf_alloc = StrategyControl->numBufferAllocs;
  		StrategyControl->numBufferAllocs = 0;
  	}
! 	SpinLockRelease(&StrategyControl->mutex);
  	return result;
  }
  
*************** StrategyInitialize(bool init)
*** 376,381 ****
--- 391,397 ----
  		 */
  		StrategyControl->firstFreeBuffer = 0;
  		StrategyControl->lastFreeBuffer = NBuffers - 1;
+ 		SpinLockInit(&StrategyControl->mutex);
  
  		/* Initialize the clock sweep pointer */
  		StrategyControl->nextVictimBuffer = 0;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
new file mode 100755
index f0e5144..943215c
*** a/src/include/storage/buf_internals.h
--- b/src/include/storage/buf_internals.h
*************** typedef struct sbufdesc
*** 168,173 ****
--- 168,174 ----
   */
  #define LockBufHdr(bufHdr)		SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
  #define UnlockBufHdr(bufHdr)	SpinLockRelease(&(bufHdr)->buf_hdr_lock)
+ #define TryLockBufHdr(bufHdr)	TrySpinLockAcquire(&(bufHdr)->buf_hdr_lock)
  
  
  /* in buf_init.c */
diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h
new file mode 100755
index f459b90..909ecd5
*** a/src/include/storage/spin.h
--- b/src/include/storage/spin.h
***************
*** 63,68 ****
--- 63,70 ----
  
  #define SpinLockAcquire(lock) S_LOCK(lock)
  
+ #define TrySpinLockAcquire(lock) !(TAS_SPIN(lock))
+ 
  #define SpinLockRelease(lock) S_UNLOCK(lock)
  
  #define SpinLockFree(lock)	S_LOCK_FREE(lock)
#2Andres Freund
andres@anarazel.de
In reply to: Merlin Moncure (#1)
Re: StrategyGetBuffer optimization, take 2

On 2013-08-05 10:49:08 -0500, Merlin Moncure wrote:

optimization 4: remove free list lock (via Jeff Janes). This is the
other optimization: one backend will no longer be able to shut down
buffer allocation

I think splitting off the actual freelist checking into a spinlock makes
quite a bit of sense. And I think a separate one should be used for the
actual search for the clock sweep.
I don't think the unlocked increment of nextVictimBuffer is a good idea
though. nextVictimBuffer jumping over NBuffers under concurrency seems
like a recipe for disaster to me. At the very, very least it will need a
good wad of comments explaining what it means and how you're allowed to
use it. The current way will lead to at least bgwriter accessing a
nonexistant/out of bounds buffer via StrategySyncStart().
Possibly it won't even save that much, it might just increase the
contention on the buffer header spinlock's cacheline.
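One unlocked scheme that avoids the out-of-bounds hazard described here (and roughly the direction later PostgreSQL releases took once atomics were available) is to fetch-and-add a monotonically growing counter and reduce it modulo NBuffers on read, so that no interleaving of increments can ever leave a reader with an out-of-range index. A hedged C11 sketch, not the patch's actual code:

```c
#include <assert.h>
#include <stdatomic.h>

#define NBUFFERS 8              /* stand-in for NBuffers */

/* Monotonically increasing victim counter; it is never wrapped in
 * place, so a concurrent reader can never observe an out-of-range
 * buffer index no matter how increments interleave. */
static atomic_ulong nextVictim;

static int
clock_sweep_tick(void)
{
    unsigned long victim = atomic_fetch_add(&nextVictim, 1);

    return (int) (victim % NBUFFERS);
}
```

The completePasses counter would still need separate handling (e.g. a compare-and-swap when the raw counter crosses a multiple of NBuffers), which is where most of the remaining complexity lives.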

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Atri Sharma
atri.jiit@gmail.com
In reply to: Merlin Moncure (#1)
Re: StrategyGetBuffer optimization, take 2

optimization 2: refcount is examined during buffer allocation without
a lock. if it's > 0, buffer is assumed pinned (even though it may not
in fact be) and sweep continues

+1.

I think this should not lead to many problems, since a lost update
cannot, IMO, lead to a disastrous result. At most, a buffer page can
survive for an extra clock sweep.

optimization 3: sweep does not wait on buf header lock. instead, it
does 'try lock' and bails if the buffer is determined pinned. I
believe this to be one of the two critical optimizations

+1 again. I believe the try lock will be a sort of filter before the
actual check, hence reducing the contention.

Regards,

Atri

--
Regards,

Atri
l'apprenant


#4Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#2)
1 attachment(s)
Re: StrategyGetBuffer optimization, take 2

On Mon, Aug 5, 2013 at 11:40 AM, Andres Freund <andres@anarazel.de> wrote:

On 2013-08-05 10:49:08 -0500, Merlin Moncure wrote:

optimization 4: remove free list lock (via Jeff Janes). This is the
other optimization: one backend will no longer be able to shut down
buffer allocation

I think splitting off the actual freelist checking into a spinlock makes
quite a bit of sense. And I think a separate one should be used for the
actual search for the clock sweep.

I don't think the unlocked increment of nextVictimBuffer is a good idea
though. nextVictimBuffer jumping over NBuffers under concurrency seems
like a recipe for disaster to me. At the very, very least it will need a
good wad of comments explaining what it means and how you're allowed to
use it. The current way will lead to at least bgwriter accessing a
nonexistant/out of bounds buffer via StrategySyncStart().
Possibly it won't even save that much, it might just increase the
contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

Attached is a scaled down version of the patch that keeps the freelist
lock but still removes the spinlock during the clock sweep. This
still hits the major objectives of reducing the chance of scheduling
out while holding the BufFreelistLock and mitigating the worst case
impact of doing so if it does happen. An even more scaled down
version would keep the current logic exactly as is except for
replacing buffer lock in the clock sweep with a trylock (which is
IMNSHO a no-brainer).

merlin

Attachments:

buffer3.patch (application/octet-stream)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
new file mode 100755
index 1c41428..61fc1e6
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** PinBuffer(volatile BufferDesc *buf, Buff
*** 1105,1111 ****
  		}
  		else
  		{
! 			if (buf->usage_count == 0)
  				buf->usage_count = 1;
  		}
  		result = (buf->flags & BM_VALID) != 0;
--- 1105,1111 ----
  		}
  		else
  		{
! 			if (buf->usage_count <= 0)
  				buf->usage_count = 1;
  		}
  		result = (buf->flags & BM_VALID) != 0;
*************** SyncOneBuffer(int buf_id, bool skip_rece
*** 1665,1671 ****
  	 */
  	LockBufHdr(bufHdr);
  
! 	if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
  		result |= BUF_REUSABLE;
  	else if (skip_recently_used)
  	{
--- 1665,1671 ----
  	 */
  	LockBufHdr(bufHdr);
  
! 	if (bufHdr->refcount == 0 && bufHdr->usage_count <= 0)
  		result |= BUF_REUSABLE;
  	else if (skip_recently_used)
  	{
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
new file mode 100755
index c76aaf7..faa8523
*** a/src/backend/storage/buffer/freelist.c
--- b/src/backend/storage/buffer/freelist.c
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 178,184 ****
  		 * we'd better check anyway.)
  		 */
  		LockBufHdr(buf);
! 		if (buf->refcount == 0 && buf->usage_count == 0)
  		{
  			if (strategy != NULL)
  				AddBufferToRing(strategy, buf);
--- 178,184 ----
  		 * we'd better check anyway.)
  		 */
  		LockBufHdr(buf);
! 		if (buf->refcount == 0 && buf->usage_count <= 0)
  		{
  			if (strategy != NULL)
  				AddBufferToRing(strategy, buf);
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 201,209 ****
  
  		/*
  		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
! 		 * it; decrement the usage_count (unless pinned) and keep scanning.
  		 */
- 		LockBufHdr(buf);
  		if (buf->refcount == 0)
  		{
  			if (buf->usage_count > 0)
--- 201,211 ----
  
  		/*
  		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
! 		 * it; decrement the usage_count (unless pinned) and keep scanning.  The
! 		 * buffer header is deliberately not locked during the initial examination
! 		 * of refcount or while usage_count is adjusted in order to minimize
! 		 * spinlock contention while holding the BufFreelistLock.
  		 */
  		if (buf->refcount == 0)
  		{
  			if (buf->usage_count > 0)
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 213,222 ****
  			}
  			else
  			{
! 				/* Found a usable buffer */
! 				if (strategy != NULL)
! 					AddBufferToRing(strategy, buf);
! 				return buf;
  			}
  		}
  		else if (--trycounter == 0)
--- 215,227 ----
  			}
  			else
  			{
! 				/* Found a usable buffer. But is it truly unpinned? */
! 				if (TryLockBufHdr(buf))
! 				{
! 					if (strategy != NULL)
! 						AddBufferToRing(strategy, buf);
! 					return buf;
! 				}
  			}
  		}
  		else if (--trycounter == 0)
*************** StrategyGetBuffer(BufferAccessStrategy s
*** 228,237 ****
  			 * probably better to fail than to risk getting stuck in an
  			 * infinite loop.
  			 */
- 			UnlockBufHdr(buf);
  			elog(ERROR, "no unpinned buffers available");
  		}
- 		UnlockBufHdr(buf);
  	}
  }
  
--- 233,240 ----
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
new file mode 100755
index f0e5144..943215c
*** a/src/include/storage/buf_internals.h
--- b/src/include/storage/buf_internals.h
*************** typedef struct sbufdesc
*** 168,173 ****
--- 168,174 ----
   */
  #define LockBufHdr(bufHdr)		SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
  #define UnlockBufHdr(bufHdr)	SpinLockRelease(&(bufHdr)->buf_hdr_lock)
+ #define TryLockBufHdr(bufHdr)	TrySpinLockAcquire(&(bufHdr)->buf_hdr_lock)
  
  
  /* in buf_init.c */
diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h
new file mode 100755
index f459b90..ed68318
*** a/src/include/storage/spin.h
--- b/src/include/storage/spin.h
***************
*** 63,68 ****
--- 63,70 ----
  
  #define SpinLockAcquire(lock) S_LOCK(lock)
  
+ #define TrySpinLockAcquire(lock)	!(TAS_SPIN(lock))
+ 
  #define SpinLockRelease(lock) S_UNLOCK(lock)
  
  #define SpinLockFree(lock)	S_LOCK_FREE(lock)
#5Andres Freund
andres@2ndquadrant.com
In reply to: Merlin Moncure (#4)
Re: StrategyGetBuffer optimization, take 2

On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:

I don't think the unlocked increment of nextVictimBuffer is a good idea
though. nextVictimBuffer jumping over NBuffers under concurrency seems
like a recipe for disaster to me. At the very, very least it will need a
good wad of comments explaining what it means and how you're allowed to
use it. The current way will lead to at least bgwriter accessing a
nonexistant/out of bounds buffer via StrategySyncStart().
Possibly it won't even save that much, it might just increase the
contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

I think converting it to a spinlock actually is a good idea, you just
need to expand the scope a bit.

Attached is a scaled down version of the patch that keeps the freelist
lock but still removes the spinlock during the clock sweep. This
still hits the major objectives of reducing the chance of scheduling
out while holding the BufFreelistLock and mitigating the worst case
impact of doing so if it does happen.

FWIW, I am not convinced this is the trigger for the problems you're
seeing. It's a good idea nonetheless though.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#5)
Re: StrategyGetBuffer optimization, take 2

On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:

I don't think the unlocked increment of nextVictimBuffer is a good idea
though. nextVictimBuffer jumping over NBuffers under concurrency seems
like a recipe for disaster to me. At the very, very least it will need a
good wad of comments explaining what it means and how you're allowed to
use it. The current way will lead to at least bgwriter accessing a
nonexistant/out of bounds buffer via StrategySyncStart().
Possibly it won't even save that much, it might just increase the
contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

I think converting it to a spinlock actually is a good idea, you just
need to expand the scope a bit.

all right: well, I'll work up another version doing full spinlock and
maybe see things shake out in performance.

FWIW, I am not convinced this is the trigger for the problems you're
seeing. It's a good idea nonetheless though.

I have some very strong evidence that the problem is coming out of the
buffer allocator. Exhibit A is that vlad's presentation of the
problem was on a read only load (if not allocator lock, then what?).
Exhibit B is that lowering shared buffers to 2gb seems to have (so
far, 5 days in) fixed the issue. This problem shows up on fast
machines with fast storage and lots of cores. So what I think is
happening is that usage_count starts creeping up faster than it gets
cleared by the sweep with very large buffer settings which in turn
causes the 'problem' buffers to be analyzed for eviction more often.
What is not as clear is whether the proposed optimizations will fix the
problem -- I'd have to get approval to test and confirm them in
production, which seems unlikely at this juncture; that's why I'm
trying to keep things 'win-win', so that they don't have to be
accepted on that basis.

merlin


#7Atri Sharma
atri.jiit@gmail.com
In reply to: Andres Freund (#5)
Re: StrategyGetBuffer optimization, take 2

On Wed, Aug 7, 2013 at 10:37 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:

I don't think the unlocked increment of nextVictimBuffer is a good idea
though. nextVictimBuffer jumping over NBuffers under concurrency seems
like a recipe for disaster to me. At the very, very least it will need a
good wad of comments explaining what it means and how you're allowed to
use it. The current way will lead to at least bgwriter accessing a
nonexistant/out of bounds buffer via StrategySyncStart().
Possibly it won't even save that much, it might just increase the
contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

I think converting it to a spinlock actually is a good idea, you just
need to expand the scope a bit.

Umm, sorry if I am being naive, but don't spinlocks perform badly when
a lot of contention is present?

I feel that we may hit that case here. A preliminary check before the
actual spinlock may be good, though, since spinlocks are cheap while
contention remains low.
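The "preliminary check before the actual spinlock" is the classic test-and-test-and-set pattern: a plain load filters out visibly held locks before paying for the atomic read-modify-write. A hedged C11 sketch of such a try-lock (an illustration, not PostgreSQL's actual s_lock.h implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_bool locked;
} sketch_slock_t;

/* test-and-test-and-set try-lock: the cheap relaxed load skips the
 * cache-line-invalidating exchange when the lock is visibly held,
 * which is what keeps spinlocks inexpensive while contention is low. */
static bool
tatas_trylock(sketch_slock_t *l)
{
    if (atomic_load_explicit(&l->locked, memory_order_relaxed))
        return false;           /* visibly held: bail without an RMW */
    return !atomic_exchange_explicit(&l->locked, true,
                                     memory_order_acquire);
}

static void
sketch_unlock(sketch_slock_t *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```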

Regards,

Atri

--
Regards,

Atri
l'apprenant


#8Amit Kapila
amit.kapila@huawei.com
In reply to: Merlin Moncure (#6)
Re: StrategyGetBuffer optimization, take 2

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
owner@postgresql.org] On Behalf Of Merlin Moncure
Sent: Thursday, August 08, 2013 12:09 AM
To: Andres Freund
Cc: PostgreSQL-development; Jeff Janes
Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2

On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:

I don't think the unlocked increment of nextVictimBuffer is a good idea
though. nextVictimBuffer jumping over NBuffers under concurrency seems
like a recipe for disaster to me. At the very, very least it will need a
good wad of comments explaining what it means and how you're allowed to
use it. The current way will lead to at least bgwriter accessing a
nonexistant/out of bounds buffer via StrategySyncStart().
Possibly it won't even save that much, it might just increase the
contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

I think converting it to a spinlock actually is a good idea, you just
need to expand the scope a bit.

all right: well, I'll work up another version doing full spinlock and
maybe see things shake out in performance.

FWIW, I am not convinced this is the trigger for the problems you're
seeing. It's a good idea nonetheless though.

I have some very strong evidence that the problem is coming out of the
buffer allocator. Exhibit A is that vlad's presentation of the
problem was on a read only load (if not allocator lock, then what?).
Exhibit B is that lowering shared buffers to 2gb seems to have (so
far, 5 days in) fixed the issue. This problem shows up on fast
machines with fast storage and lots of cores. So what I think is
happening is that usage_count starts creeping up faster than it gets
cleared by the sweep with very large buffer settings which in turn
causes the 'problem' buffers to be analyzed for eviction more often.

Yes, one idea which was discussed previously is to not increase the
usage count every time a buffer is pinned.
I am also working on some optimizations in a similar area, which you
can refer to here:

/messages/by-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

What is not as clear is whether the proposed optimizations will fix the
problem -- I'd have to get approval to test and confirm them in
production, which seems unlikely at this juncture; that's why I'm
trying to keep things 'win-win' so that they don't have to be
accepted on that basis.

With Regards,
Amit Kapila.


#9Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#1)
Re: StrategyGetBuffer optimization, take 2

On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila <amit.kapila@huawei.com> wrote:


Yes, one idea which was discussed previously is to not increase the
usage count every time a buffer is pinned.
I am also working on some optimizations in a similar area, which you
can refer to here:

/messages/by-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

yup -- just took a quick look at your proposed patch. You're
attacking the 'freelist' side of buffer allocation where my stripped
down patch addresses issues with the clocksweep. I think this is a
good idea but more than I wanted to get into personally.

Good news is that both patches should essentially bolt on together
AFAICT. I propose we do a bit of consolidation of performance testing
efforts and run tests with patch A, B, and AB in various scenarios. I
have a 16 core vm (4gb ram) that I can test with and want to start
with say 2gb database 1gb shared_buffers high concurrency test and see
how it burns in. What do you think? Are you at a point where we can
run some tests?

merlin


#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Merlin Moncure (#9)
Re: StrategyGetBuffer optimization, take 2

Merlin Moncure wrote:
On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila
<amit(dot)kapila(at)huawei(dot)com> wrote:

I have some very strong evidence that the problem is coming out of the
buffer allocator. Exhibit A is that vlad's presentation of the
problem was on a read only load (if not allocator lock, then what?).
Exhibit B is that lowering shared buffers to 2gb seems to have (so
far, 5 days in) fixed the issue. This problem shows up on fast
machines with fast storage and lots of cores. So what I think is
happening is that usage_count starts creeping up faster than it gets
cleared by the sweep with very large buffer settings which in turn
causes the 'problem' buffers to be analyzed for eviction more often.

Yes, one idea which was discussed previously is to not increase the
usage count every time a buffer is pinned.
I am also working on some optimizations in a similar area, which you
can refer to here:

/messages/by-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

yup -- just took a quick look at your proposed patch. You're
attacking the 'freelist' side of buffer allocation where my stripped
down patch addresses issues with the clocksweep. I think this is a
good idea but more than I wanted to get into personally.

Good news is that both patches should essentially bolt on together
AFAICT.

True, I also think so as both are trying to reduce contention in same area.

I propose we do a bit of consolidation of performance testing
efforts and run tests with patch A, B, and AB in various scenarios. I
have a 16 core vm (4gb ram) that I can test with and want to start
with say 2gb database 1gb shared_buffers high concurrency test and see
how it burns in. What do you think?

I think this can mainly benefit with large data and shared buffers (>
10G); last year I also ran a few tests with similar ideas but didn't
see much benefit with smaller shared buffers.

Are you at a point where we can
run some tests?

Not now, but I will try to run before/during next CF.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#11Merlin Moncure
mmoncure@gmail.com
In reply to: Amit Kapila (#10)
Re: StrategyGetBuffer optimization, take 2

Performance testing this patch is a real bugaboo for me; the VMs I have to
work with are too unstable to give useful results :-(. Need to scrounge up
a donor box somewhere...


#12Jim Nasby
jnasby@enova.com
In reply to: Merlin Moncure (#11)
Re: StrategyGetBuffer optimization, take 2

On 8/14/13 8:30 AM, Merlin Moncure wrote:

Performance testing this patch is a real bugaboo for me; the VMs I have to work with are too unstable to give useful results :-(. Need to scrounge up a donor box somewhere...

I offered a server or two to the community a while ago but I don't think anything was ever resolved. It wouldn't have a massive number of cores (only 24 IIRC), but it would have a lot of memory (definitely over 192G; maybe more).

The community just needs to decide it wants it and where it gets shipped to...
--
Jim Nasby, Lead Data Architect
(512) 569-9461 (primary) (512) 579-9024 (backup)


#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Merlin Moncure (#11)
Re: StrategyGetBuffer optimization, take 2

On Wed, Aug 14, 2013 at 7:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

Performance testing this patch is a real bugaboo for me; the VMs I have to
work with are too unstable to give useful results :-(. Need to scrounge up
a donor box somewhere...

While doing performance tests in this area, I always had a feeling
that the OS layer (the scheduler that flushes OS buffers)
had a role to play here, so I used to reboot the machine between tests;
of course, taking performance data in this fashion is tedious.
I think it would be good to run tests on a machine with a configuration
similar to what Jim Nasby mentioned in his mail below.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#14Robert Haas
robertmhaas@gmail.com
In reply to: Merlin Moncure (#1)
Re: StrategyGetBuffer optimization, take 2

On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

*) What I think is happening:
I think we are again getting burned by getting de-scheduled while
holding the free list lock. I've been chasing this problem for a long
time now (for example, see:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
but now I've got a reproducible case. What is happening is this:

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI. this goes down the chain until StrategyGetBuffer()
3. Lock free list, go into clock sweep loop
4. while holding clock sweep, hit 'hot' buffer, spin on it
5. get de-scheduled
6. now enter the 'hot buffer spin lock lottery'
7. more and more backends pile on, the Linux scheduler goes berserk,
reducing chances of winning #6
8. finally win the lottery. lock released. everything back to normal.

This is an interesting theory, but where's the evidence? I've seen
spinlock contention come from enough different places to be wary of
arguments that start with "it must be happening because...".

IMHO, the thing to do here is run perf record -g during one of the
trouble periods. The performance impact is quite low. You could
probably even set up a script that runs perf for five minute intervals
at a time and saves all of the perf.data files. When one of these
spikes happens, grab the one that's relevant.

If you see that s_lock is where all the time is going, then you've
proved it's a PostgreSQL spinlock rather than something in the kernel
or a shared library. If you can further see what's calling s_lock
(which should hopefully be possible with perf -g), then you've got it
nailed dead to rights.

...Robert


#15Merlin Moncure
mmoncure@gmail.com
In reply to: Robert Haas (#14)
Re: StrategyGetBuffer optimization, take 2

On Sat, Aug 17, 2013 at 10:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

*) What I think is happening:
I think we are again getting burned by getting de-scheduled while
holding the free list lock. I've been chasing this problem for a long
time now (for example, see:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
but now I've got a reproducible case. What is happening is this:

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI. this goes down the chain until StrategyGetBuffer()
3. Lock free list, go into clock sweep loop
4. while holding clock sweep, hit 'hot' buffer, spin on it
5. get de-scheduled
6. now enter the 'hot buffer spin lock lottery'
7. more and more backends pile on, the Linux scheduler goes berserk,
reducing chances of winning #6
8. finally win the lottery. lock released. everything back to normal.

This is an interesting theory, but where's the evidence? I've seen
spinlock contention come from enough different places to be wary of
arguments that start with "it must be happening because...".

Absolutely. My evidence is circumstantial at best -- let's call it a
hunch. I also do not think we are facing pure spinlock contention,
but something more complex which is a combination of spinlocks, the
free list lwlock, and the linux scheduler. This problem showed up
going from RHEL 5->6 which brought a lot of scheduler changes. A lot
of other things changed too, but the high sys cpu really suggests we
are getting some feedback from the scheduler.

IMHO, the thing to do here is run perf record -g during one of the
trouble periods. The performance impact is quite low. You could
probably even set up a script that runs perf for five minute intervals
at a time and saves all of the perf.data files. When one of these
spikes happens, grab the one that's relevant.

Unfortunately -- that's not on the table. Dropping shared buffers to
2GB (thanks RhodiumToad) seems to have fixed the issue and there is
zero chance I will get approval to revert that setting in order to
force this to re-appear. So far, I have not been able to reproduce in
testing. By the way, this problem has popped up in other places too;
and the typical remedies are applied until it goes away :(.

If you see that s_lock is where all the time is going, then you've
proved it's a PostgreSQL spinlock rather than something in the kernel
or a shared library. If you can further see what's calling s_lock
(which should hopefully be possible with perf -g), then you've got it
nailed dead to rights.

Well, I don't think it's that simple. So my plan of action is this:
1) Improvise a patch that removes one *possible* trigger for the
problem, or at least makes it much less likely to occur. Also, in
real world cases where usage_count is examined N times before
returning a candidate buffer, the amount of overall spinlocking from
buffer allocating is reduced by approximately (N-1)/N. Even though
spin locking is cheap, it's hard to argue with that...

2) Exhaustively performance test patch #1. I think this is win-win
since SGB clock sweep loop is quite frankly relatively un-optimized. I
don't see how reducing the amount of locking could hurt performance
but I've been, uh, wrong about these types of things before.

3) If a general benefit without downside is shown from #2, I'll simply
advance the patch for the next CF and see how things shake out. If and
when I feel like there's a decent shot at getting accepted, I may go
through the motions of setting up a patched server in production and
attempting shared buffers. But that's a long way off.

merlin


#16Jeff Janes
jeff.janes@gmail.com
In reply to: Merlin Moncure (#1)
Re: StrategyGetBuffer optimization, take 2

On Mon, Aug 5, 2013 at 8:49 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

*) What I think is happening:
I think we are again getting burned by getting de-scheduled while
holding the free list lock. I've been chasing this problem for a long
time now (for example, see:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
but now I've got a reproducible case. What is happening is this:

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI. this goes down the chain until StrategyGetBuffer()
3. Lock free list, go into clock sweep loop
4. while holding clock sweep, hit 'hot' buffer, spin on it
5. get de-scheduled

It seems more likely to me that it got descheduled immediately after
obtaining the hot buffer header spinlock but before releasing it,
rather than while still spinning for it.

6. now enter the 'hot buffer spin lock lottery'
7. more and more backends pile on, the Linux scheduler goes berserk,
reducing chances of winning #6

Lots and lots of processes (which want the buffer to use, not to
evict) are spinning for the lock held by the descheduled process on
the buffer header, and the scheduler goes berserk. A bunch of other
processes are waiting on the relation extension lock, but are doing so
non-berserk-ly on the semaphore.

Each spinning process puts itself to sleep without having consumed its
slice once it hits spins_per_delay, and so the scheduler keeps
rescheduling them; rather than scheduling the one that is holding the
lock, which consumed its entire slice and so is on the low priority to
reschedule list.

That is my theory, of course. I'm not sure if it leads to a different
course of action than the theory it has not yet obtained the header
lock.

Any idea what spins_per_delay has converged to? There seems to be no
way to obtain this info, other than firing up the debugger. I had a
patch somewhere that has every process elog(LOG,...) the value which
it fetches from shared memory immediately upon start-up. It is a pain
to hunt down the patch, apply it, and compile and restart the server
every time I want this value. Maybe something like this could be put
in core with a suitable condition, but I don't know what that
condition would be. Or would creating a C function with a
pg_spins_per_delay() wrapper function which returns this value on
demand be the way to go?

Is the analysis and the patch to fix the perceived problem plausible
without breaking other stuff? If so, I'm inclined to go further with
this. This is not the only solution on the table for high buffer
contention, but IMNSHO it should get a lot of points for being very
localized. Maybe a reduced version could be tried retaining the
freelist lock but keeping the 'trylock' on the buf header.

My concern is how we can ever move this forward. If we can't recreate
it on a test system, and you probably won't be allowed to push
experimental patches to the production system....what's left?

Also, if the kernel is introducing new scheduling bottlenecks, are we
just playing whack-a-mole by trying to deal with it in PG code?

Stepping back a bit, have you considered a connection pooler to
restrict the number of simultaneously active connections? It wouldn't
be the first time that that has alleviated the symptoms of these
high-RAM kernel bottlenecks. (If we are going to play whack-a-mole,
might as well use a hammer that already exists and is well tested)

Cheers,

Jeff


#17Jeff Janes
jeff.janes@gmail.com
In reply to: Merlin Moncure (#4)
Re: StrategyGetBuffer optimization, take 2

On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

Attached is a scaled down version of the patch that keeps the freelist
lock but still removes the spinlock during the clock sweep. This
still hits the major objectives of reducing the chance of scheduling
out while holding the BufFreelistLock and mitigating the worst case
impact of doing so if it does happen. An even more scaled down
version would keep the current logic exactly as is except for
replacing buffer lock in the clock sweep with a trylock (which is
IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests
from "buf->usage_count == 0" to "buf->usage_count <= 0" accomplishes
what you need it to? If usage_count gets decremented when it already
zero, it will wrap around to 65,535, at least on some compilers some
of the time, won't it?

It seems safer just not to decrement if we can't get the lock.

Cheers,

Jeff


#18Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#16)
Re: StrategyGetBuffer optimization, take 2

On Mon, Aug 19, 2013 at 5:02 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

My concern is how we can ever move this forward. If we can't recreate
it on a test system, and you probably won't be allowed to push
experimental patches to the production system....what's left?

Also, if the kernel is introducing new scheduling bottlenecks, are we
just playing whack-a-mole by trying to deal with it in PG code?

I think I can rest on the basis that there is no good reason to sit
and spin on a buffer during clock sweep -- even if I have to whittle
the patch down to the '1 liner' variant that trylocks the buffer
header. I also like the idea of getting rid of the freelist lock
completely but I think those ideas can be tested in isolation. Really
though, the acid-test is going to be exhaustive performance testing.

Stepping back a bit, have you considered a connection pooler to
restrict the number of simultaneously active connections? It wouldn't
be the first time that that has alleviated the symptoms of these
high-RAM kernel bottlenecks. (If we are going to play whack-a-mole,
might as well use a hammer that already exists and is well tested)

Certainly. It was the very first thing I suggested upon hearing about
this problem. Unfortunately, when these kinds of things happen in
high scale, high load databases on serious hardware improvising
pgbouncer is simply not possible without first going through extensive
performance and application compatibility testing.

merlin


#19Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#17)
Re: StrategyGetBuffer optimization, take 2

On Mon, Aug 19, 2013 at 5:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

Attached is a scaled down version of the patch that keeps the freelist
lock but still removes the spinlock during the clock sweep. This
still hits the major objectives of reducing the chance of scheduling
out while holding the BufFreelistLock and mitigating the worst case
impact of doing so if it does happen. An even more scaled down
version would keep the current logic exactly as is except for
replacing buffer lock in the clock sweep with a trylock (which is
IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests
from "buf->usage_count == 0" to "buf->usage_count <= 0" accomplishes
what you need it to? If usage_count gets decremented when it already
zero, it will wrap around to 65,535, at least on some compilers some
of the time, won't it?

It seems safer just not to decrement if we can't get the lock.

Hurk -- well, maybe it should be changed to signed in this
implementation (adjustment w/o lock).

Safer maybe, but you lose a part of the optimization: not having to
spam cache line locks as you constantly spinlock your way around the
clock. Maybe that doesn't matter much -- I'm inclined to test it both
ways and see -- plus maybe a 3rd variant that manages the freelist
itself with a spinlock as well.

variant A: buffer spinlock -> trylock (1 liner!)
variant B: as above, but usage_count manipulation occurs outside of lock
variant C: as above, but dispense with lwlock wholly or in part (plus
possibly stuff the freelist etc).

merlin


#20Andres Freund
andres@2ndquadrant.com
In reply to: Jeff Janes (#17)
Re: StrategyGetBuffer optimization, take 2

On 2013-08-19 15:17:44 -0700, Jeff Janes wrote:

On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

Attached is a scaled down version of the patch that keeps the freelist
lock but still removes the spinlock during the clock sweep. This
still hits the major objectives of reducing the chance of scheduling
out while holding the BufFreelistLock and mitigating the worst case
impact of doing so if it does happen. An even more scaled down
version would keep the current logic exactly as is except for
replacing buffer lock in the clock sweep with a trylock (which is
IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests
from "buf->usage_count == 0" to "buf->usage_count <= 0" accomplishes
what you need it to? If usage_count gets decremented when it already
zero, it will wrap around to 65,535, at least on some compilers some
of the time, won't it?

Overflow of *unsigned* variables is actually defined and will always
wrap around. It's signed variables which don't have such a clear
behaviour.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#21Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#20)
Re: StrategyGetBuffer optimization, take 2

On Tue, Aug 20, 2013 at 1:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-19 15:17:44 -0700, Jeff Janes wrote:

On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

Attached is a scaled down version of the patch that keeps the freelist
lock but still removes the spinlock during the clock sweep. This
still hits the major objectives of reducing the chance of scheduling
out while holding the BufFreelistLock and mitigating the worst case
impact of doing so if it does happen. An even more scaled down
version would keep the current logic exactly as is except for
replacing buffer lock in the clock sweep with a trylock (which is
IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests
from "buf->usage_count == 0" to "buf->usage_count <= 0" accomplishes
what you need it to? If usage_count gets decremented when it already
zero, it will wrap around to 65,535, at least on some compilers some
of the time, won't it?

Overflow of *unsigned* variables is actually defined and will always
wrap around. It's signed variables which don't have such a clear
behaviour.

Hm, well, even better would be to leave things as they are and try to
guarantee that usage_count is updated via assignment vs increment;
that way it would be impossible to wander out of bounds. I bet
changing:
i--; to i=(i-1);

isn't going to do much against modern compilers. But what about
assignment from a volatile temporary?

volatile int v = usage_count;
if (v > 0) v--;
usage_count = v;

something like that. Or maybe declaring usage_count as volatile might
be enough?

merlin
