Hot Standby tuning for btree_xlog_vacuum()
Simple tuning of btree_xlog_vacuum() using an idea I had a while back,
just never implemented. XXX comments removed.
Allows us to avoid reading in blocks during VACUUM replay that are only
required for correctness of index scans.
Objections to commit?
--
Simon Riggs www.2ndQuadrant.com
Attachments:
tune_hs_vacuum_replay.patch (text/x-patch)
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
***************
*** 486,505 **** btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
for (; blkno < xlrec->block; blkno++)
{
/*
! * XXX we don't actually need to read the block, we just need to
! * confirm it is unpinned. If we had a special call into the
! * buffer manager we could optimise this so that if the block is
! * not in shared_buffers we confirm it as unpinned.
! *
! * Another simple optimization would be to check if there's any
! * backends running; if not, we could just skip this.
*/
! buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
! if (BufferIsValid(buffer))
! {
! LockBufferForCleanup(buffer);
! UnlockReleaseBuffer(buffer);
! }
}
}
--- 486,496 ----
for (; blkno < xlrec->block; blkno++)
{
/*
! * We don't actually need to read the block, we just need to
! * confirm it is unpinned, since if it's not in shared_buffers then
! * we're OK.
*/
! XLogConfirmBufferIsUnpinned(xlrec->node, MAIN_FORKNUM, blkno);
}
}
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
***************
*** 342,347 **** XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
--- 342,377 ----
return buffer;
}
+ void
+ XLogConfirmBufferIsUnpinned(RelFileNode rnode, ForkNumber forknum,
+ BlockNumber blkno)
+ {
+ BlockNumber lastblock;
+ SMgrRelation smgr;
+
+ Assert(blkno != P_NEW);
+
+ /* Open the relation at smgr level */
+ smgr = smgropen(rnode);
+
+ /*
+ * Create the target file if it doesn't already exist. This lets us cope
+ * if the replay sequence contains writes to a relation that is later
+ * deleted. (The original coding of this routine would instead suppress
+ * the writes, but that seems like it risks losing valuable data if the
+ * filesystem loses an inode during a crash. Better to write the data
+ * until we are actually told to delete the file.)
+ */
+ smgrcreate(smgr, forknum, true);
+
+ lastblock = smgrnblocks(smgr, forknum);
+
+ if (blkno >= lastblock)
+ return;
+
+ /* page exists in file */
+ ConfirmBufferIsUnpinned(rnode, forknum, blkno);
+ }
/*
* Struct actually returned by XLogFakeRelcacheEntry, though the declared
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 475,480 **** ReadBuffer_common(SMgrRelation smgr, bool isLocalBuf, ForkNumber forkNum,
--- 475,520 ----
return BufferDescriptorGetBuffer(bufHdr);
}
+ void
+ ConfirmBufferIsUnpinned(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum)
+ {
+ BufferTag bufTag; /* identity of requested block */
+ uint32 bufHash; /* hash value for newTag */
+ LWLockId bufPartitionLock; /* buffer partition lock for it */
+ int buf_id;
+ SMgrRelation smgr = smgropen(rnode);
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(bufTag, smgr->smgr_rnode, forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ bufHash = BufTableHashCode(&bufTag);
+ bufPartitionLock = BufMappingPartitionLock(bufHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(bufPartitionLock, LW_SHARED);
+
+ buf_id = BufTableLookup(&bufTag, bufHash);
+
+ /*
+ * If buffer isn't present it must be unpinned.
+ */
+ if (buf_id >= 0)
+ {
+ volatile BufferDesc *buf;
+
+ buf = &BufferDescriptors[buf_id];
+
+ /*
+ * Found it. Now, pin/unpin the buffer to prove it's unpinned.
+ */
+ if (PinBuffer(buf, NULL))
+ UnpinBuffer(buf, false);
+ }
+
+ LWLockRelease(bufPartitionLock);
+ }
+
/*
* BufferAlloc -- subroutine for ReadBuffer. Handles lookup of a shared
* buffer. If no buffer exists already, selects a replacement
*** a/src/include/access/xlogutils.h
--- b/src/include/access/xlogutils.h
***************
*** 28,33 **** extern void XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
--- 28,35 ----
extern Buffer XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init);
extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
BlockNumber blkno, ReadBufferMode mode);
+ extern void XLogConfirmBufferIsUnpinned(RelFileNode rnode, ForkNumber forknum,
+ BlockNumber blkno);
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 163,168 **** extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
--- 163,170 ----
extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode, bool isTemp,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy);
+ extern void ConfirmBufferIsUnpinned(RelFileNode rnode, ForkNumber forkNum,
+ BlockNumber blockNum);
extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
Simon Riggs <simon@2ndQuadrant.com> writes:
Objections to commit?
This is not the time to be hacking stuff like this. You haven't even
demonstrated that there's a significant performance issue here.
regards, tom lane
On Apr 29, 2010, at 3:20 PM, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Objections to commit?
This is not the time to be hacking stuff like this. You haven't even
demonstrated that there's a significant performance issue here.
I tend to agree that this point of the cycle isn't a good one to be making changes, but your performance statement confuses me. If a fairly small patch means we can avoid unnecessary reads, why shouldn't we avoid them?
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Jim Nasby <decibel@decibel.org> writes:
On Apr 29, 2010, at 3:20 PM, Tom Lane wrote:
This is not the time to be hacking stuff like this. You haven't even
demonstrated that there's a significant performance issue here.
I tend to agree that this point of the cycle isn't a good one to be making changes, but your performance statement confuses me. If a fairly small patch means we can avoid unnecessary reads, why shouldn't we avoid them?
Well, by "time of the cycle" I meant "the day before beta1". I'm not
necessarily averse to making such a change at some point when it would
get more than no testing before hitting our long-suffering beta testers.
But I'd still want to see some evidence that there's a significant
performance improvement to be had.
regards, tom lane
On Mon, 2010-05-17 at 16:10 -0400, Tom Lane wrote:
Jim Nasby <decibel@decibel.org> writes:
On Apr 29, 2010, at 3:20 PM, Tom Lane wrote:
This is not the time to be hacking stuff like this. You haven't even
demonstrated that there's a significant performance issue here.
I tend to agree that this point of the cycle isn't a good one to be making changes, but your performance statement confuses me. If a fairly small patch means we can avoid unnecessary reads, why shouldn't we avoid them?
Well, by "time of the cycle" I meant "the day before beta1". I'm not
necessarily averse to making such a change at some point when it would
get more than no testing before hitting our long-suffering beta testers.
But I'd still want to see some evidence that there's a significant
performance improvement to be had.
That patch only applies to one record type. However, since we've used
Greg's design of spidering out to each heap record, that can usually mean
150-200 random I/Os per btree delete. That will take some time, perhaps
1s per WAL record of this type on a very large I/O-bound table. That's
enough to give me cause for concern even without performance measurements.
To derive such measurements we'd need to instrument each record type,
which we don't do right now either.
It might be easier to have a look at the patch and see if you think it's
worth the fuss of measuring it.
I don't think this is the patch that will correct the potential/partially
observed context-switching issue, but we have yet to recreate that in lab
conditions.
--
Simon Riggs www.2ndQuadrant.com
On Thu, Apr 29, 2010 at 4:12 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Simple tuning of btree_xlog_vacuum() using an idea I had a while back,
just never implemented. XXX comments removed.
Allows us to avoid reading in blocks during VACUUM replay that are only
required for correctness of index scans.
Review:
1. The block comment in XLogConfirmBufferIsUnpinned appears to be
copied from somewhere else, and doesn't really seem appropriate for a
new function since it refers to "the original coding of this routine".
I think you could just delete the parenthesized portion of the
comment.
2. This bit from ConfirmBufferIsUnpinned looks odd to me.
+ /*
+ * Found it. Now, pin/unpin the buffer to prove it's unpinned.
+ */
+ if (PinBuffer(buf, NULL))
+ UnpinBuffer(buf, false);
I don't think pinning and unpinning the buffer is sufficient to
prove that it isn't otherwise pinned. If the buffer isn't in shared
buffers at all, it seems clear that no one can have it pinned. But if
it's present in shared buffers, it seems like you still need
LockBufferForCleanup().
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Just wanted to say thanks for the review, since I haven't yet managed to
fix and commit this. I expect to later this month.
On Mon, 2010-09-27 at 23:06 -0400, Robert Haas wrote:
On Thu, Apr 29, 2010 at 4:12 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Simple tuning of btree_xlog_vacuum() using an idea I had a while back,
just never implemented. XXX comments removed.
Allows us to avoid reading in blocks during VACUUM replay that are only
required for correctness of index scans.
Review:
1. The block comment in XLogConfirmBufferIsUnpinned appears to be
copied from somewhere else, and doesn't really seem appropriate for a
new function since it refers to "the original coding of this routine".
I think you could just delete the parenthesized portion of the
comment.
2. This bit from ConfirmBufferIsUnpinned looks odd to me.
+ /*
+ * Found it. Now, pin/unpin the buffer to prove it's unpinned.
+ */
+ if (PinBuffer(buf, NULL))
+ UnpinBuffer(buf, false);
I don't think pinning and unpinning the buffer is sufficient to
prove that it isn't otherwise pinned. If the buffer isn't in shared
buffers at all, it seems clear that no one can have it pinned. But if
it's present in shared buffers, it seems like you still need
LockBufferForCleanup().
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services