Block level concurrency during recovery
I'm looking at how to make queries safe during recovery, in the presence
of concurrent changes to blocks. In particular, concurrent removal of
rows that might be read by queries.
My thinking is
* we ignore LockRelationForExtension(). Normal queries never request it.
All new blocks were created with that lock held and we are replaying
changes serially, so we do not need to re-create that lock. We already
do this, so no change.
* re-create the Cleanup lock on blocks, when the original operation was
performed while a Cleanup lock was held.
So the plan is to introduce a new XLogLockBufferForCleanup() function
and then use it in all places where a cleanup lock was held during the
write of the WAL record.
This means we will need to hold a cleanup lock:
* while RM_HEAP2_ID records are applied (Freeze, Clean, CleanMove)
* while applying an XLOG_BTREE_DELETE that was generated by VACUUM - this
record type is not always generated by VACUUM. So split this WAL record into
two types, XLOG_BTREE_DELETE and XLOG_BTREE_VACUUM, so we can hold a Cleanup
lock while applying XLOG_BTREE_VACUUM. (This may not be required, but
I'd rather do the full locking now and relax it later.)
* Whenever we apply a backup block that performs the same function as
any of the above WAL records. So we would hold Cleanup lock when
applying WAL records of types
all RM_HEAP2_ID types
XLOG_BTREE_VACUUM
I'm most of the way through implementing the above and will post patch
as a separate item to make it easier to inspect.
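For clarity, here is a minimal sketch of what XLogLockBufferForCleanup() could
amount to; LockBuffer() and LockBufferForCleanup() are the existing bufmgr
calls, and the final implementation may well look different:
#include "storage/bufmgr.h"
/*
 * Rough sketch only, not the final implementation: swap the ordinary
 * exclusive lock held on a buffer read during redo for a cleanup
 * ("super-exclusive") lock, so redo holds the same lock strength that the
 * original operation held when it wrote the WAL record.
 */
static void
XLogLockBufferForCleanup(Buffer buffer)
{
	/* drop the ordinary exclusive lock taken when the buffer was read */
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
	/*
	 * Wait until the startup process holds the only pin on the page,
	 * exactly as LockBufferForCleanup() does for VACUUM on the master.
	 * Queries that still pin the page will make redo wait here.
	 */
	LockBufferForCleanup(buffer);
}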
Other aspects:
* For GIN indexes, we appear to not hold a Cleanup lock during
vacuuming, except on root page. That stops new scans from starting, but
it doesn't prevent progress of concurrent scans. Doesn't look correct to
me... so not sure what strength lock to acquire in each case. Probably
need to differentiate between WAL record types so we can tell which to
acquire CleanupLock for and which not.
* GIST doesn't use CleanupLocks at all. So I'm very unclear here.
Teodor has mentioned that it should be OK for GIST/GIN. Can I double
check that based upon my inspection of the code?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
* re-create the Cleanup lock on blocks, when the original operation was
performed while a Cleanup lock was held.
So the plan is to introduce a new XLogLockBufferForCleanup() function
and then use it in all places where a cleanup lock was held during the
write of the WAL record.
I'm not sure I'm following what you're talking about. Are you saying the slave
will have to stop when it reaches a cleanup log entry and wait until it can
obtain the exclusive lock with no other pins?
That sounds like the right thing though it could cause a long pause in WAL
recovery. Perhaps we should plan on being able to kill other processes holding
pins on the buffer just as we discussed killing transactions holding up
advancing the globalxmin.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
On Mon, 2008-10-20 at 14:23 +0100, Gregory Stark wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
* re-create the Cleanup lock on blocks, when the original operation was
performed while a Cleanup lock was held.
So the plan is to introduce a new XLogLockBufferForCleanup() function
and then use it in all places where a cleanup lock was held during the
write of the WAL record.
I'm not sure I'm following what you're talking about. Are you saying the slave
will have to stop when it reaches a cleanup log entry and wait until it can
obtain the exclusive lock with no other pins?
Yes.
That sounds like the right thing though it could cause a long pause in WAL
recovery.
It could.
Perhaps we should plan on being able to kill other processes holding
pins on the buffer just as we discussed killing transactions holding up
advancing the globalxmin.
I'll copy my notes straight out of the Wiki:
We can't cancel a query that holds a pin on a particular block since we
don't keep track of who holds a pin. We just know a backend has a pin
and that the startup process must wait.
This may become a problem, or it may not. In most cases, backends hold
5-15 pins concurrently and pins are held for relatively short times.
The startup process will provide debug information for cases where pin hold
times are extended, and for monitoring the total delay from pin waits.
Some strategies, if this becomes a problem:
* minimise pin hold times wherever this occurs
* deferring application of WAL for pinned blocks (requiring us to
use conditional locking; see the sketch just after this list)
* making VACUUM FREEZE hold stronger locks on standby to prevent
query access (but doesn't help HOT)
* ensuring that we perform read-ahead I/O for WAL records that
have not yet reached the head of the apply queue
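To illustrate the deferral strategy (second bullet above), the replay loop
could try the lock conditionally and set the record aside instead of blocking.
This is only a sketch: ConditionalLockBufferForCleanup() is assumed here (it
may need to be added), and requeue_for_later() / redo_cleanup_action() are
placeholders for the startup process's bookkeeping and the usual redo work:
static bool
try_apply_cleanup_record(Buffer buffer, XLogRecord *record)
{
	if (!ConditionalLockBufferForCleanup(buffer))
	{
		/* someone still holds a pin: defer the record and keep replaying */
		requeue_for_later(record);
		ReleaseBuffer(buffer);
		return false;
	}
	/* got the super-exclusive lock; do the same work a blocking apply would */
	redo_cleanup_action(buffer, record);
	UnlockReleaseBuffer(buffer);
	return true;
}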
Most people on-list were comfortable with the idea of recovery waiting
for queries to finish, up to a limit, re: discussion around
max_standby_delay. So waiting for cleanup locks is just another source
of potential wait time.
I think it will take a while for people to come up with some user
experience of whether this will be a pain or not.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
* For GIN indexes, we appear to not hold a Cleanup lock during
vacuuming, except on root page. That stops new scans from starting, but
it doesn't prevent progress of concurrent scans. Doesn't look correct to
me... so not sure what strength lock to acquire in each case. Probably
Why do you think so?
need to differentiate between WAL record types so we can tell which to
acquire CleanupLock for and which not.
* GIST doesn't use CleanupLocks at all. So I'm very unclear here.
The scan process does not hold any pointer to a page now; the insertion process
does, but it tracks changes to the page by looking at the page's LSN.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, 2008-10-20 at 18:28 +0400, Teodor Sigaev wrote:
* For GIN indexes, we appear to not hold a Cleanup lock during
vacuuming, except on root page. That stops new scans from starting, but
it doesn't prevent progress of concurrent scans. Doesn't look correct to
me... so not sure what strength lock to acquire in each case. Probably
Why do you think so?
I have questions.
I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
root page for Cleanup and no others. Why do we need to hold Cleanup lock
on the root? If the index is concurrent safe for existing scans, why is
it not safe for new scans also? And the converse: if it is not safe for
new scans, why is it safe for existing scans?
need to differentiate between WAL record types so we can tell which to
acquire CleanupLock for and which not.
* GIST doesn't use CleanupLocks at all. So I'm very unclear here.
The scan process does not hold any pointer to a page now; the insertion process
does, but it tracks changes to the page by looking at the page's LSN.
Sounds good.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
root page for Cleanup and no others. Why do we need to hold Cleanup lock
on the root? If the index is concurrent safe for existing scans, why is
it not safe for new scans also? And the converse: if it is not safe for
new scans, why is it safe for existing scans?
Because we wish to prevent concurrent inserts and page deletion, just to
simplify the code. LockForCleanup guarantees that the insertion process is not
at work here (it keeps the root buffer pinned for the whole insertion). New
scan processes can't start, as a side effect.
Note that in most cases this keeps enough concurrency, because all of this
concerns work on one tree within the GIN index. Usually there are a lot of such
trees in an index - one per lexeme, if we speak about a tsearch index. So there
is room for improvement, but I don't believe it would give a big performance
advantage in typical usage of GIN.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, 2008-10-20 at 20:12 +0400, Teodor Sigaev wrote:
I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
root page for Cleanup and no others. Why do we need to hold Cleanup lock
on the root? If the index is concurrent safe for existing scans, why is
it not safe for new scans also? And the converse: if it is not safe for
new scans, why is it safe for existing scans?
Because we wish to prevent concurrent inserts and page deletion, just to
simplify the code. LockForCleanup guarantees that the insertion process is not
at work here (it keeps the root buffer pinned for the whole insertion). New
scan processes can't start, as a side effect.
But does holding cleanup lock on root prevent an in-progress Insert from
changing non-root pages? I assume so, just not sure how.
Note that in most cases this keeps enough concurrency, because all of this
concerns work on one tree within the GIN index. Usually there are a lot of such
trees in an index - one per lexeme, if we speak about a tsearch index. So there
is room for improvement, but I don't believe it would give a big performance
advantage in typical usage of GIN.
I'm just worried about safety during Hot Standby, not trying to improve
anything.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
But does holding cleanup lock on root prevent an in-progress Insert from
changing non-root pages? I assume so, just not sure how.
Yes, because the insertion process doesn't unpin the root page until the insert is done.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, 2008-10-20 at 13:54 +0100, Simon Riggs wrote:
I'm looking at how to make queries safe during recovery, in the presence
of concurrent changes to blocks. In particular, concurrent removal of
rows that might be read by queries.
My thinking is
* we ignore LockRelationForExtension(). Normal queries never request it.
All new blocks were created with that lock held and we are replaying
changes serially, so we do not need to re-create that lock. We already
do this, so no change.
* re-create the Cleanup lock on blocks, when the original operation was
performed while a Cleanup lock was held.
So the plan is to introduce a new XLogLockBufferForCleanup() function
and then use it in all places where a cleanup lock was held during the
write of the WAL record.
This means we will need to hold a cleanup lock:
* while RM_HEAP2_ID records are applied (Freeze, Clean, CleanMove)
* while applying an XLOG_BTREE_DELETE that was generated by VACUUM - this
record type is not always generated by VACUUM. So split this WAL record into
two types, XLOG_BTREE_DELETE and XLOG_BTREE_VACUUM, so we can hold a Cleanup
lock while applying XLOG_BTREE_VACUUM. (This may not be required, but
I'd rather do the full locking now and relax it later.)
* Whenever we apply a backup block that performs the same function as
any of the above WAL records. So we would hold Cleanup lock when
applying WAL records of types
all RM_HEAP2_ID types
XLOG_BTREE_VACUUM
I'm most of the way through implementing the above and will post patch
as a separate item to make it easier to inspect.
Here's the code to implement this portion.
I'm undecided as to whether this is too much code or too little. I'm
somewhat uncertain as to the locking requirements for the physical scan
during a vacuum. I've worked out various options if we need to change
this.
Also, the way I take a cleanup lock during RestoreBkpBlocks() seems very
ugly, but too much additional code seems over the top also.
I'm integrating this into the rest of the patch now.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
hs_block.v2b.patch (text/x-patch; charset=utf-8)
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.265
diff -c -r1.265 heapam.c
*** src/backend/access/heap/heapam.c 8 Oct 2008 01:14:44 -0000 1.265
--- src/backend/access/heap/heapam.c 22 Oct 2008 22:00:36 -0000
***************
*** 4033,4039 ****
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
! buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
--- 4033,4039 ----
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
! buffer = XLogReadBufferForCleanup(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
***************
*** 4082,4088 ****
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
! buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
--- 4082,4088 ----
if (record->xl_info & XLR_BKP_BLOCK_1)
return;
! buffer = XLogReadBufferForCleanup(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.167
diff -c -r1.167 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c 11 Jun 2008 08:38:56 -0000 1.167
--- src/backend/access/nbtree/nbtinsert.c 22 Oct 2008 22:00:36 -0000
***************
*** 1924,1930 ****
}
if (ndeletable > 0)
! _bt_delitems(rel, buffer, deletable, ndeletable);
/*
* Note: if we didn't find any LP_DEAD items, then the page's
--- 1924,1930 ----
}
if (ndeletable > 0)
! _bt_delitems(rel, buffer, deletable, ndeletable, false);
/*
* Note: if we didn't find any LP_DEAD items, then the page's
Index: src/backend/access/nbtree/nbtpage.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/nbtree/nbtpage.c,v
retrieving revision 1.111
diff -c -r1.111 nbtpage.c
*** src/backend/access/nbtree/nbtpage.c 30 Sep 2008 10:52:10 -0000 1.111
--- src/backend/access/nbtree/nbtpage.c 22 Oct 2008 22:00:36 -0000
***************
*** 649,658 ****
*
* This routine assumes that the caller has pinned and locked the buffer.
* Also, the given itemnos *must* appear in increasing order in the array.
*/
void
_bt_delitems(Relation rel, Buffer buf,
! OffsetNumber *itemnos, int nitems)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
--- 649,660 ----
*
* This routine assumes that the caller has pinned and locked the buffer.
* Also, the given itemnos *must* appear in increasing order in the array.
+ * If (needCleanupLock) we assume the lock is a super-exclusive lock,
+ * not just a normal block ExclusiveLock.
*/
void
_bt_delitems(Relation rel, Buffer buf,
! OffsetNumber *itemnos, int nitems, bool needCleanupLock)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
***************
*** 715,721 ****
rdata[1].buffer_std = true;
rdata[1].next = NULL;
! recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);
PageSetLSN(page, recptr);
PageSetTLI(page, ThisTimeLineID);
--- 717,726 ----
rdata[1].buffer_std = true;
rdata[1].next = NULL;
! if (needCleanupLock)
! recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM, rdata);
! else
! recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);
PageSetLSN(page, recptr);
PageSetTLI(page, ThisTimeLineID);
Index: src/backend/access/nbtree/nbtree.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/nbtree/nbtree.c,v
retrieving revision 1.163
diff -c -r1.163 nbtree.c
*** src/backend/access/nbtree/nbtree.c 6 Oct 2008 08:04:11 -0000 1.163
--- src/backend/access/nbtree/nbtree.c 22 Oct 2008 22:00:36 -0000
***************
*** 857,863 ****
*/
if (ndeletable > 0)
{
! _bt_delitems(rel, buf, deletable, ndeletable);
stats->tuples_removed += ndeletable;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
--- 857,863 ----
*/
if (ndeletable > 0)
{
! _bt_delitems(rel, buf, deletable, ndeletable, true);
stats->tuples_removed += ndeletable;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
Index: src/backend/access/nbtree/nbtxlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/nbtree/nbtxlog.c,v
retrieving revision 1.52
diff -c -r1.52 nbtxlog.c
*** src/backend/access/nbtree/nbtxlog.c 12 Jun 2008 09:12:30 -0000 1.52
--- src/backend/access/nbtree/nbtxlog.c 22 Oct 2008 22:00:36 -0000
***************
*** 459,465 ****
}
static void
! btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_delete *xlrec;
Buffer buffer;
--- 459,465 ----
}
static void
! btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record, bool needCleanupLock)
{
xl_btree_delete *xlrec;
Buffer buffer;
***************
*** 470,476 ****
return;
xlrec = (xl_btree_delete *) XLogRecGetData(record);
! buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
--- 470,479 ----
return;
xlrec = (xl_btree_delete *) XLogRecGetData(record);
! if (needCleanupLock)
! buffer = XLogReadBufferForCleanup(xlrec->node, xlrec->block, false);
! else
! buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
***************
*** 737,744 ****
case XLOG_BTREE_SPLIT_R_ROOT:
btree_xlog_split(false, true, lsn, record);
break;
case XLOG_BTREE_DELETE:
! btree_xlog_delete(lsn, record);
break;
case XLOG_BTREE_DELETE_PAGE:
case XLOG_BTREE_DELETE_PAGE_META:
--- 740,750 ----
case XLOG_BTREE_SPLIT_R_ROOT:
btree_xlog_split(false, true, lsn, record);
break;
+ case XLOG_BTREE_VACUUM:
+ btree_xlog_delete(lsn, record, true);
+ break;
case XLOG_BTREE_DELETE:
! btree_xlog_delete(lsn, record, false);
break;
case XLOG_BTREE_DELETE_PAGE:
case XLOG_BTREE_DELETE_PAGE_META:
***************
*** 753,758 ****
--- 759,797 ----
}
}
+ bool
+ btree_needs_cleanup_lock(uint8 info)
+ {
+ switch (info)
+ {
+ case XLOG_BTREE_VACUUM:
+ return true;
+ break;
+
+ case XLOG_BTREE_INSERT_LEAF:
+ case XLOG_BTREE_INSERT_UPPER:
+ case XLOG_BTREE_INSERT_META:
+ case XLOG_BTREE_SPLIT_L:
+ case XLOG_BTREE_SPLIT_R:
+ case XLOG_BTREE_SPLIT_L_ROOT:
+ case XLOG_BTREE_SPLIT_R_ROOT:
+ case XLOG_BTREE_DELETE:
+ case XLOG_BTREE_DELETE_PAGE:
+ case XLOG_BTREE_DELETE_PAGE_META:
+ case XLOG_BTREE_DELETE_PAGE_HALF:
+ case XLOG_BTREE_NEWROOT:
+ return false;
+ break;
+
+ default:
+ elog(PANIC, "btree_needs_cleanup_lock: unknown op code %u", info);
+ }
+
+ /* never reached */
+ return false;
+
+ }
+
static void
out_target(StringInfo buf, xl_btreetid *target)
{
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.319
diff -c -r1.319 xlog.c
*** src/backend/access/transam/xlog.c 23 Sep 2008 09:20:35 -0000 1.319
--- src/backend/access/transam/xlog.c 22 Oct 2008 22:03:33 -0000
***************
*** 2887,2892 ****
--- 2887,2905 ----
BkpBlock bkpb;
char *blk;
int i;
+ int mode;
+ RmgrId rmid = record->xl_rmid;
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ /*
+ * What kind of lock do we need to apply the backup blocks?
+ * Ugly hack around needing an additioning RMgr API call.
+ */
+ if (rmid == RM_HEAP2_ID ||
+ (rmid == RM_BTREE_ID && btree_needs_cleanup_lock(info)))
+ mode = BUFFER_LOCK_CLEANUP;
+ else
+ mode = BUFFER_LOCK_EXCLUSIVE;
blk = (char *) XLogRecGetData(record) + record->xl_len;
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
***************
*** 2898,2904 ****
blk += sizeof(BkpBlock);
buffer = XLogReadBufferWithFork(bkpb.node, bkpb.fork, bkpb.block,
! true);
Assert(BufferIsValid(buffer));
page = (Page) BufferGetPage(buffer);
--- 2911,2917 ----
blk += sizeof(BkpBlock);
buffer = XLogReadBufferWithFork(bkpb.node, bkpb.fork, bkpb.block,
! true, mode);
Assert(BufferIsValid(buffer));
page = (Page) BufferGetPage(buffer);
Index: src/backend/access/transam/xlogutils.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.59
diff -c -r1.59 xlogutils.c
*** src/backend/access/transam/xlogutils.c 30 Sep 2008 10:52:11 -0000 1.59
--- src/backend/access/transam/xlogutils.c 22 Oct 2008 22:00:36 -0000
***************
*** 228,234 ****
Buffer
XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init)
{
! return XLogReadBufferWithFork(rnode, MAIN_FORKNUM, blkno, init);
}
/*
--- 228,240 ----
Buffer
XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init)
{
! return XLogReadBufferWithFork(rnode, MAIN_FORKNUM, blkno, init, BUFFER_LOCK_EXCLUSIVE);
! }
!
! Buffer
! XLogReadBufferForCleanup(RelFileNode rnode, BlockNumber blkno, bool init)
! {
! return XLogReadBufferWithFork(rnode, MAIN_FORKNUM, blkno, init, BUFFER_LOCK_CLEANUP);
}
/*
***************
*** 238,244 ****
*/
Buffer
XLogReadBufferWithFork(RelFileNode rnode, ForkNumber forknum,
! BlockNumber blkno, bool init)
{
BlockNumber lastblock;
Buffer buffer;
--- 244,250 ----
*/
Buffer
XLogReadBufferWithFork(RelFileNode rnode, ForkNumber forknum,
! BlockNumber blkno, bool init, int mode)
{
BlockNumber lastblock;
Buffer buffer;
***************
*** 289,295 ****
Assert(BufferGetBlockNumber(buffer) == blkno);
}
! LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
if (!init)
{
--- 295,306 ----
Assert(BufferGetBlockNumber(buffer) == blkno);
}
! if (mode == BUFFER_LOCK_EXCLUSIVE)
! LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
! else if (mode == BUFFER_LOCK_CLEANUP)
! LockBufferForCleanup(buffer);
! else
! elog(FATAL, "Invalid buffer lock mode %d", mode);
if (!init)
{
Index: src/backend/optimizer/util/var.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/optimizer/util/var.c,v
retrieving revision 1.81
diff -c -r1.81 var.c
*** src/backend/optimizer/util/var.c 21 Oct 2008 20:42:53 -0000 1.81
--- src/backend/optimizer/util/var.c 22 Oct 2008 19:47:21 -0000
***************
*** 698,703 ****
--- 698,706 ----
{
if (node == NULL)
return NULL;
+
+ Assert(!IsA(node, AppendRelInfo));
+
if (IsA(node, Var))
{
Var *var = (Var *) node;
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.239
diff -c -r1.239 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c 20 Oct 2008 21:11:15 -0000 1.239
--- src/backend/storage/buffer/bufmgr.c 22 Oct 2008 22:00:36 -0000
***************
*** 2364,2369 ****
--- 2364,2370 ----
Assert(BufferIsValid(buffer));
Assert(PinCountWaitBuf == NULL);
+ Assert(!IsRecoveryProcessingMode() || InRecovery);
if (BufferIsLocal(buffer))
{
Index: src/backend/storage/freespace/freespace.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/freespace/freespace.c,v
retrieving revision 1.64
diff -c -r1.64 freespace.c
*** src/backend/storage/freespace/freespace.c 1 Oct 2008 14:59:23 -0000 1.64
--- src/backend/storage/freespace/freespace.c 22 Oct 2008 22:00:36 -0000
***************
*** 779,785 ****
* replay of the smgr truncation record to remove completely unused
* pages.
*/
! buf = XLogReadBufferWithFork(xlrec->node, FSM_FORKNUM, fsmblk, false);
if (BufferIsValid(buf))
{
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
--- 779,786 ----
* replay of the smgr truncation record to remove completely unused
* pages.
*/
! buf = XLogReadBufferWithFork(xlrec->node, FSM_FORKNUM, fsmblk,
! false, BUFFER_LOCK_CLEANUP);
if (BufferIsValid(buf))
{
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
Index: src/include/access/nbtree.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.121
diff -c -r1.121 nbtree.h
*** src/include/access/nbtree.h 13 Jul 2008 20:45:47 -0000 1.121
--- src/include/access/nbtree.h 22 Oct 2008 22:05:35 -0000
***************
*** 214,225 ****
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
#define XLOG_BTREE_SPLIT_L_ROOT 0x50 /* add tuple with split of root */
#define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */
! #define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuple */
#define XLOG_BTREE_DELETE_PAGE 0x80 /* delete an entire page */
#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, and update metapage */
#define XLOG_BTREE_NEWROOT 0xA0 /* new root page */
#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0 /* page deletion that makes
* parent half-dead */
/*
* All that we need to find changed index tuple
--- 214,226 ----
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
#define XLOG_BTREE_SPLIT_L_ROOT 0x50 /* add tuple with split of root */
#define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */
! #define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_DELETE_PAGE 0x80 /* delete an entire page */
#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, and update metapage */
#define XLOG_BTREE_NEWROOT 0xA0 /* new root page */
#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0 /* page deletion that makes
* parent half-dead */
+ #define XLOG_BTREE_VACUUM 0xC0 /* delete entries on a page during vacuum */
/*
* All that we need to find changed index tuple
***************
*** 537,543 ****
extern void _bt_pageinit(Page page, Size size);
extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems(Relation rel, Buffer buf,
! OffsetNumber *itemnos, int nitems);
extern int _bt_pagedel(Relation rel, Buffer buf,
BTStack stack, bool vacuum_full);
--- 538,544 ----
extern void _bt_pageinit(Page page, Size size);
extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems(Relation rel, Buffer buf,
! OffsetNumber *itemnos, int nitems, bool needCleanupLock);
extern int _bt_pagedel(Relation rel, Buffer buf,
BTStack stack, bool vacuum_full);
Index: src/include/access/xlogutils.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlogutils.h,v
retrieving revision 1.26
diff -c -r1.26 xlogutils.h
*** src/include/access/xlogutils.h 11 Aug 2008 11:05:11 -0000 1.26
--- src/include/access/xlogutils.h 22 Oct 2008 22:06:41 -0000
***************
*** 25,34 ****
BlockNumber nblocks);
extern Buffer XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init);
extern Buffer XLogReadBufferWithFork(RelFileNode rnode, ForkNumber forknum,
! BlockNumber blkno, bool init);
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
#endif
--- 25,39 ----
BlockNumber nblocks);
extern Buffer XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init);
+ extern Buffer XLogReadBufferForCleanup(RelFileNode rnode, BlockNumber blkno, bool init);
extern Buffer XLogReadBufferWithFork(RelFileNode rnode, ForkNumber forknum,
! BlockNumber blkno, bool init, int mode);
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+ /* include this here to avoid including nbtree.h */
+ extern bool btree_needs_cleanup_lock(uint8 info);
+
+
#endif
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.115
diff -c -r1.115 bufmgr.h
*** src/include/storage/bufmgr.h 11 Aug 2008 11:05:11 -0000 1.115
--- src/include/storage/bufmgr.h 22 Oct 2008 22:00:36 -0000
***************
*** 58,63 ****
--- 58,66 ----
#define BUFFER_LOCK_SHARE 1
#define BUFFER_LOCK_EXCLUSIVE 2
+ /* Not used by LockBuffer, but is used by XLogReadBuffer... */
+ #define BUFFER_LOCK_CLEANUP 3
+
/*
* These routines are beaten on quite heavily, hence the macroization.
*/
Simon Riggs wrote:
I'm undecided as to whether this is too much code or too little. I'm
somewhat uncertain as to the locking requirements for the physical scan
during a vacuum. I've worked out various options if we need to change
this.
For the B-tree, an exclusive lock is enough to modify the contents of
the page. A cleanup lock needs to be taken on every page to ensure that
the vacuum doesn't finish and delete a heap tuple that's about to be
accessed by an index scan. That could be handled in WAL replay by
dropping the exclusive lock and re-acquiring the cleanup lock, but we
can't do that for heap vacuums.
However, we require that in b-tree vacuum, you take a cleanup lock on
*every* leaf page of the index, not only those that you modify. That's a
problem, because there's no trace of such pages in the WAL.
PS. FSM doesn't need cleanup locks.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-10-23 at 09:09 +0300, Heikki Linnakangas wrote:
However, we require that in b-tree vacuum, you take a cleanup lock on
*every* leaf page of the index, not only those that you modify. That's a
problem, because there's no trace of such pages in the WAL.
OK, good. Thanks for the second opinion. I'm glad you said that, cos I
felt sure anybody reading the patch would say "what the hell does this
bit do?". Now I can add it.
My solution is fairly simple:
As we pass through the index we keep track of which blocks need
visiting, then append that information onto the next WAL record. If the
last block doesn't contain removed rows, then we send a no-op message
saying which blocks to visit.
I'd already invented the XLOG_BTREE_VACUUM record, so now we just need
to augment it further with two fields: ordered array of blocks to visit,
and a doit flag.
Say we have a 10 block index, with rows to be removed on blocks 3, 4 and 8.
As we visit all 10 in sequence we would issue WAL records:
XLOG_BTREE_VACUUM block 3 visitFirst {1, 2} doit = true
XLOG_BTREE_VACUUM block 4 visitFirst {} doit = true
XLOG_BTREE_VACUUM block 8 visitFirst {5,6,7} doit = true
XLOG_BTREE_VACUUM block 10 visitFirst {9} doit = false
So that allows us to issue the same number of WAL messages yet include
all the required information to repeat the process correctly.
(The blocks can be visited out of sequence in some cases, hence the
ordered array of blocks to visit rather than just a first block value).
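As an illustration only (the field names are invented here, following the
visitFirst/doit naming above; the real record would extend the existing
xl_btree_delete layout), the augmented record might look like:
typedef struct xl_btree_vacuum
{
	RelFileNode node;          /* index relation */
	BlockNumber block;         /* block whose items are removed, if doit */
	bool        doit;          /* false => visit-only record, nothing deleted */
	uint16      nvisit;        /* number of entries in visitFirst[] */
	BlockNumber visitFirst[1]; /* blocks to cleanup-lock before 'block', in
	                            * the order VACUUM visited them */
	/* TARGET OFFSET NUMBERS FOLLOW AT THE END, as in xl_btree_delete */
} xl_btree_vacuum;
Redo would then take a cleanup lock on each visitFirst[] block in order (or
skip it if it isn't cached, per the tweak below) before deleting the items on
'block' when doit is true.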
It would also be possible to introduce a special tweak there, which is
that if the block is not in cache, we don't read it in at all. If it's not
in cache we know that nobody has a pin on it, so we don't need to read it
in just to say "got the lock". That's icing for later.
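In code, the tweak could look roughly like this; BufferProbe() is a made-up
name for a buffer-manager call that returns a pinned buffer only if the block
is already resident in shared buffers - no such entry point exists yet:
static void
cleanup_lock_if_cached(RelFileNode rnode, BlockNumber blkno)
{
	/* hypothetical probe: pin the block only if it is already in shared buffers */
	Buffer buffer = BufferProbe(rnode, MAIN_FORKNUM, blkno);
	if (!BufferIsValid(buffer))
		return;             /* not cached, so nobody can hold a pin on it */
	/* wait out any remaining pins, as VACUUM would have done on the master */
	LockBufferForCleanup(buffer);
	UnlockReleaseBuffer(buffer);
}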
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
It would also be possible to introduce a special tweak there, which is
that if the block is not in cache, we don't read it in at all. If it's not
in cache we know that nobody has a pin on it, so we don't need to read it
in just to say "got the lock". That's icing for later.
Heh, that's clever :-). We could actually use a method like that to
solve the whole problem. After the (replay of the) b-tree vacuum is
finished, scan through all shared buffers, and get a vacuum lock on
every page of that index that's in cache. Groveling through all shared
buffers would be slower for small indexes, though.
I believe the "vacuum lock every leaf page" behavior is only needed for
system catalogs. You have other issues with those still, namely that
table locks are not yet taken appropriately, so I'm not sure if this is
worth worrying about until that's done.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
It would also be possible to introduce a special tweak there, which is
that if the block is not in cache, we don't read it in at all. If it's not
in cache we know that nobody has a pin on it, so we don't need to read it
in just to say "got the lock". That's icing for later.
Heh, that's clever :-). We could actually use a method like that to
solve the whole problem. After the (replay of the) b-tree vacuum is
finished, scan through all shared buffers, and get a vacuum lock on
every page of that index that's in cache. Groveling through all shared
buffers would be slower for small indexes, though.
Well, re-examining all of the assumptions in the code seems to have been
worthwhile so far. I think that makes 4 significant tweaks that have
come out of the Search For Hot Standby.
I believe the "vacuum lock every leaf page" behavior is only needed for
system catalogs.
So we will still need it then. Which is good 'cos I just wrote it.
You have other issues with those still, namely that
table locks are not yet taken appropriately, so I'm not sure if this is
worth worrying about until that's done.
Please explain. If you know of a correctness issue, please say.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
You have other issues with those still, namely that
table locks are not yet taken appropriately, so I'm not sure if this is
worth worrying about until that's done.
Please explain. If you know of a correctness issue, please say.
Things like DROP TABLE need to take an AccessExclusiveLock, and you
don't have that yet, no?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-10-23 at 20:24 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
You have other issues with those still, namely that
table locks are not yet taken appropriately, so I'm not sure if this is
worth worrying about until that's done.
Please explain. If you know of a correctness issue, please say.
Things like DROP TABLE need to take an AccessExclusiveLock, and you
don't have that yet, no?
Oh right, I thought you meant something not already in the plan.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-10-23 at 09:57 +0100, Simon Riggs wrote:
On Thu, 2008-10-23 at 09:09 +0300, Heikki Linnakangas wrote:
However, we require that in b-tree vacuum, you take a cleanup lock on
*every* leaf page of the index, not only those that you modify. That's a
problem, because there's no trace of such pages in the WAL.
OK, good. Thanks for the second opinion. I'm glad you said that, cos I
felt sure anybody reading the patch would say "what the hell does this
bit do?". Now I can add it.
Heikki,
When we discussed this before, I was glad that you'd mentioned that
aspect since it allowed me to say "if two of us think that then it must
be true".
I didn't include that in the final patch because it felt wrong. I didn't
have a rational explanation for that then, just a bad feeling. So, after
lots of sleep, here's my rational explanation of why we do *not* need
that during hot standby queries:
VACUUM with a btree index proceeds like this:
1. Scan table
2. Remove rows from btree identified in (1)
3. Remove rows from heap identified in (1)
The purpose of the additional locking requirements during (2) for btrees
is to ensure that we do not fail to find the rows identified in (1),
because the rows can move after (1) and during (2) because of block
splits.
Requoting verbatim from the README: "The tricky part of this is to avoid
missing any deletable tuples in the presence of concurrent page splits:
a page split could easily move some tuples from a page not yet passed
over by the sequential scan to a lower-numbered page already passed
over." In recovery there are no concurrent page splits and the WAL
records represent already successfully identified deletable tuples.
On a standby server the rows will not move other than via WAL records.
So there is no possibility that a WAL record will fail to find the row
it was looking for. On the master we were looking for a tuple that
pointed to a htid, whereas in WAL replay we look directly at the index
tuple via its tid, not via the htid it points to. Therefore we do not
need the additional locking.
That seems logical to me, so I will leave that out.
Any alternative views?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
VACUUM with a btree index proceeds like this:
1. Scan table
2. Remove rows from btree identified in (1)
3. Remove rows from heap identified in (1)
The purpose of the additional locking requirements during (2) for btrees
is to ensure that we do not fail to find the rows identified in (1),
because the rows can move after (1) and during (2) because of block
splits.
No, you are missing the point. One purpose of the additional locking
requirements is to ensure that there is not a concurrent process that
has read a btree index entry just before you removed it but arrives at
the heap page only after you removed the heap entry (and, perhaps,
replaced it with some other row that doesn't match the index entry at
all). This is clearly still a risk in a hot-standby environment.
regards, tom lane
On Mon, 2008-11-03 at 10:07 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
VACUUM with a btree index proceeds like this:
1. Scan table
2. Remove rows from btree identified in (1)
3. Remove rows from heap identified in (1)
The purpose of the additional locking requirements during (2) for btrees
is to ensure that we do not fail to find the rows identified in (1),
because the rows can move after (1) and during (2) because of block
splits.
No, you are missing the point. One purpose of the additional locking
requirements is to ensure that there is not a concurrent process that
has read a btree index entry just before you removed it but arrives at
the heap page only after you removed the heap entry (and, perhaps,
replaced it with some other row that doesn't match the index entry at
all). This is clearly still a risk in a hot-standby environment.
OK, I think I get it now. Thanks for putting me straight.
So I will implement the locking-every-page approach discussed upthread.
I will just keep a note of the blocks touched, in exactly that order, and
store that information in the WAL records accordingly.
Are you happy with my optimisation that if a page needs to be read in,
we just skip it and pretend we did read-pin-unpin on it? I would
implement that as a new ReadBuffer mode (in Heikki's new API
terminology).
If you know/can see any other missing correctness requirements please
let me know. I've not had trouble understanding any of the other
correctness requirements, but I'll leave it to review to judge whether
I've implemented them all correctly.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
Are you happy with my optimisation that if a page needs to be read in,
we just skip it and pretend we did read-pin-unpin on it?
If it's not in buffers then it cannot be pinned, so I concur that that
is safe.
regards, tom lane