Page Scan Mode in Hash Index

Started by Ashutosh Sharma · almost 9 years ago · 57 messages
#1 Ashutosh Sharma
ashu.coek88@gmail.com
7 attachment(s)

Hi All,

Currently, a hash index scan works tuple-at-a-time, i.e. for every
qualifying tuple in a page, it acquires and releases the lock, which
eventually increases the lock/unlock traffic. For example, if an index
page contains 100 qualifying tuples, the current hash index scan has to
acquire and release the lock 100 times to read those tuples, which is
not good from a performance perspective and also impacts concurrency
with VACUUM.

Considering the above points, I would like to propose a patch that allows
the hash index scan to work in page-at-a-time mode. In page-at-a-time
mode, once a lock is acquired on a target bucket page, the entire page
is scanned and all the qualifying tuples are saved into the backend's
local memory. This reduces the lock/unlock calls for retrieving tuples
from a page. Moreover, it also eliminates the problem of re-finding the
position of the last returned index tuple, and more importantly it
allows VACUUM to release the lock on the current page before moving to
the next page, which eventually improves its concurrency with scans.
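
To illustrate the idea, the reader side roughly does the following (a
simplified sketch only, not the patch code; the tuple_matches_scankey()
helper and the items[] bookkeeping are made up here for illustration --
the real work is done by _hash_readpage() and _hash_saveitem() in the
attached patch):

/*
 * Page-at-a-time reading: take the content lock once, copy every
 * qualifying TID into backend-local memory, then drop the lock before
 * returning anything to the executor.
 */
LockBuffer(buf, BUFFER_LOCK_SHARE);
page = BufferGetPage(buf);
maxoff = PageGetMaxOffsetNumber(page);
nitems = 0;
for (offnum = FirstOffsetNumber; offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                PageGetItemId(page, offnum));

    if (tuple_matches_scankey(scan, itup))      /* illustrative only */
    {
        items[nitems].heapTid = itup->t_tid;    /* backend-local array */
        items[nitems].indexOffset = offnum;
        nitems++;
    }
}
LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* keep the pin, drop the lock */
/* later hashgettuple() calls just walk items[0 .. nitems - 1] */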

The attached patch modifies the hash index scan code for page-at-a-time
mode. For better readability, I have split it into 3 parts,

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage() to save the qualifying tuples found
in a page into the items array. Apart from that, this patch basically
cleans up _hash_first(), _hash_next() and hashgettuple().

2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
this patch basically removes the redundant function _hash_step() and
some of the unused members of HashScanOpaqueData structure.

3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
this patch basically improves the locking strategy for VACUUM in hash
index. As the new hash index scan works page-at-a-time, vacuum can
release the lock on the previous page before acquiring a lock on the
next page, hence improving hash index concurrency.
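
Concretely, the change in hashbucketcleanup() amounts to swapping the
order of two existing steps (condensed from the 0003 patch; buf, blkno,
retain_pin and bstrategy are the variables already used in that
function):

/* Old order: read and lock the next overflow page while still holding
 * the lock on the previous one. */
next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
                                      LH_OVERFLOW_PAGE, bstrategy);
if (retain_pin)
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
else
    _hash_relbuf(rel, buf);

/* New order: release the previous page first, so vacuum never holds
 * locks on two pages at once and a concurrent scan is not blocked
 * behind it. */
if (retain_pin)
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
else
    _hash_relbuf(rel, buf);
next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
                                      LH_OVERFLOW_PAGE, bstrategy);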

Please note that the above patches have to be applied on top of the
following patches: 'WAL in hash index' [1] and 'Microvacuum support for
hash index' [2]. Note that in the current head, marking dead tuples
requires a lock on the page. So, even if the hash index scan is done
page-at-a-time, it would still require a lock on the page just to mark
dead tuples, hence losing the advantage of page-at-a-time mode.
Therefore, I developed this patch over 'Microvacuum support for hash
index' [2].

I have also done the benchmarking of this patch and would like to
share the results.

Firstly, I did the benchmarking with non-unique values and I could see
a performance improvement of 4-7%. For the detailed results please find
the attached file 'results-non-unique-values-70ff'; ddl.sql and test.sql
are the test scripts used in this experiment. The details of the
non-default GUC params and the pgbench command are mentioned in the
result sheet. I also did the benchmarking with unique values at 300 and
1000 scale factors, and those results are provided in
'results-unique-values-default-ff'.

[1]: /messages/by-id/CAA4eK1LTyDHyCmj3pf5KxWgPb1DgNae9ivsB5jX0X_Kt7iLTUA@mail.gmail.com
[2]: /messages/by-id/CAA4eK1JfrJoa15XStmRKy6mGsVjKh_aa-EXZY+UZQOV6mGM0QQ@mail.gmail.com

Attachments:

0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.-T.patch
From 624b9fec20736554326fd69b0da5f9f2fe11c8f3 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:34:56 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a time. This
 eliminates the problem of re-finding the exact stopping point in a index
 page. With this, it also reduces the lock/unlock traffic thereby increasing
 the hash index scan speed by some margin.

Patch by Ashutosh Sharma
---
 src/backend/access/hash/hash.c       | 120 ++-----------
 src/backend/access/hash/hashpage.c   |  14 +-
 src/backend/access/hash/hashsearch.c | 330 ++++++++++++++++++++++++++++++-----
 src/backend/access/hash/hashutil.c   |  23 ++-
 src/include/access/hash.h            |  44 +++++
 5 files changed, 379 insertions(+), 152 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index c233c33..d04d035 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -271,65 +271,24 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
 	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
-		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
+		currItem = &so->currPos.items[so->currPos.itemIndex];
+		offnum = currItem->indexOffset;
 		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
@@ -346,9 +305,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
 			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+				so->killedItems[so->numKilled].heapTid = currItem->heapTid;
+				so->killedItems[so->numKilled].indexOffset = offnum;
 				so->numKilled++;
 			}
 		}
@@ -358,30 +316,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +334,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns
+		 * dead tuples. Therefore, we can always add
+		 * the tuples into TIDBitmap without checking
+		 * if a tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +373,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -482,10 +404,6 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 
 	_hash_dropscanbuf(rel, so);
 
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
-
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
 	{
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 300a15d..7f6ce55 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -299,20 +299,22 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
-	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf) &&
+		so->hashso_bucket_buf == so->currPos.buf)
+		_hash_dropbuf(rel, so->currPos.buf);
+
+	so->currPos.buf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..96da9b5 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,87 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we release pin and set
+ *		so->currPos.buf to InvalidBuffer.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool        tuples_to_read;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more,
+	 * try to read data from next or prev page based on the scan
+	 * direction. Before moving to the next or prev page make sure
+	 * that we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				tuples_to_read = _hash_readpage(scan, &buf, dir);
+				if (!tuples_to_read)
+					return false;
+			}
+			else
+				return false;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				tuples_to_read = _hash_readpage(scan, &buf, dir);
+				if (!tuples_to_read)
+					return false;
+			}
+			else
+				return false;
+		}
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +255,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item(or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos,
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +276,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +397,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found a valid tuples */
 	return true;
 }
 
@@ -575,3 +614,208 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if it satifies the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Returns true if any matching items are found else returns false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	maxoff;
+	OffsetNumber	offnum;
+	IndexTuple		itup;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+loop_top_fwd:
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+
+		if (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				goto loop_top_fwd;
+			}
+			else
+				return false;
+		}
+		else
+		{
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, so->currPos.buf);
+
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+loop_top_bwd:
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+
+		if (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				goto loop_top_bwd;
+			}
+			else
+				return false;
+		}
+		else
+		{
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 4810553..c493c27 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -465,6 +465,9 @@ void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber	blkno;
+	Buffer	buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -481,7 +484,19 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+	{
+		if (BlockNumberIsValid(blkno))
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+	}
+
+	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -513,6 +528,10 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+	if (so->hashso_bucket_buf == so->currPos.buf)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index fb6e34f..4efed52 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,44 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	BlockNumber	currPage;		/* current hash index page */
+	BlockNumber	nextPage;		/* next overflow page */
+	BlockNumber	prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int		firstItem;			/* first valid index in items[] */
+	int		lastItem;			/* last valid index in items[] */
+	int		itemIndex;			/* current index in items[] */
+
+   HashScanPosItem items[MaxIndexTuplesPerPage];     /* MUST BE LAST */
+} HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -147,6 +185,12 @@ typedef struct HashScanOpaqueData
 	/* info about killed items if any (killedItems is NULL if never used) */
 	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-function-_hash_step-and-some-of-the.patch
From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step() used to find the
 next qualifing tuple in the index page is no more required as new hash index
 scan works page at a time which means it reads all the qualifing tuples in a
 page at once with the help of a new function _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch
From ccd7cdfad9f2d14e4aa797ed4b48013fba96b23d Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 8 Feb 2017 11:02:13 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index. As the
 new hash index scan work a page at a time, vacuum can release the lock on
 previous page before trying to acquire lock on a next page thereby improving
 hash index concurrency.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hash.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index c1c1fec..5e1c313 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -828,19 +828,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the new hash index scan work in page at a time mode,
+		 * vacuum can release the lock on previous page before
+		 * acquiring lock on the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
-- 
1.8.3.1

results-non-unique-values-70ff.xlsx
ddl.sql
test.sql
results-unique-values-default-ff.xlsx
#2 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Ashutosh Sharma (#1)
Re: Page Scan Mode in Hash Index

Hi, Ashutosh!

I've assigned myself to review this patch.
First of all, I'd like to say that I like the idea and the general design.
Secondly, the patch set doesn't apply cleanly to master. Please rebase it.

On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage to save all the qualifying tuples found
in a page into an items array. Apart from that, this patch bascially
cleans _hash_first(), _hash_next and hashgettuple().

I see that the forward and backward scan cases of the function
_hash_readpage() contain a lot of code duplication.
Could you please refactor this function to have less code duplication?

Also, I wonder if you have a special idea behind inserting data in test.sql
by 1002 separate SQL statements.
INSERT INTO con_hash_index_table (keycol) SELECT a FROM GENERATE_SERIES(1,
1000) a;

You can achieve the same result by executing a single SQL statement.
INSERT INTO con_hash_index_table (keycol) SELECT (a - 1) % 1000 + 1 FROM
GENERATE_SERIES(1, 1002000) a;
Unless you have some special idea of doing this in 1002 separate
transactions.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#3 Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Alexander Korotkov (#2)
Re: Page Scan Mode in Hash Index

Hi,

I've assigned myself to review this patch.
First of all, I'd like to say that I like the idea and the general design.
Secondly, the patch set doesn't apply cleanly to master. Please rebase it.

Thanks for showing your interest in this patch. I would like to
inform you that this patch has a dependency on the patches for 'Write
Ahead Logging in hash index' [1] and 'Microvacuum support in hash index'
[2]. Hence, until the above two patches become stable I may have to keep
rebasing this patch. However, I will try to share the updated patch with
you asap.

On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage() to save the qualifying tuples found
in a page into the items array. Apart from that, this patch basically
cleans up _hash_first(), _hash_next() and hashgettuple().

I see that the forward and backward scan cases of the function
_hash_readpage() contain a lot of code duplication.
Could you please refactor this function to have less code duplication?

Sure, I will try to avoid the code duplication as much as possible.

Also, I wonder if you have a special idea behind inserting data in test.sql by 1002 separate SQL statements.
INSERT INTO con_hash_index_table (keycol) SELECT a FROM GENERATE_SERIES(1, 1000) a;

You can achieve the same result by executing a single SQL statement.
INSERT INTO con_hash_index_table (keycol) SELECT (a - 1) % 1000 + 1 FROM GENERATE_SERIES(1, 1002000) a;
Unless you have some special idea of doing this in 1002 separate transactions.

There is no reason for having so many INSERT statements in the test.sql
file. I think it would be better to replace them with a single SQL
statement. Thanks.

[1]: /messages/by-id/CAA4eK1KibVzgVETVay0+siVEgzaXnP5R21BdWiK9kg9wx2E40Q@mail.gmail.com
[2]: /messages/by-id/CAE9k0PkRSyzx8dOnokEpUi2A-RFZK72WN0h9DEMv_ut9q6bPRw@mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


#4 Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Ashutosh Sharma (#3)
1 attachment(s)
Re: Page Scan Mode in Hash Index

Hi,

I've assigned myself to review this patch.
First of all, I'd like to say that I like the idea and the general design.
Secondly, the patch set doesn't apply cleanly to master. Please rebase it.

Thanks for showing your interest in this patch. I would like to
inform you that this patch has a dependency on the patches for 'Write
Ahead Logging in hash index' [1] and 'Microvacuum support in hash index'
[2]. Hence, until the above two patches become stable I may have to keep
rebasing this patch. However, I will try to share the updated patch with
you asap.

On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage() to save the qualifying tuples found
in a page into the items array. Apart from that, this patch basically
cleans up _hash_first(), _hash_next() and hashgettuple().

I see that the forward and backward scan cases of the function
_hash_readpage() contain a lot of code duplication.
Could you please refactor this function to have less code duplication?

Sure, I will try to avoid the code duplication as much as possible.

I had a close look at the _hash_readpage() function and could see that
there are only a few if-else conditions which are similar for both the
forward and backward scan cases, and those can't be optimised further.
However, if you take a cursory look at this function's definition, it
looks like the code for the forward and backward scans is exactly the
same, but that's not the case. Attached is a diff report
(hash_readpage.html) for the forward and backward scan code used in
_hash_readpage(). This shows which lines in _hash_readpage() are the
same and which differ.

Please note that before applying the patches for page scan mode in
hash index, you may need to first apply the following patches on HEAD:

1) v10 patch for WAL in hash index - [1]
2) v1 patch for WAL consistency check for hash index - [2]
3) v6 patch for microvacuum support in hash index - [3]

[1]: /messages/by-id/CAA4eK1+k5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn+72o1UA@mail.gmail.com
[2]: /messages/by-id/CAGz5QCKPU2qX75B1bB_LuEC88xWZa5L55J0TLvYMVD8noSH3pA@mail.gmail.com
[3]: /messages/by-id/CAE9k0PkYpAPDJBfgia08o7XhO8nypH9WoO9M8=dqLrwwObXKcw@mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

hash_readpage.html
#5 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Ashutosh Sharma (#1)
Re: Page Scan Mode in Hash Index

Hi,

On 02/14/2017 12:27 AM, Ashutosh Sharma wrote:

Currently, a hash index scan works tuple-at-a-time, i.e. for every
qualifying tuple in a page, it acquires and releases the lock, which
eventually increases the lock/unlock traffic. For example, if an index
page contains 100 qualifying tuples, the current hash index scan has to
acquire and release the lock 100 times to read those tuples, which is
not good from a performance perspective and also impacts concurrency
with VACUUM.

Considering the above points, I would like to propose a patch that allows
the hash index scan to work in page-at-a-time mode. In page-at-a-time
mode, once a lock is acquired on a target bucket page, the entire page
is scanned and all the qualifying tuples are saved into the backend's
local memory. This reduces the lock/unlock calls for retrieving tuples
from a page. Moreover, it also eliminates the problem of re-finding the
position of the last returned index tuple, and more importantly it
allows VACUUM to release the lock on the current page before moving to
the next page, which eventually improves its concurrency with scans.

The attached patch modifies the hash index scan code for page-at-a-time
mode. For better readability, I have split it into 3 parts,

Due to the commits on master these patches apply with hunks.

The README should be updated to mention the use of page scan.

hash.h needs pg_indent.

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage() to save the qualifying tuples found
in a page into the items array. Apart from that, this patch basically
cleans up _hash_first(), _hash_next() and hashgettuple().

For _hash_next() I don't see this - can you explain?

+ *
+ *             On failure exit (no more tuples), we release pin and set
+ *             so->currPos.buf to InvalidBuffer.

+ * Returns true if any matching items are found else returns false.

s/Returns/Return/g

2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
this patch basically removes the redundant function _hash_step() and
some of the unused members of HashScanOpaqueData structure.

Looks good.

3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
this patch basically improves the locking strategy for VACUUM in hash
index. As the new hash index scan works page-at-a-time, vacuum can
release the lock on the previous page before acquiring a lock on the
next page, hence improving hash index concurrency.

+ * As the new hash index scan work in page at a time mode,

Remove 'new'.

I have also done the benchmarking of this patch and would like to
share the results.

Firstly, I did the benchmarking with non-unique values and I could see
a performance improvement of 4-7%. For the detailed results please find
the attached file 'results-non-unique-values-70ff'; ddl.sql and test.sql
are the test scripts used in this experiment. The details of the
non-default GUC params and the pgbench command are mentioned in the
result sheet. I also did the benchmarking with unique values at 300 and
1000 scale factors, and those results are provided in
'results-unique-values-default-ff'.

I'm seeing similar results, and especially with write heavy scenarios.

Best regards,
Jesper


#6 Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Jesper Pedersen (#5)
3 attachment(s)
Re: Page Scan Mode in Hash Index

Hi,

The attached patch modifies the hash index scan code for page-at-a-time
mode. For better readability, I have split it into 3 parts,

Due to the commits on master these patches apply with hunks.

The README should be updated to mention the use of page scan.

Done. Please refer to the attached v2 version of patch.

hash.h needs pg_indent.

Fixed.

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage() to save the qualifying tuples found
in a page into the items array. Apart from that, this patch basically
cleans up _hash_first(), _hash_next() and hashgettuple().

For _hash_next() I don't see this - can you explain?

Sorry, it was wrongly copied from the btree code. I have corrected it now.
Please check the attached v2 version of the patch.

+ *
+ *             On failure exit (no more tuples), we release pin and set
+ *             so->currPos.buf to InvalidBuffer.

+ * Returns true if any matching items are found else returns false.

s/Returns/Return/g

Done.

2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
this patch basically removes the redundant function _hash_step() and
some of the unused members of HashScanOpaqueData structure.

Looks good.

3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
this patch basically improves the locking strategy for VACUUM in hash
index. As the new hash index scan works page-at-a-time, vacuum can
release the lock on the previous page before acquiring a lock on the
next page, hence improving hash index concurrency.

+ * As the new hash index scan work in page at a time mode,

Remove 'new'.

Done.

I have also done the benchmarking of this patch and would like to
share the results.

Firstly, I did the benchmarking with non-unique values and I could see
a performance improvement of 4-7%. For the detailed results please find
the attached file 'results-non-unique-values-70ff'; ddl.sql and test.sql
are the test scripts used in this experiment. The details of the
non-default GUC params and the pgbench command are mentioned in the
result sheet. I also did the benchmarking with unique values at 300 and
1000 scale factors, and those results are provided in
'results-unique-values-default-ff'.

I'm seeing similar results, and especially with write heavy scenarios.

Great..!!

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev2.patch
From 78445d11db7157908d5558eedd253b9b28fb1e3e Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 22 Mar 2017 18:43:40 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a timev2

Patch by Ashutosh Sharma
---
 src/backend/access/hash/README       |   9 +-
 src/backend/access/hash/hash.c       | 120 ++-----------
 src/backend/access/hash/hashpage.c   |  14 +-
 src/backend/access/hash/hashsearch.c | 330 ++++++++++++++++++++++++++++++-----
 src/backend/access/hash/hashutil.c   |  23 ++-
 src/include/access/hash.h            |  44 +++++
 6 files changed, 384 insertions(+), 156 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..f0a7bdf 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -243,10 +243,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from current index page into an items array
+    release pin and content lock (but if it is primary bucket page retain
+    it's pin till the end of scan)
+    get tuple from an item array
 -- at scan shutdown:
 	release all pins still held
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..2450ee1 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,65 +268,24 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
 	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
-		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
+		currItem = &so->currPos.items[so->currPos.itemIndex];
+		offnum = currItem->indexOffset;
 		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
@@ -346,9 +305,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
 			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+				so->killedItems[so->numKilled].heapTid = currItem->heapTid;
+				so->killedItems[so->numKilled].indexOffset = offnum;
 				so->numKilled++;
 			}
 		}
@@ -358,30 +316,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +334,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns
+		 * dead tuples. Therefore, we can always add
+		 * the tuples into TIDBitmap without checking
+		 * if a tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +373,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -482,10 +404,6 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 
 	_hash_dropscanbuf(rel, so);
 
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
-
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
 	{
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 622cc4b..8515c28 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,22 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
-	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf) &&
+		so->hashso_bucket_buf == so->currPos.buf)
+		_hash_dropbuf(rel, so->currPos.buf);
+
+	so->currPos.buf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..1f05b1f 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,87 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool        tuples_to_read;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more,
+	 * try to read data from next or prev page based on the scan
+	 * direction. Before moving to the next or prev page make sure
+	 * that we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				tuples_to_read = _hash_readpage(scan, &buf, dir);
+				if (!tuples_to_read)
+					return false;
+			}
+			else
+				return false;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				tuples_to_read = _hash_readpage(scan, &buf, dir);
+				if (!tuples_to_read)
+					return false;
+			}
+			else
+				return false;
+		}
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +255,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos,
+ *		and scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +276,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +397,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found valid tuples */
 	return true;
 }
 
@@ -575,3 +614,208 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if they satisfy the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Returns true if any matching items are found, otherwise returns false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	maxoff;
+	OffsetNumber	offnum;
+	IndexTuple		itup;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+loop_top_fwd:
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+
+		if (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				goto loop_top_fwd;
+			}
+			else
+				return false;
+		}
+		else
+		{
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, so->currPos.buf);
+
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+loop_top_bwd:
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+
+		if (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				goto loop_top_bwd;
+			}
+			else
+				return false;
+		}
+		else
+		{
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..ecda225 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -463,6 +463,9 @@ void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber	blkno;
+	Buffer	buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -479,7 +482,19 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+	{
+		if (BlockNumberIsValid(blkno))
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+	}
+
+	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -511,6 +526,10 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+	if (so->hashso_bucket_buf == so->currPos.buf)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..3b01e3e 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,44 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];		/* MUST BE LAST */
+}	HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -147,6 +185,12 @@ typedef struct HashScanOpaqueData
 	/* info about killed items if any (killedItems is NULL if never used) */
 	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-function-_hash_step-and-some-of-the.patch (application/x-patch)
From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step() used to find the
 next qualifying tuple in the index page is no longer required, as the new hash
 index scan works a page at a time, reading all the qualifying tuples in a page
 at once with the help of the new function _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch (application/x-patch)
From 532287d11f4fd03e78a613cabb6751f8ab22fa59 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 22 Mar 2017 18:51:22 +0530
Subject: [PATCH] Improve locking strategy during VACUUM in Hash Index v2

Patch by Ashutosh Sharma
---
 src/backend/access/hash/hash.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 2450ee1..3ac8154 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -841,19 +841,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode,
+		 * vacuum can release the lock on the previous page before
+		 * acquiring the lock on the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
-- 
1.8.3.1
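
To make the items[] handling in HashScanPosData (added in the 0001 patch
above) easier to follow, here is a small standalone sketch. It is an
illustration only, not part of the patches: it mimics how _hash_readpage()
fills the array from slot 0 upward for a forward scan and from the last
slot downward for a backward scan, and how firstItem, lastItem and
itemIndex are then used to walk the saved matches. All names below
(DemoScanPos, MAX_ITEMS, fill_forward, fill_backward) are local to this
example, with MAX_ITEMS standing in for MaxIndexTuplesPerPage.

#include <stdio.h>

#define MAX_ITEMS 8				/* stands in for MaxIndexTuplesPerPage */

typedef struct
{
	int		items[MAX_ITEMS];	/* stands in for HashScanPosItem items[] */
	int		firstItem;		/* first valid slot in items[] */
	int		lastItem;		/* last valid slot in items[] */
	int		itemIndex;		/* slot most recently returned */
} DemoScanPos;

/* forward scan: matches are saved in ascending slot order, starting at 0 */
static void
fill_forward(DemoScanPos *pos, const int *matches, int nmatches)
{
	int		i;

	for (i = 0; i < nmatches; i++)
		pos->items[i] = matches[i];
	pos->firstItem = 0;
	pos->lastItem = nmatches - 1;
	pos->itemIndex = 0;		/* first item returned is items[0] */
}

/* backward scan: matches are saved from the last slot downwards */
static void
fill_backward(DemoScanPos *pos, const int *matches, int nmatches)
{
	int		slot = MAX_ITEMS;
	int		i;

	for (i = nmatches - 1; i >= 0; i--)
		pos->items[--slot] = matches[i];
	pos->firstItem = slot;
	pos->lastItem = MAX_ITEMS - 1;
	pos->itemIndex = MAX_ITEMS - 1;	/* first item returned is the last slot */
}

int
main(void)
{
	int		matches[] = {11, 22, 33};	/* pretend matching tuples */
	DemoScanPos pos;
	int		idx;

	fill_forward(&pos, matches, 3);
	for (idx = pos.itemIndex; idx <= pos.lastItem; idx++)
		printf("forward scan returns %d\n", pos.items[idx]);

	fill_backward(&pos, matches, 3);
	for (idx = pos.itemIndex; idx >= pos.firstItem; idx--)
		printf("backward scan returns %d\n", pos.items[idx]);

	return 0;
}

Either way items[] stays in index order; only the first slot used and the
direction in which itemIndex moves differ, which is what lets _hash_next()
simply increment or decrement itemIndex until it runs past lastItem or
firstItem and only then step to another page.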

#7Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Ashutosh Sharma (#6)
Re: Page Scan Mode in Hash Index

Hi,

On 03/22/2017 09:32 AM, Ashutosh Sharma wrote:

Done. Please refer to the attached v2 version of patch.

Thanks.

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage to save all the qualifying tuples found
in a page into an items array. Apart from that, this patch bascially
cleans _hash_first(), _hash_next and hashgettuple().

0001v2:

In hashgettuple() you can remove the 'currItem' and 'offnum' from the
'else' part, and do the assignment inside

if (so->numKilled < MaxIndexTuplesPerPage)

instead.
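
That is, something along these lines (just a sketch of that fragment of
hashgettuple(); 'currItem' and 'offnum' are fetched only at the point
where a killed item is actually recorded):

		if (so->numKilled < MaxIndexTuplesPerPage)
		{
			currItem = &so->currPos.items[so->currPos.itemIndex];
			offnum = currItem->indexOffset;

			so->killedItems[so->numKilled].heapTid = currItem->heapTid;
			so->killedItems[so->numKilled].indexOffset = offnum;
			so->numKilled++;
		}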

No new comments for 0002 and 0003.

Best regards,
Jesper


#8Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Jesper Pedersen (#7)
1 attachment(s)
Re: Page Scan Mode in Hash Index

On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

Hi,

On 03/22/2017 09:32 AM, Ashutosh Sharma wrote:

Done. Please refer to the attached v2 version of patch.

Thanks.

1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
patch rewrites the hash index scan module to work in page-at-a-time
mode. It basically introduces two new functions-- _hash_readpage() and
_hash_saveitem(). The former is used to load all the qualifying tuples
from a target bucket or overflow page into an items array. The latter
one is used by _hash_readpage to save all the qualifying tuples found
in a page into an items array. Apart from that, this patch bascially
cleans _hash_first(), _hash_next and hashgettuple().

0001v2:

In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else'
part, and do the assignment inside

if (so->numKilled < MaxIndexTuplesPerPage)

instead.

Done. Please have a look into the attached v3 patch.

No new comments for 0002 and 0003.

okay. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev3.patch (application/x-patch)
From 4e953c35da2274165b00d763500b83e0f3f9e2a9 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Thu, 23 Mar 2017 23:36:05 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a timev3

Patch by Ashutosh Sharma
---
 src/backend/access/hash/README       |   9 +-
 src/backend/access/hash/hash.c       | 121 +++----------
 src/backend/access/hash/hashpage.c   |  14 +-
 src/backend/access/hash/hashsearch.c | 330 ++++++++++++++++++++++++++++++-----
 src/backend/access/hash/hashutil.c   |  23 ++-
 src/include/access/hash.h            |  44 +++++
 6 files changed, 385 insertions(+), 156 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..f0a7bdf 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -243,10 +243,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from the current index page into an items array
+    release pin and content lock (but if it is the primary bucket page, retain
+    its pin till the end of the scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..8c28fbd 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,23 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
 	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -346,9 +303,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
 			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+				currItem = &so->currPos.items[so->currPos.itemIndex];
+				offnum = currItem->indexOffset;
+
+				so->killedItems[so->numKilled].heapTid = currItem->heapTid;
+				so->killedItems[so->numKilled].indexOffset = offnum;
 				so->numKilled++;
 			}
 		}
@@ -358,30 +317,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +335,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns
+		 * dead tuples. Therefore, we can always add
+		 * the tuples into TIDBitmap without checking
+		 * if a tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +374,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -482,10 +405,6 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 
 	_hash_dropscanbuf(rel, so);
 
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
-
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
 	{
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 622cc4b..8515c28 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,22 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
-	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf) &&
+		so->hashso_bucket_buf == so->currPos.buf)
+		_hash_dropbuf(rel, so->currPos.buf);
+
+	so->currPos.buf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..1f05b1f 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,87 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool        tuples_to_read;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more,
+	 * try to read data from next or prev page based on the scan
+	 * direction. Before moving to the next or prev page make sure
+	 * that we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				tuples_to_read = _hash_readpage(scan, &buf, dir);
+				if (!tuples_to_read)
+					return false;
+			}
+			else
+				return false;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				tuples_to_read = _hash_readpage(scan, &buf, dir);
+				if (!tuples_to_read)
+					return false;
+			}
+			else
+				return false;
+		}
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +255,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos,
+ *		and scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +276,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +397,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found valid tuples */
 	return true;
 }
 
@@ -575,3 +614,208 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if they satisfy the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Returns true if any matching items are found, otherwise returns false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	maxoff;
+	OffsetNumber	offnum;
+	IndexTuple		itup;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+loop_top_fwd:
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+
+		if (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				goto loop_top_fwd;
+			}
+			else
+				return false;
+		}
+		else
+		{
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, so->currPos.buf);
+
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+loop_top_bwd:
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+
+		if (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				goto loop_top_bwd;
+			}
+			else
+				return false;
+		}
+		else
+		{
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..ecda225 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -463,6 +463,9 @@ void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber	blkno;
+	Buffer	buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -479,7 +482,19 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+	{
+		if (BlockNumberIsValid(blkno))
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+	}
+
+	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -511,6 +526,10 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+	if (so->hashso_bucket_buf == so->currPos.buf)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..3b01e3e 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,44 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];		/* MUST BE LAST */
+}	HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -147,6 +185,12 @@ typedef struct HashScanOpaqueData
 	/* info about killed items if any (killedItems is NULL if never used) */
 	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

#9Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Ashutosh Sharma (#8)
Re: Page Scan Mode in Hash Index

Hi,

On 03/23/2017 02:11 PM, Ashutosh Sharma wrote:

On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

0001v2:

In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else'
part, and do the assignment inside

if (so->numKilled < MaxIndexTuplesPerPage)

instead.

Done. Please have a look into the attached v3 patch.

No new comments for 0002 and 0003.

okay. Thanks.

I'll keep the entry in 'Needs Review' if Alexander, or others, want to
add their feedback.

(Best to post the entire patch series each time)

Best regards,
Jesper


#10Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Jesper Pedersen (#9)
3 attachment(s)
Re: Page Scan Mode in Hash Index

Hi,

On 03/23/2017 02:11 PM, Ashutosh Sharma wrote:

On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

0001v2:

In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else'
part, and do the assignment inside

if (so->numKilled < MaxIndexTuplesPerPage)

instead.

Done. Please have a look into the attached v3 patch.

No new comments for 0002 and 0003.

okay. Thanks.

I'll keep the entry in 'Needs Review' if Alexander, or others, want to add
their feedback.

okay. Thanks.

(Best to post the entire patch series each time)

I take your suggestion. Please find the attachments.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev3.patch (application/x-patch)
From 4e953c35da2274165b00d763500b83e0f3f9e2a9 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Thu, 23 Mar 2017 23:36:05 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a timev3

Patch by Ashutosh Sharma
---
 src/backend/access/hash/README       |   9 +-
 src/backend/access/hash/hash.c       | 121 +++----------
 src/backend/access/hash/hashpage.c   |  14 +-
 src/backend/access/hash/hashsearch.c | 330 ++++++++++++++++++++++++++++++-----
 src/backend/access/hash/hashutil.c   |  23 ++-
 src/include/access/hash.h            |  44 +++++
 6 files changed, 385 insertions(+), 156 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..f0a7bdf 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -243,10 +243,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from the current index page into an items array
+    release pin and content lock (but if it is the primary bucket page, retain
+    its pin till the end of scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..8c28fbd 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,23 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
 	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -346,9 +303,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
 			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+				currItem = &so->currPos.items[so->currPos.itemIndex];
+				offnum = currItem->indexOffset;
+
+				so->killedItems[so->numKilled].heapTid = currItem->heapTid;
+				so->killedItems[so->numKilled].indexOffset = offnum;
 				so->numKilled++;
 			}
 		}
@@ -358,30 +317,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +335,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns
+		 * dead tuples. Therefore, we can always add
+		 * the tuples into TIDBitmap without checking
+		 * if a tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +374,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -482,10 +405,6 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 
 	_hash_dropscanbuf(rel, so);
 
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
-
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
 	{
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 622cc4b..8515c28 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,22 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
-	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf) &&
+		so->hashso_bucket_buf == so->currPos.buf)
+		_hash_dropbuf(rel, so->currPos.buf);
+
+	so->currPos.buf = InvalidBuffer;
+	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..1f05b1f 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,87 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool        tuples_to_read;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more,
+	 * try to read data from next or prev page based on the scan
+	 * direction. Before moving to the next or prev page make sure
+	 * that we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				tuples_to_read = _hash_readpage(scan, &buf, dir);
+				if (!tuples_to_read)
+					return false;
+			}
+			else
+				return false;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				tuples_to_read = _hash_readpage(scan, &buf, dir);
+				if (!tuples_to_read)
+					return false;
+			}
+			else
+				return false;
+		}
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +255,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item(or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page have been loaded into so->currPos;
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +276,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +397,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found valid tuples */
 	return true;
 }
 
@@ -575,3 +614,208 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if they satisfy the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Returns true if any matching items are found, else returns false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	maxoff;
+	OffsetNumber	offnum;
+	IndexTuple		itup;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+loop_top_fwd:
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+
+		if (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				goto loop_top_fwd;
+			}
+			else
+				return false;
+		}
+		else
+		{
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, so->currPos.buf);
+
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+loop_top_bwd:
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+
+		if (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				goto loop_top_bwd;
+			}
+			else
+				return false;
+		}
+		else
+		{
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..ecda225 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -463,6 +463,9 @@ void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber	blkno;
+	Buffer	buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -479,7 +482,19 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+	{
+		if (BlockNumberIsValid(blkno))
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+	}
+
+	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -511,6 +526,10 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+	if (so->hashso_bucket_buf == so->currPos.buf)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..3b01e3e 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,44 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];		/* MUST BE LAST */
+}	HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -147,6 +185,12 @@ typedef struct HashScanOpaqueData
 	/* info about killed items if any (killedItems is NULL if never used) */
 	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-function-_hash_step-and-some-of-the.patch (application/x-patch)
From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step(), used to find the
 next qualifying tuple in the index page, is no longer required as the new hash
 index scan works a page at a time, which means it reads all the qualifying
 tuples in a page at once with the help of the new function _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch (application/x-patch)
From 532287d11f4fd03e78a613cabb6751f8ab22fa59 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 22 Mar 2017 18:51:22 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index v2

Patch by Ashutosh Sharma
---
 src/backend/access/hash/hash.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 2450ee1..3ac8154 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -841,19 +841,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode,
+		 * vacuum can release the lock on the previous page before
+		 * acquiring the lock on the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
-- 
1.8.3.1

#11Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Sharma (#10)
Re: Page Scan Mode in Hash Index

On Thu, Mar 23, 2017 at 2:35 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I take your suggestion. Please find the attachments.

I think you should consider refactoring this so that it doesn't need
to use goto. Maybe move the while (offnum <= maxoff) logic into a
helper function and have it return itemIndex. If itemIndex == 0, you
can call it again. Notice that the code in the "else" branch of the
if (itemIndex == 0) stuff could actually be removed from the else
block without changing anything, because the "if" block always either
returns or does a goto. That's worthy of a little more work to try to
make things simple and clear.
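
Roughly, the refactoring being suggested would give _hash_readpage() a shape
like this (a sketch only, not actual patch code; _hash_load_qualified_items is
a placeholder name for the new helper, while _hash_readnext() and
_hash_binsearch() are the existing routines already used by the patch):

    itemIndex = _hash_load_qualified_items(scan, page, offnum, maxoff, dir);
    while (itemIndex == 0)
    {
        /* no matches on this page; step to the next page in the bucket chain */
        _hash_readnext(scan, &buf, &page, &opaque);
        if (!BufferIsValid(buf))
            return false;       /* ran off the end of the bucket chain */

        maxoff = PageGetMaxOffsetNumber(page);
        offnum = _hash_binsearch(page, so->hashso_sk_hash);
        itemIndex = _hash_load_qualified_items(scan, page, offnum, maxoff, dir);
    }

That avoids the goto entirely and keeps the page-switching logic in one place.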

+ * We find the first item(or, if backward scan, the last item) in

Missing space.

In _hash_dropscanbuf(), the handling of hashso_bucket_buf is now
inconsistent with the handling of hashso_split_bucket_buf, which seems
suspicious. Suppose we enter this function with so->hashso_bucket_buf
and so->currPos.buf both being valid buffers, but not the same one.
It looks to me as if we'll release the first pin but not the second
one. My guess (which could be wrong) is that so->hashso_bucket_buf =
InvalidBuffer should be moved back up higher in the function where it
was before, just after the first if statement, and that the new
condition so->hashso_bucket_buf == so->currPos.buf on the last
_hash_dropbuf() should be removed. If that's not right, then I think
I need somebody to explain why not.

I am suspicious that _hash_kill_items() is going to have problems if
the overflow page is freed before it reacquires the lock.
_btkillitems() contains safeguards against similar cases.

This is not a full review, but I'm out of time for the moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#12Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Robert Haas (#11)
3 attachment(s)
Re: Page Scan Mode in Hash Index

Hi,

I think you should consider refactoring this so that it doesn't need
to use goto. Maybe move the while (offnum <= maxoff) logic into a
helper function and have it return itemIndex. If itemIndex == 0, you
can call it again.

Okay, I have added a helper function for _hash_readpage(). Please check
the v4 patch attached to this mail.

Notice that the code in the "else" branch of the
if (itemIndex == 0) stuff could actually be removed from the else
block without changing anything, because the "if" block always either
returns or does a goto. That's worthy of a little more work to try to
make things simple and clear.

Corrected.

+ * We find the first item(or, if backward scan, the last item) in

Missing space.

Corrected.

In _hash_dropscanbuf(), the handling of hashso_bucket_buf is now
inconsistent with the handling of hashso_split_bucket_buf, which seems
suspicious.

I have corrected it.

Suppose we enter this function with so->hashso_bucket_buf
and so->currPos.buf both being valid buffers, but not the same one.
It looks to me as if we'll release the first pin but not the second
one.

Yes, that is because we do not need to release the pin on currPos.buf if
it is an overflow page. In page scan mode, once we have saved all the
qualified tuples from the current page (which could be an overflow page)
into the items array, we release both the pin and the lock on that
overflow page. That was not true earlier: consider a case where a cursor
is used to fetch only a fixed number of tuples from a page; once that
many tuples have been returned, we simply stop without releasing the pin
on the page being scanned. If that page is an overflow page, then by the
end of the scan we would be holding pins on two buffers, i.e. the bucket
buf and the current buf, which is basically the overflow buf.
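
To make that concrete, here is a rough sketch (mirroring the _hash_readpage()
hunk in the attached patches, not a verbatim copy) of how the current page's
buffer is handled once its matching tuples have been copied into the items
array:

    if (so->currPos.buf == so->hashso_bucket_buf ||
        so->currPos.buf == so->hashso_split_bucket_buf)
        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);    /* keep only the pin */
    else
        _hash_relbuf(rel, so->currPos.buf);                 /* drop lock and pin */

So a cursor that stops part-way through the saved items never leaves an
overflow page pinned.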

My guess (which could be wrong) is that so->hashso_bucket_buf =
InvalidBuffer should be moved back up higher in the function where it
was before, just after the first if statement, and that the new
condition so->hashso_bucket_buf == so->currPos.buf on the last
_hash_dropbuf() should be removed. If that's not right, then I think
I need somebody to explain why not.

Okay. As I mentioned above, in page scan mode we only keep a pin on the
bucket buf; there is no case where we would still be holding a pin on an
overflow buf at the end of the scan. So, basically, _hash_dropscanbuf()
should only attempt to release the pin on a bucket buf, and an attempt to
release the pin on so->currPos.buf should only happen when
'so->hashso_bucket_buf == so->currPos.buf' or when
'so->hashso_split_bucket_buf == so->currPos.buf'.
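
That rule boils down to something like the following sketch (names as in the
v4 patch; the real code is in the _hash_dropscanbuf() hunk of
0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev4.patch):

    /*
     * At scan end, drop the pin on currPos.buf only if it is one of the
     * bucket buffers whose pins are held for the whole scan.
     */
    if (BufferIsValid(so->currPos.buf) &&
        (so->currPos.buf == so->hashso_bucket_buf ||
         so->currPos.buf == so->hashso_split_bucket_buf))
        _hash_dropbuf(rel, so->currPos.buf);

    so->currPos.buf = InvalidBuffer;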

I am suspicious that _hash_kill_items() is going to have problems if
the overflow page is freed before it reacquires the lock.
_btkillitems() contains safeguards against similar cases.

I have added a similar safeguard to _hash_kill_items() as well.
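
The safeguard works along these lines (a sketch only; page, maxoff, numKilled
and killedsomething are the locals already used by _hash_kill_items(), and the
idea follows _bt_killitems()): after the page is re-locked or re-read, an item
is marked dead only if the tuple at the remembered offset still carries the
remembered heap TID, so nothing is touched if the page has changed, or the
overflow page has been freed and reused, while we held no lock.

    int     i;

    for (i = 0; i < numKilled; i++)
    {
        HashScanPosItem *killedItem = &so->killedItems[i];
        OffsetNumber     offnum = killedItem->indexOffset;

        if (offnum >= FirstOffsetNumber && offnum <= maxoff)
        {
            ItemId      iid = PageGetItemId(page, offnum);
            IndexTuple  itup = (IndexTuple) PageGetItem(page, iid);

            /* only kill it if it is still the tuple we remembered */
            if (ItemPointerEquals(&itup->t_tid, &killedItem->heapTid))
            {
                ItemIdMarkDead(iid);
                killedsomething = true;
            }
        }
    }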

This is not a full review, but I'm out of time for the moment.

No worries. I will be ready to address your further review comments at any time.
Thanks for the review.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev4.patch (application/x-patch)
From 3d0273f503d1645d6289bda78946a0af4b9e9f3a Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Mon, 27 Mar 2017 18:22:15 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a timev4

Patch by Ashutosh Sharma
---
 src/backend/access/hash/README       |   9 +-
 src/backend/access/hash/hash.c       | 124 ++--------
 src/backend/access/hash/hashpage.c   |  17 +-
 src/backend/access/hash/hashsearch.c | 445 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  42 +++-
 src/include/access/hash.h            |  46 +++-
 6 files changed, 515 insertions(+), 168 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..f0a7bdf 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -243,10 +243,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from the current index page into an items array
+    release pin and content lock (but if it is the primary bucket page, retain
+    its pin till the end of scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0bacef8..bd2827a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,23 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
 	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -346,9 +303,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
 			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+				currItem = &so->currPos.items[so->currPos.itemIndex];
+				offnum = currItem->indexOffset;
+
+				so->killedItems[so->numKilled].heapTid = currItem->heapTid;
+				so->killedItems[so->numKilled].indexOffset = offnum;
 				so->numKilled++;
 			}
 		}
@@ -358,30 +317,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +335,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns
+		 * dead tuples. Therefore, we can always add
+		 * the tuples into TIDBitmap without checking
+		 * if a tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +374,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -478,13 +401,12 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 
 	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-		_hash_kill_items(scan, false);
+		_hash_kill_items(scan);
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -509,7 +431,7 @@ hashendscan(IndexScanDesc scan)
 
 	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-		_hash_kill_items(scan, false);
+		_hash_kill_items(scan);
 
 	_hash_dropscanbuf(rel, so);
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..cd3c679 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,23 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
-	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
-	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf) &&
+		((so->hashso_bucket_buf == so->currPos.buf) ||
+		(so->hashso_split_bucket_buf == so->currPos.buf)))
+		_hash_dropbuf(rel, so->currPos.buf);
+
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_split_bucket_buf = InvalidBuffer;
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 414cc6a..80c51d1 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,144 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+			OffsetNumber offnum, OffsetNumber maxoff,
+			ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
 	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	HashPageOpaque opaque;
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more,
+	 * try to read data from next or prev page based on the scan
+	 * direction. Before moving to the next or prev page make sure
+	 * that we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else if (so->hashso_buc_populated && !so->hashso_buc_split)
+			{
+				/*
+				 * end of bucket, scan bucket being populated if there was a
+				 * split in progress at the start of scan.
+				 */
+				buf = so->currPos.buf = so->hashso_split_bucket_buf;
+				Assert(BufferIsValid(buf));
+				LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+				/*
+				 * setting hashso_buc_split to true indicates that we are
+				 * scanning bucket being split.
+				 */
+				so->hashso_buc_split = true;
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else if (so->hashso_buc_populated && so->hashso_buc_split)
+			{
+				/*
+				 * end of bucket, scan bucket being populated if there was a
+				 * split in progress at the start of scan.
+				 */
+				buf = so->hashso_split_bucket_buf;
+				Assert(BufferIsValid(buf));
+				LockBuffer(buf, BUFFER_LOCK_SHARE);
+				page = BufferGetPage(buf);
+				opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+				/* move to the end of bucket chain */
+				while (BlockNumberIsValid(opaque->hasho_nextblkno))
+					   _hash_readnext(scan, &buf, &page, &opaque);
+
+				/*
+				 * setting hashso_buc_split to false indicates that we are
+				 * scanning the bucket being populated.
+				 */
+
+				so->hashso_buc_split = false;
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		HashScanPosInvalidate(so->currPos);
+		_hash_dropscanbuf(rel, so);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +312,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page have been loaded into so->currPos;
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +333,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +454,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found a valid tuple */
 	return true;
 }
 
@@ -467,7 +563,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 					/* Before leaving current page, deal with any killed items */
 					if (so->numKilled > 0)
-						_hash_kill_items(scan, true);
+						_hash_kill_items(scan);
 
 					/*
 					 * ran off the end of this page, try the next
@@ -524,7 +620,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 					/* Before leaving current page, deal with any killed items */
 					if (so->numKilled > 0)
-						_hash_kill_items(scan, true);
+						_hash_kill_items(scan);
 
 					/*
 					 * ran off the end of this page, try the next
@@ -575,3 +671,266 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save those that
+ *	satisfy the qualification into so->currPos. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in the bucket chain as indicated by the scan direction.
+ *
+ *	Returns true if any matching items are found, else returns false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	maxoff;
+	OffsetNumber	offnum;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, maxoff, dir);
+
+		while (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page, offnum,
+													   maxoff, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found, so return false.
+				 * Also remember the prev and next block numbers so that,
+				 * when fetching tuples through a cursor, we know the page
+				 * from which to resume the scan.
+				 */
+				so->currPos.prevPage = (opaque)->hasho_prevblkno;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+		}
+
+		so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, maxoff, dir);
+
+		while (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page, offnum,
+													   maxoff, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found, so return false.
+				 * Also remember the prev and next block numbers so that,
+				 * when fetching tuples through a cursor, we know the page
+				 * from which to resume the scan.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+		}
+
+		so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+						   OffsetNumber maxoff, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple      itup;
+	int				itemIndex;
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		/* new page, relocate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index f24bc4c..17b0946 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -456,17 +456,20 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
- * The caller must have pin on so->hashso_curbuf, but may or may not have
- * read-lock, as indicated by haveLock.  Note that we assume read-lock
- * is sufficient for setting LP_DEAD hint bits.
+ * The caller must have a pin on so->currPos.buf, but will not have a
+ * read-lock on the current page.  Note that we assume a read-lock is
+ * sufficient for setting LP_DEAD hint bits.
  *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
  */
 void
-_hash_kill_items(IndexScanDesc scan, bool haveLock)
+_hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber	blkno;
+	Buffer	buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -476,10 +479,7 @@ _hash_kill_items(IndexScanDesc scan, bool haveLock)
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
-	Assert(BufferIsValid(so->hashso_curbuf));
-
-	if (!haveLock)
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
+	Assert(BufferIsValid(so->currPos.buf));
 
 	/*
 	 * Always reset the scan state, so we don't look for same
@@ -487,7 +487,23 @@ _hash_kill_items(IndexScanDesc scan, bool haveLock)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+	{
+		if (BlockNumberIsValid(blkno))
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+	}
+
+	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -519,9 +535,11 @@ _hash_kill_items(IndexScanDesc scan, bool haveLock)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
 
-	if (!haveLock)
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+	if (so->hashso_bucket_buf == so->currPos.buf)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 89fc319..3b01e3e 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,44 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];		/* MUST BE LAST */
+}	HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -147,6 +185,12 @@ typedef struct HashScanOpaqueData
 	/* info about killed items if any (killedItems is NULL if never used) */
 	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -393,7 +437,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 								   uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan, bool haveLock);
+extern void _hash_kill_items(IndexScanDesc scan);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
-- 
1.8.3.1

0002-Remove-redundant-function-_hash_step-and-some-of-the.patch (application/x-patch)
From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step(), which was used to
 find the next qualifying tuple in an index page, is no longer required because
 the new hash index scan works a page at a time, i.e. it reads all the
 qualifying tuples in a page at once with the help of the new function
 _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch (application/x-patch)
From 532287d11f4fd03e78a613cabb6751f8ab22fa59 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 22 Mar 2017 18:51:22 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index v2

Patch by Ashutosh Sharma
---
 src/backend/access/hash/hash.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 2450ee1..3ac8154 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -841,19 +841,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan now works in page-at-a-time mode,
+		 * vacuum can release the lock on the previous page before
+		 * acquiring the lock on the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
-- 
1.8.3.1
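
The comment added by 0003 captures the crux of the change: because scans now work page at a time, hashbucketcleanup no longer needs to hold the previous page's lock while it takes the next one. As a minimal, self-contained sketch of the two orderings (plain pthread locks and a hypothetical DemoPage chain, purely illustrative, not PostgreSQL's buffer-manager API), the difference looks like this:

#include <pthread.h>
#include <stddef.h>

/* Hypothetical overflow-page chain node; illustrative only. */
typedef struct DemoPage
{
	pthread_rwlock_t lock;
	struct DemoPage *next;
} DemoPage;

/* Old order: lock coupling -- take the next lock before dropping the current one. */
static DemoPage *
step_with_coupling(DemoPage *cur)
{
	DemoPage   *next = cur->next;	/* link is read while cur is still locked */

	if (next != NULL)
		pthread_rwlock_wrlock(&next->lock);
	pthread_rwlock_unlock(&cur->lock);
	return next;
}

/* New order (as in 0003): drop the current lock first, then take the next one. */
static DemoPage *
step_release_first(DemoPage *cur)
{
	DemoPage   *next = cur->next;	/* link is read while cur is still locked */

	pthread_rwlock_unlock(&cur->lock);
	if (next != NULL)
		pthread_rwlock_wrlock(&next->lock);
	return next;
}
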

#13Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Ashutosh Sharma (#12)
Re: Page Scan Mode in Hash Index

Hi,

On 03/27/2017 09:34 AM, Ashutosh Sharma wrote:

Hi,

I think you should consider refactoring this so that it doesn't need
to use goto. Maybe move the while (offnum <= maxoff) logic into a
helper function and have it return itemIndex. If itemIndex == 0, you
can call it again.

okay, Added a helper function for _hash_readpage(). Please check v4
patch attached with this mail.
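
As a minimal, self-contained illustration of the goto-free structure discussed above (all names and types here are hypothetical, not the patch code): the helper saves whatever qualifies on one page and returns a count, and the caller simply calls it again on the next page while the count is zero.

#include <stdbool.h>

/* Hypothetical page holding integer "tuples"; illustrative only. */
typedef struct
{
	const int  *vals;
	int			nvals;
} DemoPage;

/* Helper: save the values matching key from one page, return how many were saved. */
static int
load_qualified_items(const DemoPage *page, int key, int *items, int maxitems)
{
	int			n = 0;

	for (int off = 0; off < page->nvals && n < maxitems; off++)
	{
		if (page->vals[off] == key)
			items[n++] = page->vals[off];
	}
	return n;
}

/* Caller: no goto -- retry the helper on the next page while it finds nothing. */
static bool
read_chain(const DemoPage *pages, int npages, int key,
		   int *items, int maxitems, int *nitems)
{
	int			cur = 0;
	int			n;

	if (npages <= 0)
		return false;

	n = load_qualified_items(&pages[cur], key, items, maxitems);
	while (n == 0)
	{
		if (++cur >= npages)
			return false;		/* ran off the end of the chain */
		n = load_qualified_items(&pages[cur], key, items, maxitems);
	}

	*nitems = n;
	return true;
}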

This is not a full review, but I'm out of time for the moment.

No worries. I will be ready for your further review comments any time.
Thanks for the review.

This patch needs a rebase.

Best regards,
Jesper


#14Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Jesper Pedersen (#13)
Re: Page Scan Mode in Hash Index

I think you should consider refactoring this so that it doesn't need

to use goto. Maybe move the while (offnum <= maxoff) logic into a
helper function and have it return itemIndex. If itemIndex == 0, you
can call it again.

okay, Added a helper function for _hash_readpage(). Please check v4
patch attached with this mail.

This is not a full review, but I'm out of time for the moment.

No worries. I will be ready for your further review comments any time.
Thanks for the review.

This patch needs a rebase.

Please try applying these patches on top of [1]. I think you should be able
to apply them cleanly. Sorry, I think I forgot to mention this point in my
earlier mail.

[1]: /messages/by-id/CAE9k0P=V2LhtyeMXd295fhisp=NWUhRVJ9EZQCDowWiY9rSohQ@mail.gmail.com

#15Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Ashutosh Sharma (#14)
Re: Page Scan Mode in Hash Index

Hi Ashutosh,

On 03/29/2017 09:16 PM, Ashutosh Sharma wrote:

This patch needs a rebase.

Please try applying these patches on top of [1]. I think you should be able
to apply it cleanly. Sorry, I think I forgot to mention this point in my
earlier mail.

[1] -
/messages/by-id/CAE9k0P=V2LhtyeMXd295fhisp=NWUhRVJ9EZQCDowWiY9rSohQ@mail.gmail.com

Thanks, that works !

As you have provided a patch for Robert's comments, and no other review
have been posted I'm moving this patch to "Ready for Committer" for
additional committer feedback.

Best regards,
Jesper


#16Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Jesper Pedersen (#15)
3 attachment(s)
Re: Page Scan Mode in Hash Index

Hi,

On 03/29/2017 09:16 PM, Ashutosh Sharma wrote:

This patch needs a rebase.

Please try applying these patches on top of [1]. I think you should be
able
to apply it cleanly. Sorry, I think I forgot to mention this point in my
earlier mail.

[1] -

/messages/by-id/CAE9k0P=V2LhtyeMXd295fhisp=NWUhRVJ9EZQCDowWiY9rSohQ@mail.gmail.com

Thanks, that works !

As you have provided a patch for Robert's comments, and no other review have
been posted I'm moving this patch to "Ready for Committer" for additional
committer feedback.

Please find the attached new version of patches for page scan mode in
hash index, rebased on top of [1].

[1]: /messages/by-id/CAE9k0P=3rtgUtxopG+XQVxgASFzAnGd9sNmYUaj_=KeVsKGUdA@mail.gmail.com

Attachments:

0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev5.patch (text/x-patch)
From 498723199f4b14ff9917aca13abf30f9ea261ca7 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sat, 1 Apr 2017 12:09:46 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a timev5

Patch by Ashutosh Sharma
---
 src/backend/access/hash/README       |   9 +-
 src/backend/access/hash/hash.c       | 140 ++---------
 src/backend/access/hash/hashpage.c   |  17 +-
 src/backend/access/hash/hashsearch.c | 441 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  29 ++-
 src/include/access/hash.h            |  44 ++++
 6 files changed, 509 insertions(+), 171 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..f0a7bdf 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -243,10 +243,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from the current index page into an items array
+    release pin and content lock (but if it is the primary bucket page, retain
+    its pin till the end of the scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index b835f77..bd2827a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,23 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
 	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -346,9 +303,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
 			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+				currItem = &so->currPos.items[so->currPos.itemIndex];
+				offnum = currItem->indexOffset;
+
+				so->killedItems[so->numKilled].heapTid = currItem->heapTid;
+				so->killedItems[so->numKilled].indexOffset = offnum;
 				so->numKilled++;
 			}
 		}
@@ -358,30 +317,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +335,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns
+		 * dead tuples. Therefore, we can always add
+		 * the tuples into TIDBitmap without checking
+		 * if a tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +374,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,23 +399,14 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
+	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
 		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-	}
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -515,17 +429,9 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
+	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
 		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-	}
 
 	_hash_dropscanbuf(rel, so);
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..cd3c679 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,23 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
-	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
-	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf) &&
+		((so->hashso_bucket_buf == so->currPos.buf) ||
+		(so->hashso_split_bucket_buf == so->currPos.buf)))
+		_hash_dropbuf(rel, so->currPos.buf);
+
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_split_bucket_buf = InvalidBuffer;
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..80c51d1 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,144 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+			OffsetNumber offnum, OffsetNumber maxoff,
+			ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
 	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	HashPageOpaque opaque;
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more,
+	 * try to read data from next or prev page based on the scan
+	 * direction. Before moving to the next or prev page make sure
+	 * that we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else if (so->hashso_buc_populated && !so->hashso_buc_split)
+			{
+				/*
+				 * end of bucket, scan bucket being populated if there was a
+				 * split in progress at the start of scan.
+				 */
+				buf = so->currPos.buf = so->hashso_split_bucket_buf;
+				Assert(BufferIsValid(buf));
+				LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+				/*
+				 * setting hashso_buc_split to true indicates that we are
+				 * scanning bucket being split.
+				 */
+				so->hashso_buc_split = true;
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else if (so->hashso_buc_populated && so->hashso_buc_split)
+			{
+				/*
+				 * end of bucket, scan bucket being populated if there was a
+				 * split in progress at the start of scan.
+				 */
+				buf = so->hashso_split_bucket_buf;
+				Assert(BufferIsValid(buf));
+				LockBuffer(buf, BUFFER_LOCK_SHARE);
+				page = BufferGetPage(buf);
+				opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+				/* move to the end of bucket chain */
+				while (BlockNumberIsValid(opaque->hasho_nextblkno))
+					   _hash_readnext(scan, &buf, &page, &opaque);
+
+				/*
+				 * setting hashso_buc_split to false indicates that we are
+				 * now scanning the bucket being populated.
+				 */
+
+				so->hashso_buc_split = false;
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		HashScanPosInvalidate(so->currPos);
+		_hash_dropscanbuf(rel, so);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +312,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos,
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +333,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +454,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found a valid tuple */
 	return true;
 }
 
@@ -575,3 +671,266 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save those that
+ *	satisfy the qualification into so->currPos. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in the bucket chain as indicated by the scan direction.
+ *
+ *	Returns true if any matching items are found, else returns false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	maxoff;
+	OffsetNumber	offnum;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, maxoff, dir);
+
+		while (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page, offnum,
+													   maxoff, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found, so return false.
+				 * Also remember the prev and next block numbers so that,
+				 * when fetching tuples through a cursor, we know the page
+				 * from which to resume the scan.
+				 */
+				so->currPos.prevPage = (opaque)->hasho_prevblkno;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+		}
+
+		so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, maxoff, dir);
+
+		while (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page, offnum,
+													   maxoff, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found, so return false.
+				 * Also remember the prev and next block numbers so that,
+				 * when fetching tuples through a cursor, we know the page
+				 * from which to resume the scan.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+		}
+
+		so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+						   OffsetNumber maxoff, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple      itup;
+	int				itemIndex;
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		/* new page, relocate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..a4a03e0 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -463,6 +463,9 @@ void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer  buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -472,6 +475,7 @@ _hash_kill_items(IndexScanDesc scan)
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(BufferIsValid(so->currPos.buf));
 
 	/*
 	 * Always reset the scan state, so we don't look for same
@@ -479,7 +483,23 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+	{
+		if (BlockNumberIsValid(blkno))
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+	}
+
+	page = BufferGetPage(buf);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -511,6 +531,11 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..3b01e3e 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,44 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];		/* MUST BE LAST */
+}	HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -147,6 +185,12 @@ typedef struct HashScanOpaqueData
 	/* info about killed items if any (killedItems is NULL if never used) */
 	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-function-_hash_step-and-some-of-the.patch (text/x-patch; charset=US-ASCII)
From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step(), used to find the
 next qualifying tuple in the index page, is no longer required as the new hash
 index scan works page at a time, which means it reads all the qualifying
 tuples in a page at once with the help of a new function _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch (text/x-patch; charset=US-ASCII)
From 532287d11f4fd03e78a613cabb6751f8ab22fa59 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 22 Mar 2017 18:51:22 +0530
Subject: [PATCH] Improve locking strategy during VACUUM in Hash Index v2

Patch by Ashutosh Sharma
---
 src/backend/access/hash/hash.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 2450ee1..3ac8154 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -841,19 +841,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode,
+		 * vacuum can release the lock on the previous page before
+		 * acquiring the lock on the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
-- 
1.8.3.1

#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#12)
Re: Page Scan Mode in Hash Index

On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I am suspicious that _hash_kill_items() is going to have problems if
the overflow page is freed before it reacquires the lock.
_bt_killitems() contains safeguards against similar cases.

I have added a similar check for _hash_kill_items() as well.

I think _hash_kill_items has a much bigger problem, which is that we
can't kill the items if the page has been modified after re-reading
it. Consider a case where Vacuum starts before the Scan on the bucket
and then the Scan goes ahead (which is possible after your 0003 patch)
and notes the killed items in the killed-items array; before it can
kill all the items, Vacuum removes those items. Now it is quite
possible that before the scan tries to kill those items, the
corresponding itemids have been reused by different tuples. I think
what we need to do here is to store the LSN of the page the first time
we read it and then compare it with the current page LSN after
re-reading the page in _hash_kill_items.
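
For reference, here is a minimal sketch of that LSN check (this assumes
so->currPos grows an lsn field that is filled in when the page is first
read; the names are illustrative, not final):

    /* Sketch only: so->currPos.lsn is assumed to be saved at first read */
    page = BufferGetPage(buf);
    if (PageGetLSN(page) != so->currPos.lsn)
    {
        /*
         * The page changed since we read it, so the remembered itemids may
         * have been reused; applying LP_DEAD hints would not be safe.
         */
        _hash_relbuf(rel, buf);
        return;
    }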

*
+ HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+} HashScanPosData;
..
HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
  int numKilled; /* number of currently stored items */
+
+ /*
+ * Identify all the matching items on a page and save them
+ * in HashScanPosData
+ */
+ HashScanPosData currPos; /* current position data */
 } HashScanOpaqueData;

After having an array of HashScanPosItems as currPos.items, I think you
don't need an array of HashScanPosItem for killedItems; an integer
array of indexes into currPos.items should be sufficient.
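
Something along these lines (just a sketch to illustrate the idea; the
member names are assumptions):

    /* In HashScanOpaqueData: indexes into currPos.items[], not full items */
    int        *killedItems;    /* allocated on first use */
    int         numKilled;

    /* In hashgettuple(), when asked to kill the previously-returned tuple */
    if (so->killedItems == NULL)
        so->killedItems = (int *) palloc(MaxIndexTuplesPerPage * sizeof(int));
    if (so->numKilled < MaxIndexTuplesPerPage)
        so->killedItems[so->numKilled++] = so->currPos.itemIndex;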

*
I think the below paragraph in the README is not valid after this patch.

"To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
their current scan position after re-acquiring the buffer content lock
on page. Since deletion is not possible while a reader holds the pin
on bucket, and we assume that heap tuple TIDs are unique, this can be
implemented by searching for the same heap tuple TID previously
returned. Insertion does not move index entries across pages, so the
previously-returned index entry should always be on the same page, at
the same or higher offset number, as it was before."

*
- next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
- LH_OVERFLOW_PAGE,
- bstrategy);
-

  /*
- * release the lock on previous page after acquiring the lock on next
- * page
+ * As the hash index scan works in page-at-a-time mode,
+ * vacuum can release the lock on the previous page before
+ * acquiring the lock on the next page.
  */
..
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+   LH_OVERFLOW_PAGE,
+   bstrategy);
+

After this change, you need to update the comments on top of the functions
hashbucketcleanup() and _hash_squeezebucket().

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#18Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#17)
3 attachment(s)
Re: Page Scan Mode in Hash Index

Hi,

Thanks for the review.

I am suspicious that _hash_kill_items() is going to have problems if
the overflow page is freed before it reacquires the lock.
_bt_killitems() contains safeguards against similar cases.

I have added a similar check for _hash_kill_items() as well.

I think _hash_kill_items has a much bigger problem, which is that we
can't kill the items if the page has been modified after re-reading
it. Consider a case where Vacuum starts before the Scan on the bucket
and then the Scan goes ahead (which is possible after your 0003 patch)
and notes the killed items in the killed-items array; before it can
kill all the items, Vacuum removes those items. Now it is quite
possible that before the scan tries to kill those items, the
corresponding itemids have been reused by different tuples. I think
what we need to do here is to store the LSN of the page the first time
we read it and then compare it with the current page LSN after
re-reading the page in _hash_kill_items.

Okay, understood. I have made the changes to handle this type of
scenario. Please refer to the attached patches. Thanks.

*
+ HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+} HashScanPosData;
..
HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
int numKilled; /* number of currently stored items */
+
+ /*
+ * Identify all the matching items on a page and save them
+ * in HashScanPosData
+ */
+ HashScanPosData currPos; /* current position data */
} HashScanOpaqueData;

After having an array of HashScanPosItems as currPos.items, I think you
don't need an array of HashScanPosItem for killedItems; an integer
array of indexes into currPos.items should be sufficient.

Corrected.

*
I think the below paragraph in the README is not valid after this patch.

"To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
their current scan position after re-acquiring the buffer content lock
on page. Since deletion is not possible while a reader holds the pin
on bucket, and we assume that heap tuple TIDs are unique, this can be
implemented by searching for the same heap tuple TID previously
returned. Insertion does not move index entries across pages, so the
previously-returned index entry should always be on the same page, at
the same or higher offset number, as it was before."

I have modified the above paragraph in the README file.

*
- next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
- LH_OVERFLOW_PAGE,
- bstrategy);
-

/*
- * release the lock on previous page after acquiring the lock on next
- * page
+ * As the hash index scan works in page-at-a-time mode,
+ * vacuum can release the lock on the previous page before
+ * acquiring the lock on the next page.
*/
..
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+   LH_OVERFLOW_PAGE,
+   bstrategy);
+

After this change, you need to update the comments on top of the functions
hashbucketcleanup() and _hash_squeezebucket().

Done.

Please note that these patches need to be applied on top of [1].

[1]: /messages/by-id/CAE9k0P=3rtgUtxopG+XQVxgASFzAnGd9sNmYUaj_=KeVsKGUdA@mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev6.patch (text/x-patch; charset=US-ASCII)
From 52429bb8b8ecbacd499de51235c0396ab09b17d8 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 2 Apr 2017 03:38:00 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a time (v6)

Patch by Ashutosh
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 150 +++---------
 src/backend/access/hash/hashpage.c   |  17 +-
 src/backend/access/hash/hashsearch.c | 450 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  51 +++-
 src/include/access/hash.h            |  48 +++-
 6 files changed, 552 insertions(+), 189 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1541438..063656d 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -243,10 +243,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from the current index page into an items array
+    release pin and content lock (but if it is the primary bucket page, retain
+    its pin till the end of the scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -254,15 +255,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage. The heap tuple IDs are then processed while not
+holding any page lock within the index, thereby allowing concurrent insertions
+to happen on the same index page without requiring the reader to re-find its
+current scan position. We do continue to hold a pin on the bucket page, to
+protect against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index b835f77..582163a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,22 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +297,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * instead, we just forget any excess entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +309,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +327,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() and _hash_next() never return
+		 * dead tuples. Therefore, we can always add
+		 * the tuples to the TIDBitmap without checking
+		 * whether a tuple is dead.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +366,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,23 +391,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -515,16 +424,11 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 61ca2ec..cd3c679 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,23 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
-	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
-	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf) &&
+		((so->hashso_bucket_buf == so->currPos.buf) ||
+		(so->hashso_split_bucket_buf == so->currPos.buf)))
+		_hash_dropbuf(rel, so->currPos.buf);
+
+	so->hashso_bucket_buf = InvalidBuffer;
+	so->hashso_split_bucket_buf = InvalidBuffer;
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..43f6e98 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,144 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+			OffsetNumber offnum, OffsetNumber maxoff,
+			ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
 	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	HashPageOpaque opaque;
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more,
+	 * try to read data from next or prev page based on the scan
+	 * direction. Before moving to the next or prev page make sure
+	 * that we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else if (so->hashso_buc_populated && !so->hashso_buc_split)
+			{
+				/*
+				 * end of bucket, scan bucket being populated if there was a
+				 * split in progress at the start of scan.
+				 */
+				buf = so->currPos.buf = so->hashso_split_bucket_buf;
+				Assert(BufferIsValid(buf));
+				LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+				/*
+				 * setting hashso_buc_split to true indicates that we are
+				 * scanning bucket being split.
+				 */
+				so->hashso_buc_split = true;
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else if (so->hashso_buc_populated && so->hashso_buc_split)
+			{
+				/*
+				 * end of bucket, scan bucket being populated if there was a
+				 * split in progress at the start of scan.
+				 */
+				buf = so->hashso_split_bucket_buf;
+				Assert(BufferIsValid(buf));
+				LockBuffer(buf, BUFFER_LOCK_SHARE);
+				page = BufferGetPage(buf);
+				opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+				/* move to the end of bucket chain */
+				while (BlockNumberIsValid(opaque->hasho_nextblkno))
+					   _hash_readnext(scan, &buf, &page, &opaque);
+
+				/*
+				 * setting hashso_buc_split to false indicates that we are
+				 * scanning bucket being split.
+				 */
+
+				so->hashso_buc_split = false;
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +312,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos;
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +333,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +454,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found at least one valid tuple */
 	return true;
 }
 
@@ -575,3 +671,275 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if they satisfy the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Returns true if any matching items are found, else false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	maxoff;
+	OffsetNumber	offnum;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.  This allows us to
+	 * drop the pin for MVCC scans, which allows vacuum to avoid blocking.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, maxoff, dir);
+
+		while (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page, offnum,
+													   maxoff, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found, so return false
+				 * to indicate that. Also, remember the prev and next
+				 * block numbers so that if fetching tuples using a
+				 * cursor, we remember the page from which to resume
+				 * the scan.
+				 */
+				so->currPos.prevPage = (opaque)->hasho_prevblkno;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+		}
+
+		so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, maxoff, dir);
+
+		while (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				maxoff = PageGetMaxOffsetNumber(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page, offnum,
+													   maxoff, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found, so return false
+				 * to indicate that. Also, remember the prev and next
+				 * block numbers so that if fetching tuples using a
+				 * cursor, we remember the page from which to resume
+				 * the scan.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+		}
+
+		so->currPos.nextPage = (opaque)->hasho_nextblkno;
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+						   OffsetNumber maxoff, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple      itup;
+	int				itemIndex;
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		/* new page, relocate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 2e99719..1760446 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -463,6 +463,9 @@ void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer  buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -472,6 +475,8 @@ _hash_kill_items(IndexScanDesc scan)
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
+	Assert(BufferIsValid(so->currPos.buf));
 
 	/*
 	 * Always reset the scan state, so we don't look for same
@@ -479,20 +484,53 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+	}
+	else
+	{
+		if (BlockNumberIsValid(blkno))
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+		/*
+		 * If the page LSN differs, the page was modified since the last
+		 * read.  killedItems may no longer be valid, so applying LP_DEAD
+		 * hints is not safe.
+		 */
+		page = BufferGetPage(buf);
+		if (PageGetLSN(page) != so->currPos.lsn)
+		{
+			_hash_relbuf(rel, buf);
+			return;
+		}
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int				itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId	iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -511,6 +549,11 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index eb1df57..04fd14f 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,46 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr  lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];		/* MUST BE LAST */
+}	HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +185,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
+	int			*killedItems;		/* currPos.items indexes of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-function-_hash_step-and-some-of-the.patch (text/x-patch; charset=US-ASCII)
From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step(), used to find the
 next qualifying tuple in the index page, is no longer required as the new hash
 index scan works page at a time, which means it reads all the qualifying
 tuples in a page at once with the help of a new function _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch (text/x-patch)
From 21894a693904a6ec270906fee403880768ef3db5 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 2 Apr 2017 03:43:20 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index v3

---
 src/backend/access/hash/README     |  2 +-
 src/backend/access/hash/hash.c     | 21 ++++++++++-----------
 src/backend/access/hash/hashovfl.c |  4 +---
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 063656d..a3f2445 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -380,8 +380,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 582163a..4c82868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -670,11 +670,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be. But,
+ * we do not have to bother about it, as the hash index scan work in page
+ * at a time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -843,19 +841,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan work in page at a time mode,
+		 * vacuum can release the lock on previous page before
+		 * acquiring lock on the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..dc119a3 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -778,9 +778,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here.
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.3.1

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#18)
Re: Page Scan Mode in Hash Index

On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Please note that these patches need to be applied on top of [1].

A few more review comments:

1.
- page = BufferGetPage(so->hashso_curbuf);
+ blkno = so->currPos.currPage;
+ if (so->hashso_bucket_buf == so->currPos.buf)
+ {
+ buf = so->currPos.buf;
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+ }

Here, you are assuming that only the bucket page can be pinned, but I
think that assumption is not right. When _hash_kill_items() is called
before moving to the next page, there could be a pin on the overflow page.
You need some logic to check whether the buffer is already pinned, and in
that case just lock it. I think once you do that, it might not be
convenient to release the pin at the end of this function.

Add some comments on top of _hash_kill_items to explain the new
changes, or say something like "See _bt_killitems for details".
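
To make the suggested check concrete, here is a rough sketch of the kind of
logic _hash_kill_items() could use before taking the lock; it is only an
illustration using the fields and helpers that appear in the attached v7
patch, not the final code:

	if (so->currPos.buf == so->hashso_bucket_buf)
	{
		/* primary bucket page: its pin is held for the whole scan, just lock it */
		buf = so->currPos.buf;
		LockBuffer(buf, BUFFER_LOCK_SHARE);
	}
	else if (BufferIsValid(so->currPos.buf))
	{
		/* overflow page on which the scan still holds a pin */
		buf = so->currPos.buf;
		LockBuffer(buf, BUFFER_LOCK_SHARE);
	}
	else
	{
		/* no pin is held any more; the overflow page has to be re-read */
		buf = _hash_getbuf(rel, so->currPos.currPage, HASH_READ, LH_OVERFLOW_PAGE);
	}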

2.
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later.  This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = PageGetLSN(page);
+

The second part of the above comment doesn't make sense, because vacuum
will still be blocked: we hold the pin on the primary bucket page.
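
For reference, the point of saving the LSN at read time is that LP_DEAD
hints can later be applied safely even when the pin was not kept the whole
time; a minimal sketch of that guard, using only fields and functions from
the posted patch, is:

	/* at read time, while the content lock is still held */
	so->currPos.lsn = PageGetLSN(page);

	/* later, in _hash_kill_items(), after re-reading and locking the page */
	if (PageGetLSN(BufferGetPage(buf)) != so->currPos.lsn)
	{
		/* page changed since we read it; the saved offsets may be stale */
		_hash_relbuf(rel, buf);
		return;
	}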

3.
+ {
+ /*
+ * No more matching tuples were found. return FALSE
+ * indicating the same. Also, remember the prev and
+ * next block number so that if fetching tuples using
+ * cursor we remember the page from where to start the
+ * scan.
+ */
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

You can't read opaque without holding the lock, and by this time it has
already been released. Also, I think if you want to save the position for
cursor movement, then you need to save the position of the last bucket
when the scan completes the overflow chain; however, as you have written
it, it will always be an invalid block number. I think there is a similar
problem with the prev block number.

4.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ IndexTuple      itup;
+ int itemIndex;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ /* new page, relocate starting position by binary search */
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);

What is the need to find the offset number when this function already has
it as an input parameter?

5.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)

I think maxoff is not required to be passed as a parameter to this
function, as you need it only for the forward scan. You can compute it
locally.

6.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
{
..
+ if (ScanDirectionIsForward(dir))
+ {
..
+ while (offnum <= maxoff)
{
..
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
+ offnum = OffsetNumberNext(offnum);
..

Why are you traversing the whole page even when there is no match?
There is a similar problem in the backward scan. I think this is a blunder.
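
Because the tuples in a hash page are kept ordered by hash key (which is
what lets _hash_binsearch() work), the loop can stop as soon as the hash
key stops matching. A sketch of that early exit, leaving out the
moved-by-split and dead-tuple checks shown above, could be:

	while (offnum <= maxoff)
	{
		itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

		if (so->hashso_sk_hash != _hash_get_indextuple_hashkey(itup))
			break;			/* past the matching hash keys; stop scanning the page */

		if (_hash_checkqual(scan, itup))
		{
			/* tuple is qualified, so remember it */
			_hash_saveitem(so, itemIndex, offnum, itup);
			itemIndex++;
		}

		offnum = OffsetNumberNext(offnum);
	}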

7.
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ _hash_relbuf(rel, so->currPos.buf);
+ }
+
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

What makes you think it is safe to read opaque after releasing the lock?
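
In other words, the special-space fields need to be copied into so->currPos
while the content lock is still held, and only then should the buffer be
released; roughly (a sketch based on the names in the patch):

	/* copy what we need from the page while the lock is still held */
	so->currPos.prevPage = opaque->hasho_prevblkno;
	so->currPos.nextPage = opaque->hasho_nextblkno;

	if (so->currPos.buf == so->hashso_bucket_buf ||
		so->currPos.buf == so->hashso_split_bucket_buf)
		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);	/* keep the pin */
	else
		_hash_relbuf(rel, so->currPos.buf);					/* drop lock and pin */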

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#12)
Re: Page Scan Mode in Hash Index

On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

My guess (which could be wrong) is that so->hashso_bucket_buf =

InvalidBuffer should be moved back up higher in the function where it
was before, just after the first if statement, and that the new
condition so->hashso_bucket_buf == so->currPos.buf on the last
_hash_dropbuf() should be removed. If that's not right, then I think
I need somebody to explain why not.

Okay, as I mentioned above, in the case of page scan mode we only keep a
pin on the bucket buf. There won't be any case where we will be holding a
pin on an overflow buf at the end of the scan.

What if we mark the buffer as invalid after releasing the pin? We are
already doing it that way in the btree code; refer to
_bt_drop_lock_and_maybe_pin(). I think if we do it that way, then we can
do what Robert is suggesting.
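
A minimal sketch of that idea for hash, modelled on what
_bt_drop_lock_and_maybe_pin() does in the btree code, could look like the
following; the helper name and the exact snapshot test are assumptions
here, not part of the posted patches:

	/* hypothetical hash-side analogue of _bt_drop_lock_and_maybe_pin() */
	static void
	_hash_drop_lock_and_maybe_pin(IndexScanDesc scan, HashScanPosData *sp)
	{
		/* the content lock is always given up */
		LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);

		if (IsMVCCSnapshot(scan->xs_snapshot))
		{
			/* the pin can be dropped too; mark the buffer invalid in the scan position */
			ReleaseBuffer(sp->buf);
			sp->buf = InvalidBuffer;
		}
	}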

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#21Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#20)
Re: Page Scan Mode in Hash Index

On Tue, Apr 4, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

My guess (which could be wrong) is that so->hashso_bucket_buf =

InvalidBuffer should be moved back up higher in the function where it
was before, just after the first if statement, and that the new
condition so->hashso_bucket_buf == so->currPos.buf on the last
_hash_dropbuf() should be removed. If that's not right, then I think
I need somebody to explain why not.

Okay, as I mentioned above, in the case of page scan mode we only keep a
pin on the bucket buf. There won't be any case where we will be holding a
pin on an overflow buf at the end of the scan.

What if we mark the buffer as invalid after releasing the pin? We are
already doing it that way in the btree code; refer to
_bt_drop_lock_and_maybe_pin(). I think if we do it that way, then we can
do what Robert is suggesting.

Please continue reviewing, but I think we're out of time to get this
patch into v10. This patch seems to be still under fairly heavy
revision, and we're only a couple of days from feature freeze, and the
patch upon which it depends (page-at-a-time vacuum) has had no fewer
than four follow-up commits repairing various problems with the logic,
with no guarantee that we've found all the bugs yet. In view of those
facts, I don't think it would be wise for me to commit this to v10, so
I'm instead going to move it to the next CommitFest.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#22Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#19)
3 attachment(s)
Re: Page Scan Mode in Hash Index

Hi,

On Tue, Apr 4, 2017 at 3:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Please note that these patches need to be applied on top of [1].

A few more review comments:

1.
- page = BufferGetPage(so->hashso_curbuf);
+ blkno = so->currPos.currPage;
+ if (so->hashso_bucket_buf == so->currPos.buf)
+ {
+ buf = so->currPos.buf;
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+ }

Here, you are assuming that only the bucket page can be pinned, but I
think that assumption is not right. When _hash_kill_items() is called
before moving to the next page, there could be a pin on the overflow page.
You need some logic to check whether the buffer is already pinned, and in
that case just lock it. I think once you do that, it might not be
convenient to release the pin at the end of this function.

Yes, there are a few cases where we might have a pin on overflow pages too,
and we need to handle such cases in _hash_kill_items. I have taken
care of this in the attached v7 patch. Thanks.

Add some comments on top of _hash_kill_items to explain the new
changes, or say something like "See _bt_killitems for details".

Added some more comments on the new changes.

2.
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later.  This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = PageGetLSN(page);
+

The second part of the above comment doesn't make sense, because vacuum
will still be blocked: we hold the pin on the primary bucket page.

That's right. It doesn't make sense because we won't allow vacuum to
start. I have removed it.

3.
+ {
+ /*
+ * No more matching tuples were found. return FALSE
+ * indicating the same. Also, remember the prev and
+ * next block number so that if fetching tuples using
+ * cursor we remember the page from where to start the
+ * scan.
+ */
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

You can't read opaque without holding the lock, and by this time it has
already been released.

I have corrected it. Please refer to the attached v7 patch.

Also, I think if you want to save the position for cursor movement, then
you need to save the position of the last bucket when the scan completes
the overflow chain; however, as you have written it, it will always be an
invalid block number. I think there is a similar problem with the prev
block number.

Did you mean the last bucket or the last page? If it is the last page,
then I am already storing it.

4.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ IndexTuple      itup;
+ int itemIndex;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ /* new page, relocate starting position by binary search */
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);

What is the need to find the offset number when this function already has
it as an input parameter?

It's not required. I have removed it.

5.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)

I think maxoff is not required to be passed as a parameter to this
function, as you need it only for the forward scan. You can compute it
locally.

It is required for both forward and backward scans. However, I am no
longer passing it to _hash_load_qualified_items().

6.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
{
..
+ if (ScanDirectionIsForward(dir))
+ {
..
+ while (offnum <= maxoff)
{
..
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
+ offnum = OffsetNumberNext(offnum);
..

Why are you traversing the whole page even when there is no match?
There is a similar problem in the backward scan. I think this is a blunder.

Fixed. Please check the attached
'0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch'

7.
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ _hash_relbuf(rel, so->currPos.buf);
+ }
+
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

What makes you think it is safe to read opaque after releasing the lock?

Nothing makes me think it is safe to read opaque after releasing the lock.
It was a mistake; I have corrected it. Please check the attached v7 patch.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch (text/x-patch)
From c500a917a881948ccd373dcf65942b796abb6dda Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Mon, 8 May 2017 18:21:03 +0530
Subject: [PATCH] Rewrite hash index scans to work a page at a timev7

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 156 +++--------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 491 ++++++++++++++++++++++++++++++++---
 src/backend/access/hash/hashutil.c   |  70 ++++-
 src/include/access/hash.h            |  50 +++-
 6 files changed, 609 insertions(+), 193 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..eef7d66 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from current index page into an items array
+    release pin and content lock (but if it is primary bucket page retain
+    it's pin till the end of scan)
+    get tuple from an item array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, hash index scan always searches entire hash
+page to identify all the matching items at once, copying their heap tuple IDs
+into backend-local storage. The heap tuple IDs are then processed while not
+holding any page lock within the index thereby, allowing concurrent insertion
+to happen on a same index page without any requirement of re-finding the current
+scan position for reader. We do continue to hold a pin on the bucket page, to
+protect against concurrent deletions and bucket split.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3eb5b1d..4c60af1 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,22 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +297,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * instead, we just forget any excess entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +309,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +327,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns
+		 * dead tuples. Therefore, we can always add
+		 * the tuples into TIDBitmap without checking
+		 * if a tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +366,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,23 +391,16 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan, false);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -515,20 +423,14 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan, false);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	if (so->killedItems != NULL)
 		pfree(so->killedItems);
 	pfree(so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 3cd4daa..5861b82 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..a108ba1 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,156 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+			OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
 	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	HashPageOpaque opaque;
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more,
+	 * try to read data from next or prev page based on the scan
+	 * direction. Before moving to the next or prev page make sure
+	 * that we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, false);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else if (so->hashso_buc_populated && !so->hashso_buc_split)
+			{
+				/*
+				 * end of bucket, scan bucket being populated if there was a
+				 * split in progress at the start of scan.
+				 */
+				buf = so->currPos.buf = so->hashso_split_bucket_buf;
+				Assert(BufferIsValid(buf));
+				LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+				/*
+				 * setting hashso_buc_split to true indicates that we are
+				 * scanning bucket being split.
+				 */
+				so->hashso_buc_split = true;
+
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, false);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on bucket page for whole scan
+				 * operation, so releasing the additional pin we have acquired
+				 * here.
+				 */
+				if (buf == so->hashso_bucket_buf || buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else if (so->hashso_buc_populated && so->hashso_buc_split)
+			{
+				/*
+				 * end of bucket, scan bucket being populated if there was a
+				 * split in progress at the start of scan.
+				 */
+				buf = so->hashso_split_bucket_buf;
+				Assert(BufferIsValid(buf));
+				LockBuffer(buf, BUFFER_LOCK_SHARE);
+				page = BufferGetPage(buf);
+				opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+				/* move to the end of bucket chain */
+				while (BlockNumberIsValid(opaque->hasho_nextblkno))
+					   _hash_readnext(scan, &buf, &page, &opaque);
+
+				/*
+				 * setting hashso_buc_split to false indicates that we are
+				 * scanning bucket being split.
+				 */
+
+				so->hashso_buc_split = false;
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +324,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos,
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +345,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +466,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found a valid tuples */
 	return true;
 }
 
@@ -575,3 +683,304 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if it satifies the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	offnum;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we
+	 * know whether it safe to apply LP_DEAD hints to the
+	 * page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, true);
+
+			/*
+			 * We remember prev and next block number along with
+			 * current block number so that if fetching the tup-
+			 * les using cursor we know the page from where to
+			 * startwith. This is for the case where we have re-
+			 * ached the end of bucket chain without finding any
+			 * matching tuples. See comments in else part below.
+			 */
+			if (!BlockNumberIsValid((opaque)->hasho_nextblkno))
+			{
+				so->currPos.prevPage = (opaque)->hasho_prevblkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+			}
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found. return FALSE
+				 * indicating the same.
+				 */
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, true);
+
+			/*
+			 * We remember prev and next block number along with
+			 * current block number so that if fetching the tup-
+			 * les using cursor we know the page from where to
+			 * startwith. This is for the case where we have re-
+			 * ached the bucket page without finding any matching
+			 * tuples. See comments in else part below.
+			 */
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+			{
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			}
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found. return FALSE
+				 * indicating the same.
+				 */
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple      itup;
+	int				itemIndex;
+	OffsetNumber	maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+				/*
+				 * No more matching tuples exist in this page. so, exit
+				 * while loop.
+				 */
+				break;
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+				/*
+				 * No more matching tuples exist in this page. so, exit
+				 * while loop.
+				 */
+				break;
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9f832f2..8053072 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,28 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by havePin.
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from the page.
+ *
+ * See _bt_killitems() for more details.
  */
 void
-_hash_kill_items(IndexScanDesc scan)
+_hash_kill_items(IndexScanDesc scan, bool havePin)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer  buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -538,6 +553,7 @@ _hash_kill_items(IndexScanDesc scan)
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same
@@ -545,20 +561,58 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+	}
+	else
+	{
+		if (!havePin)
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+		else
+		{
+			buf = so->currPos.buf;
+			LockBuffer(buf, BUFFER_LOCK_SHARE);
+		}
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+		/*
+		 * If page LSN differs it means that the page was modified since the last
+		 * read. killedItems could be not valid so LP_DEAD hints applying is not
+		 * safe.
+		 */
+		page = BufferGetPage(buf);
+		if (PageGetLSN(page) != so->currPos.lsn)
+		{
+			_hash_relbuf(rel, buf);
+			return;
+		}
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int				itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId	iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -577,6 +631,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index adba224..ab9ebda 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,46 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr  lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];		/* MUST BE LAST */
+}	HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +185,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
+	int			*killedItems;		/* currPos.items indexes of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -411,7 +457,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 								   uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
+extern void _hash_kill_items(IndexScanDesc scan, bool havePin);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
-- 
1.8.3.1

0002-Remove-redundant-function-_hash_step-and-some-of-the.patch (text/x-patch)
From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step() used to find the
 next qualifing tuple in the index page is no more required as new hash index
 scan works page at a time which means it reads all the qualifing tuples in a
 page at once with the help of a new function _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch (text/x-patch)
From 21894a693904a6ec270906fee403880768ef3db5 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 2 Apr 2017 03:43:20 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index v3

---
 src/backend/access/hash/README     |  2 +-
 src/backend/access/hash/hash.c     | 21 ++++++++++-----------
 src/backend/access/hash/hashovfl.c |  4 +---
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 063656d..a3f2445 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -380,8 +380,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 582163a..4c82868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -670,11 +670,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be.  We
+ * need not worry about that, however, because hash index scans now work
+ * a page at a time.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -843,19 +841,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As hash index scans now work a page at a time, vacuum can
+		 * release the lock on the previous page before acquiring a
+		 * lock on the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..dc119a3 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -778,9 +778,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here.
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.3.1

#23Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Ashutosh Sharma (#22)
3 attachment(s)
Re: Page Scan Mode in Hash Index

While doing code coverage testing of the v7 patch shared in [1], I
found that a few lines of code in _hash_next() are redundant and need
to be removed. I noticed this while testing the scenario where a hash
index scan starts when a split is in progress. I have removed those
lines; attached is a newer version of the patch.

[1]: /messages/by-id/CAE9k0Pmn92Le0iOTU5b6087eRXzUnkq1PKcihxtokHJ9boXCBg@mail.gmail.com


On Mon, May 8, 2017 at 6:55 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,

On Tue, Apr 4, 2017 at 3:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Please note that these patches need to be applied on top of [1].

Few more review comments:

1.
- page = BufferGetPage(so->hashso_curbuf);
+ blkno = so->currPos.currPage;
+ if (so->hashso_bucket_buf == so->currPos.buf)
+ {
+ buf = so->currPos.buf;
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+ }

Here, you are assuming that only the bucket page can be pinned, but I
think that assumption is not right. When _hash_kill_items() is called
before moving to the next page, there could be a pin on the overflow page.
You need some logic to check if the buffer is pinned, and then just lock
it. I think once you do that, it might not be convenient to release
the pin at the end of this function.

Yes, there are a few cases where we might have a pin on an overflow page
too, and we need to handle such cases in _hash_kill_items. I have taken
care of this in the attached v7 patch. Thanks.
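
For readers following along, the essence of that handling in the attached patch (condensed and slightly simplified here; the real code is in the hashutil.c hunk of the attached patch, which also checks buffer validity) is roughly:

    /*
     * Condensed, illustrative extract of the locking logic in
     * _hash_kill_items() from the attached patch; buffer validity checks
     * and the actual hint-setting loop are omitted.
     */
    if (so->hashso_bucket_buf == so->currPos.buf)
    {
        /* primary bucket page: its pin is always held, just take the lock */
        buf = so->currPos.buf;
        LockBuffer(buf, BUFFER_LOCK_SHARE);
    }
    else if (havePin)
    {
        /* overflow page on which the scan still holds a pin */
        buf = so->currPos.buf;
        LockBuffer(buf, BUFFER_LOCK_SHARE);
    }
    else
    {
        /*
         * No pin held: re-read the page and verify that it was not modified
         * since its tuples were copied, otherwise the LP_DEAD hints could be
         * applied to the wrong items.
         */
        buf = _hash_getbuf(rel, so->currPos.currPage, HASH_READ,
                           LH_OVERFLOW_PAGE);
        if (PageGetLSN(BufferGetPage(buf)) != so->currPos.lsn)
        {
            _hash_relbuf(rel, buf);
            return;
        }
    }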

Add some comments on top of _hash_kill_items to explain the new
changes or say something like "See _bt_killitems for details"

Added some more comments on the new changes.

2.
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later.  This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = PageGetLSN(page);
+

The second part of the above comment doesn't make sense because vacuum
will still be blocked, since we hold the pin on the primary bucket page.

That's right. It doesn't make sense because we won't allow vacuum to
start. I have removed it.

3.
+ {
+ /*
+ * No more matching tuples were found. return FALSE
+ * indicating the same. Also, remember the prev and
+ * next block number so that if fetching tuples using
+ * cursor we remember the page from where to start the
+ * scan.
+ */
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

You can't read opaque without holding the lock, and by this time it has
already been released.

I have corrected it. Please refer to the attached v7 patch.

Also, I think if you want to save the position for cursor movement, then
you need to save the position of the last bucket when the scan completes
the overflow chain; however, as you have written it, it will always be an
invalid block number. I think there is a similar problem with the prev
block number.

Did you mean the last bucket or the last page? If it is the last page, then
I am already storing it.

4.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ IndexTuple      itup;
+ int itemIndex;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ /* new page, relocate starting position by binary search */
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);

What is the need to find offset number when this function already has
that as an input parameter?

It's not required. I have removed it.

5.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)

I think maxoff is not required to be passed as a parameter to this
function, as you need it only for the forward scan. You can compute it
locally.

It is required for both forward and backward scans. However, I am no
longer passing it to _hash_load_qualified_items().

6.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
{
..
+ if (ScanDirectionIsForward(dir))
+ {
..
+ while (offnum <= maxoff)
{
..
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
+ offnum = OffsetNumberNext(offnum);
..

Why are you traversing the whole page even when there is no match?
There is a similar problem in the backward scan. I think this is a blunder.

Fixed. Please check the attached
'0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch'

7.
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ _hash_relbuf(rel, so->currPos.buf);
+ }
+
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

What makes you think it is safe to read opaque after releasing the lock?

Nothing makes me think it is safe to read opaque after releasing the lock.
It's a mistake. I have corrected it. Please check the attached v7 patch.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch (text/x-patch)
From 21894a693904a6ec270906fee403880768ef3db5 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 2 Apr 2017 03:43:20 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index v3

---
 src/backend/access/hash/README     |  2 +-
 src/backend/access/hash/hash.c     | 21 ++++++++++-----------
 src/backend/access/hash/hashovfl.c |  4 +---
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 063656d..a3f2445 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -380,8 +380,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 582163a..4c82868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -670,11 +670,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be.  We
+ * need not worry about that, however, because hash index scans now work
+ * a page at a time.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -843,19 +841,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As hash index scans now work a page at a time, vacuum can
+		 * release the lock on the previous page before acquiring a
+		 * lock on the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..dc119a3 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -778,9 +778,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here.
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.3.1

0001-Rewrite-hash-index-scan-to-work-page-at-a-timev8.patch (text/x-patch)
From 3b4cad90446e29613b3574b64d9361774a0f7210 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 10 May 2017 12:25:57 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a timev8

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 156 +++---------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 445 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  70 +++++-
 src/include/access/hash.h            |  50 +++-
 6 files changed, 562 insertions(+), 194 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..eef7d66 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from the current index page into an items array
+    release pin and content lock (but if it is the primary bucket page, retain
+    its pin till the end of the scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage. The heap tuple IDs are then processed while not
+holding any page lock within the index, thereby allowing concurrent insertions
+on the same index page without the reader having to re-find its current scan
+position. We do continue to hold a pin on the bucket page, to protect against
+concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index df54638..a34b812 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,22 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +297,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * instead, we just forget any excess entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +309,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +327,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns
+		 * dead tuples. Therefore, we can always add
+		 * the tuples into TIDBitmap without checking
+		 * if a tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +366,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,23 +391,16 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan, false);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -515,20 +423,14 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan, false);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	if (so->killedItems != NULL)
 		pfree(so->killedItems);
 	pfree(so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index bf1ffff..78141e2 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..13b1d8d 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,108 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+			OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to the next tuple on the current page; or if there are no
+	 * more, try to read data from the next or prev page based on the scan
+	 * direction.  Before moving to the next or prev page, make sure that
+	 * we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, false);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, false);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on the bucket page for the whole
+				 * scan operation, so release the additional pin we have
+				 * acquired here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +276,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos,
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +418,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found at least one valid tuple */
 	return true;
 }
 
@@ -575,3 +635,304 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if they satisfy the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	offnum;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we
+	 * know whether it is safe to apply LP_DEAD hints to the
+	 * page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, true);
+
+			/*
+			 * We remember the prev and next block numbers along with the
+			 * current block number so that, when fetching tuples using a
+			 * cursor, we know which page to start from.  This is for the
+			 * case where we have reached the end of the bucket chain
+			 * without finding any matching tuples.  See comments in the
+			 * else part below.
+			 */
+			if (!BlockNumberIsValid((opaque)->hasho_nextblkno))
+			{
+				so->currPos.prevPage = (opaque)->hasho_prevblkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+			}
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found. return FALSE
+				 * indicating the same.
+				 */
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, true);
+
+			 * We remember the prev and next block numbers along with the
+			 * current block number so that, when fetching tuples using a
+			 * cursor, we know which page to start from.  This is for the
+			 * case where we have reached the bucket page without finding
+			 * any matching tuples.  See comments in the else part
+			 * below.
+			 * tuples. See comments in else part below.
+			 */
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+			{
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			}
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found. return FALSE
+				 * indicating the same.
+				 */
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple      itup;
+	int				itemIndex;
+	OffsetNumber	maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+				/*
+				 * No more matching tuples exist in this page. so, exit
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				break;
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+				/*
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 */
+				break;
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9f832f2..e5c3b1e 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,28 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by havePin.
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from the page.
+ *
+ * See _bt_killitems() for more details.
  */
 void
-_hash_kill_items(IndexScanDesc scan)
+_hash_kill_items(IndexScanDesc scan, bool havePin)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer  buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -538,6 +553,7 @@ _hash_kill_items(IndexScanDesc scan)
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same
@@ -545,20 +561,58 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+	}
+	else
+	{
+		if (!havePin)
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+		else
+		{
+			buf = so->currPos.buf;
+			LockBuffer(buf, BUFFER_LOCK_SHARE);
+		}
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+		/*
+		 * If the page LSN differs, it means that the page was modified since
+		 * the last read.  killedItems may no longer be valid, so applying
+		 * LP_DEAD hints is not safe.
+		 */
+		page = BufferGetPage(buf);
+		if (PageGetLSN(page) != so->currPos.lsn)
+		{
+			_hash_relbuf(rel, buf);
+			return;
+		}
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int				itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId	iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -577,6 +631,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index adba224..ab9ebda 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,46 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr  lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];		/* MUST BE LAST */
+}	HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +185,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
+	int			*killedItems;		/* currPos.items indexes of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -411,7 +457,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 								   uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
+extern void _hash_kill_items(IndexScanDesc scan, bool havePin);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
-- 
1.8.3.1
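
As a small worked example of the bookkeeping done by _hash_readpage() and _hash_load_qualified_items() in the patch above (the numbers below are illustrative only, using the field semantics from the v8 patch):

    /*
     * Illustrative only -- example of what the v8 patch leaves in
     * so->currPos after loading one page.
     *
     * Forward scan, 3 matches: items[0..2] are filled, so
     *   firstItem = 0, lastItem = 2, itemIndex = 0 (returned first).
     *
     * Backward scan, 3 matches: items[] is filled back-to-front, so the
     * top three slots are used:
     */
    so->currPos.firstItem = MaxIndexTuplesPerPage - 3;  /* first filled slot */
    so->currPos.lastItem  = MaxIndexTuplesPerPage - 1;  /* last filled slot  */
    so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;  /* returned first    */

    /*
     * Each later _hash_next(scan, BackwardScanDirection) call decrements
     * itemIndex until it drops below firstItem, at which point the scan
     * moves on to so->currPos.prevPage.
     */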

0002-Remove-redundant-function-_hash_step-and-some-of-the.patch (text/x-patch)
From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step(), which was used to
 find the next qualifying tuple in an index page, is no longer required because
 the new hash index scan works page at a time, i.e. it reads all the qualifying
 tuples in a page at once with the help of the new function _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

#24Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Ashutosh Sharma (#23)
3 attachment(s)
Re: Page Scan Mode in Hash Index

Hi,

On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

While doing code coverage testing of the v7 patch shared in [1], I
found that there are a few lines of code in _hash_next() that are
redundant and need to be removed. I noticed this while testing the
scenario where a hash index scan starts when a split is in progress. I
have removed those lines, and attached is a newer version of the patch.

Please find the new version of the patches for page-at-a-time scan in
hash index, rebased on top of the latest commit on the master branch.
Also, I have run the pgindent script with pg_bsd_indent version 2.0 on
all the modified files. Thanks.


[1] - /messages/by-id/CAE9k0Pmn92Le0iOTU5b6087eRXzUnkq1PKcihxtokHJ9boXCBg@mail.gmail.com

On Mon, May 8, 2017 at 6:55 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,

On Tue, Apr 4, 2017 at 3:56 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Please note that these patches needs to be applied on top of [1].

Few more review comments:

1.
- page = BufferGetPage(so->hashso_curbuf);
+ blkno = so->currPos.currPage;
+ if (so->hashso_bucket_buf == so->currPos.buf)
+ {
+ buf = so->currPos.buf;
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+ }

Here, you are assuming that only bucket page can be pinned, but I
think that assumption is not right. When _hash_kill_items() is called
before moving to next page, there could be a pin on the overflow page.
You need some logic to check if the buffer is pinned, then just lock
it. I think once you do that, it might not be convinient to release
the pin at the end of this function.

Yes, there are a few cases where we might hold a pin on an overflow
page too, and we need to handle such cases in _hash_kill_items. I have
taken care of this in the attached v7 patch. Thanks.
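
For reference, the re-locking logic in the attached patch is roughly as
follows (a simplified sketch of _hash_kill_items; the buffer validity
and page-LSN checks are elided here):

    if (so->hashso_bucket_buf == so->currPos.buf)
    {
        /* bucket page: its pin is held for the whole scan, just lock it */
        buf = so->currPos.buf;
        LockBuffer(buf, BUFFER_LOCK_SHARE);
    }
    else if (havePin)
    {
        /* overflow page on which we still hold a pin */
        buf = so->currPos.buf;
        LockBuffer(buf, BUFFER_LOCK_SHARE);
    }
    else
    {
        /* no pin held; re-read the page before applying LP_DEAD hints */
        buf = _hash_getbuf(rel, so->currPos.currPage, HASH_READ,
                           LH_OVERFLOW_PAGE);
    }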

Add some comments on top of _hash_kill_items to explain the new
changes, or say something like "See _bt_killitems for details".

Added some more comments on the new changes.

2.
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later.  This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = PageGetLSN(page);
+

The second part of the above comment doesn't make sense, because
vacuum will still be blocked as we hold the pin on the primary bucket
page.

That's right. It doesn't make sense because we won't allow vacuum to
start. I have removed it.
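
The saved LSN is still used in _hash_kill_items(): when the page has to
be re-read without a pin, the LP_DEAD hints are applied only if the LSN
is unchanged. Roughly (a sketch based on the attached patch):

    page = BufferGetPage(buf);
    if (PageGetLSN(page) != so->currPos.lsn)
    {
        /* page was modified since we read it; saved offsets may be stale */
        _hash_relbuf(rel, buf);
        return;
    }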

3.
+ {
+ /*
+ * No more matching tuples were found. return FALSE
+ * indicating the same. Also, remember the prev and
+ * next block number so that if fetching tuples using
+ * cursor we remember the page from where to start the
+ * scan.
+ */
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

You can't read opaque without holding a lock, and by this time it has
already been released.

I have corrected it. Please refer to the attached v7 patch.

Also, I think if you want to save the position for cursor movement,
then you need to save the position of the last bucket when the scan
completes the overflow chain; however, as you have written it, it will
always be an invalid block number. I think there is a similar problem
with the prev block number.

Did you mean the last bucket or the last page? If it is the last page,
then I am already storing it.

4.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ IndexTuple      itup;
+ int itemIndex;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ /* new page, relocate starting position by binary search */
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);

What is the need to find the offset number when this function already
has it as an input parameter?

It's not required. I have removed it.

5.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)

I think maxoff does not need to be passed as a parameter to this
function, as you need it only for the forward scan. You can compute it
locally.

It is required for both forward and backward scans. However, I am no
longer passing it to _hash_load_qualified_items().

6.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
{
..
+ if (ScanDirectionIsForward(dir))
+ {
..
+ while (offnum <= maxoff)
{
..
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
+ offnum = OffsetNumberNext(offnum);
..

Why are you traversing the whole page even when there is no match?
There is a similar problem in the backward scan. I think this is a
blunder.

Fixed. Please check the attached
'0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch'
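
With the fix, _hash_load_qualified_items() stops as soon as it leaves
the run of matching hash values instead of walking the rest of the
page. Roughly, for the forward case (a sketch; the skipping of
moved-by-split and dead tuples is elided):

    /* items on a hash page are kept in hash-value order */
    while (offnum <= maxoff)
    {
        itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

        if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
            _hash_checkqual(scan, itup))
        {
            /* tuple is qualified, so remember it */
            _hash_saveitem(so, itemIndex++, offnum, itup);
        }
        else
            break;              /* past the matching tuples, stop scanning */

        offnum = OffsetNumberNext(offnum);
    }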

7.
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ _hash_relbuf(rel, so->currPos.buf);
+ }
+
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

What makes you think it is safe to read opaque after releasing the lock?

It was a mistake to read opaque after releasing the lock. I have
corrected it. Please check the attached v7 patch.
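
For clarity, the corrected ordering in the attached patch reads the page
opaque while the content lock is still held, and only then releases it.
Roughly, at the end of _hash_readpage() for the forward case (sketch):

    if (so->currPos.buf == so->hashso_bucket_buf ||
        so->currPos.buf == so->hashso_split_bucket_buf)
    {
        /* primary bucket page: keep the pin, drop only the content lock */
        so->currPos.prevPage = InvalidBlockNumber;
        so->currPos.nextPage = opaque->hasho_nextblkno;
        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
    }
    else
    {
        /* overflow page: remember neighbours, then drop lock and pin */
        so->currPos.prevPage = opaque->hasho_prevblkno;
        so->currPos.nextPage = opaque->hasho_nextblkno;
        _hash_relbuf(rel, so->currPos.buf);
        so->currPos.buf = InvalidBuffer;
    }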

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scan-to-work-page-at-a-timev9.patchtext/x-patch; charset=US-ASCII; name=0001-Rewrite-hash-index-scan-to-work-page-at-a-timev9.patchDownload
From a155bed1a1510d030f2a004a91002274139d3b9a Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Sun, 30 Jul 2017 11:48:38 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a timev9

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 153 +++---------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 446 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  71 +++++-
 src/include/access/hash.h            |  50 +++-
 6 files changed, 561 insertions(+), 194 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..eef7d66 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from the current index page into an items array
+    release pin and content lock (but if it is the primary bucket page retain
+    its pin till the end of the scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage. The heap tuple IDs are then processed while not
+holding any page lock within the index, thereby allowing concurrent insertions
+to happen on the same index page without requiring the reader to re-find its
+current scan position. We do continue to hold a pin on the bucket page, to
+protect against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d89c192..2b858f0 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,22 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem *currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +297,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-					ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +309,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +327,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns dead tuples. Therefore,
+		 * we can always add the tuples into TIDBitmap without checking if a
+		 * tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +365,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,22 +390,16 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan, false);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -514,19 +422,14 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan, false);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	if (so->killedItems != NULL)
 		pfree(so->killedItems);
 	pfree(so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index d5b6502..90e1e55 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 3e461ad..85ab86d 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,108 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			   ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			   Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more, try to
+	 * read data from next or prev page based on the scan direction. Before
+	 * moving to the next or prev page make sure that we deal with all the
+	 * killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, false);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, false);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				 * We always maintain the pin on the bucket page for the whole
+				 * scan, so release the additional pin we have acquired here.
+				 * here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +276,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos;
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +418,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found at least one valid tuple */
 	return true;
 }
 
@@ -467,7 +527,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 					/* Before leaving current page, deal with any killed items */
 					if (so->numKilled > 0)
-						_hash_kill_items(scan);
+						_hash_kill_items(scan, true);
 
 					/*
 					 * ran off the end of this page, try the next
@@ -524,7 +584,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 					/* Before leaving current page, deal with any killed items */
 					if (so->numKilled > 0)
-						_hash_kill_items(scan);
+						_hash_kill_items(scan, true);
 
 					/*
 					 * ran off the end of this page, try the next
@@ -575,3 +635,301 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if they satisfy the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum;
+	uint16		itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the next page. Before leaving the current page, also deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, true);
+
+			/*
+			 * We remember the prev and next block numbers along with the
+			 * current block number so that, when fetching tuples using a
+			 * cursor, we know the page from where to start. This is for the
+			 * case where we have reached the end of the bucket chain without
+			 * finding any matching tuples. See comments in the else part below.
+			 */
+			if (!BlockNumberIsValid((opaque)->hasho_nextblkno))
+			{
+				so->currPos.prevPage = (opaque)->hasho_prevblkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+			}
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found, so return false to
+				 * indicate that.
+				 */
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the prev page. Before leaving the current page, also deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, true);
+
+			/*
+			 * We remember the prev and next block numbers along with the
+			 * current block number so that, when fetching tuples using a
+			 * cursor, we know the page from where to start. This is for the
+			 * case where we have reached the bucket page without finding any
+			 * matching tuples. See comments in the else part below.
+			 */
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+			{
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			}
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found, so return false to
+				 * indicate that.
+				 */
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple	itup;
+	int			itemIndex;
+	OffsetNumber maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+
+				/*
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 */
+				break;
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+
+				/*
+				 * No more matching tuples exist in this page. so, exit while
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				break;
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9b803af..5ae8713 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,28 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by havePin.
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep a pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from the page.
+ *
+ * See _bt_killitems() for more details.
  */
 void
-_hash_kill_items(IndexScanDesc scan)
+_hash_kill_items(IndexScanDesc scan, bool havePin)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
@@ -539,6 +554,7 @@ _hash_kill_items(IndexScanDesc scan)
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same items on other
@@ -546,20 +562,59 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+	}
+	else
+	{
+		if (!havePin)
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+		else
+		{
+			buf = so->currPos.buf;
+			LockBuffer(buf, BUFFER_LOCK_SHARE);
+		}
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+		/*
+		 * If the page LSN differs, it means that the page was modified since
+		 * the last read.  killedItems may no longer be valid, so applying
+		 * LP_DEAD hints is not safe.
+		 */
+		page = BufferGetPage(buf);
+		if (PageGetLSN(page) != so->currPos.lsn)
+		{
+			_hash_relbuf(rel, buf);
+			return;
+		}
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int			itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -578,6 +633,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 7fa868b..9335679 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,46 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
+}			HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +185,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem *killedItems;	/* tids and offset numbers of killed items */
+	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them in
+	 * HashScanPosData
+	 */
+	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -411,7 +457,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 								   uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
+extern void _hash_kill_items(IndexScanDesc scan, bool havePin);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
-- 
1.8.3.1

0002-Remove-redundant-function-_hash_step-and-some-of-the.patchtext/x-patch; charset=US-ASCII; name=0002-Remove-redundant-function-_hash_step-and-some-of-the.patchDownload
From 72f36b0fea59114b73a38622f34fb8d4e35a4731 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Sun, 30 Jul 2017 12:49:34 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step() used to find the
 next qualifying tuple in the index page is no longer required, as the new hash
 index scan works page at a time, which means it reads all the qualifying
 tuples in a page at once with the help of _hash_readpage().

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 85ab86d..2aa0b29 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -431,212 +431,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan, true);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan, true);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9335679..c11e87e 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -152,14 +152,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -170,12 +162,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -426,7 +412,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patchtext/x-patch; charset=US-ASCII; name=0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patchDownload
From e09ccbce2aa3388db37ec72e3c02e9593bbed4f9 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Sun, 30 Jul 2017 12:37:24 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index v4

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README     |  2 +-
 src/backend/access/hash/hash.c     | 21 ++++++++++-----------
 src/backend/access/hash/hashovfl.c |  4 +---
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index eef7d66..34a84ce 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 2b858f0..3d68af5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -663,11 +663,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be. But,
+ * we do not have to bother about it, as the hash index scan works in
+ * page-at-a-time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -836,19 +834,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode, vacuum can
+		 * release the lock on the previous page before acquiring a lock on
+		 * the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e70..3a7011d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -790,9 +790,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here.
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.3.1

#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#24)
Re: Page Scan Mode in Hash Index

On Sun, Jul 30, 2017 at 2:07 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,

On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

While doing the code coverage testing of the v7 patch shared in [1], I
found that there are a few lines of code in _hash_next() that are
redundant and need to be removed. I noticed this while testing the
scenario where a hash index scan starts when a split is in progress. I
have removed those lines; attached is the newer version of the patch.

Please find the new version of the patches for page-at-a-time scan in
hash index, rebased on top of the latest commit in the master branch.
Also, I have run the pgindent script with pg_bsd_indent version 2.0 on
all the modified files. Thanks.

Few comments:
1.
+_hash_kill_items(IndexScanDesc scan, bool havePin)

I think you can do without the second parameter. Can't we detect
inside _hash_kill_items whether the page is pinned or not as we do for
btree?

2.
+ /*
+ * We remember prev and next block number along with current block
+ * number so that if fetching the tup- les using cursor we know
+ * the page from where to startwith. This is for the case where we
+ * have re- ached the end of bucket chain without finding any
+ * matching tuples.

The spellings of "tuples" and "reached" contain an unwanted symbol. Add
a space in "startwith" or just use "begin".

3.
+ if (!BlockNumberIsValid((opaque)->hasho_nextblkno))
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = InvalidBlockNumber;
+ }

There is no need to use parentheses around opaque. I mean there is no
problem with that, but it is redundant and makes the code less readable.
Also, there is similar usage at other places in the code; please change
all the other places as well. I think you can save the value of
prevblkno in a local variable and use it in the else part.

4.
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+
+ /*
+ * No more matching tuples exist in this page. so, exit while
+ * loop.
+ */
+ break;

It is better to have braces for the else part. It makes the code look
better. The same type of code exists a few lines down as well; change
that too.

5.
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later.
+ */

"whether it safe"/"whether it is safe"

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#26Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#25)
Re: Page Scan Mode in Hash Index

On Fri, Aug 4, 2017 at 9:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

There is no need to use parentheses around opaque. I mean there is no
problem with that, but it is redundant and makes the code less readable.

Amit, I'm sure you know this, but just for the benefit of anyone who doesn't:

We often include these kinds of extra parentheses in macros, for good
reason. Suppose you have:

#define mul(x,y) x * y

If the user says mul(2+3,5), it will expand to 2 + 3 * 5 = 17, which
is wrong. If you instead do this:

#define mul(x,y) (x) * (y)

...then mul(2+3,5) expands to (2 + 3) * (5) = 25, which is what the
user of the macro is expecting to get.

Outside of macro definitions, as you say, there's no point and we
should avoid it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#25)
3 attachment(s)
Re: Page Scan Mode in Hash Index

On Fri, Aug 4, 2017 at 7:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 30, 2017 at 2:07 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,

On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

While doing the code coverage testing of the v7 patch shared in [1], I
found that there are a few lines of code in _hash_next() that are
redundant and need to be removed. I noticed this while testing the
scenario where a hash index scan starts when a split is in progress. I
have removed those lines; attached is the newer version of the patch.

Please find the new version of the patches for page-at-a-time scan in
hash index, rebased on top of the latest commit in the master branch.
Also, I have run the pgindent script with pg_bsd_indent version 2.0 on
all the modified files. Thanks.

Few comments:

Thanks for reviewing the patch.

1.
+_hash_kill_items(IndexScanDesc scan, bool havePin)

I think you can do without the second parameter. Can't we detect
inside _hash_kill_items whether the page is pinned or not as we do for
btree?

Okay, done that way. Please check attached v10 patch.

2.
+ /*
+ * We remember prev and next block number along with current block
+ * number so that if fetching the tup- les using cursor we know
+ * the page from where to startwith. This is for the case where we
+ * have re- ached the end of bucket chain without finding any
+ * matching tuples.

The spellings of "tuples" and "reached" contain an unwanted symbol. Add
a space in "startwith" or just use "begin".

Corrected.

3.
+ if (!BlockNumberIsValid((opaque)->hasho_nextblkno))
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = InvalidBlockNumber;
+ }

There is no need to use parentheses around opaque. I mean there is no
problem with that, but it is redundant and makes the code less readable.
Also, there is similar usage at other places in the code; please change
all the other places as well.

Removed the parentheses around opaque.

I think you can save the value of prevblkno in a local variable and use
it in the else part.

Okay, done that way.

4.
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+
+ /*
+ * No more matching tuples exist in this page. so, exit while
+ * loop.
+ */
+ break;

It is better to have braces for the else part. It makes the code look
better. The same type of code exists a few lines down as well; change
that too.

Added braces in the else part.

5.
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later.
+ */

"whether it safe"/"whether it is safe"

Corrected.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v10.patchtext/x-patch; charset=US-ASCII; name=0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v10.patchDownload
From 59e9a5f5afc31a3d14ae39bf5ae0cf21ee42f624 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Mon, 7 Aug 2017 16:06:52 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a time.

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 153 +++---------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 446 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  71 +++++-
 src/include/access/hash.h            |  55 ++++-
 6 files changed, 570 insertions(+), 190 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..eef7d66 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from the current index page into an items array
+    release pin and content lock (but if it is the primary bucket page, retain
+    its pin till the end of the scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, the hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage. The heap tuple IDs are then processed while not
+holding any page lock within the index, thereby allowing concurrent insertions
+to happen on the same index page without any need to re-find the current scan
+position for the reader. We do continue to hold a pin on the bucket page, to
+protect against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d89c192..08bfd6a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,22 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem *currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +297,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-					ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +309,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +327,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns dead tuples. Therefore,
+		 * we can always add the tuples into TIDBitmap without checking if a
+		 * tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +365,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,22 +390,16 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -514,19 +422,14 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	if (so->killedItems != NULL)
 		pfree(so->killedItems);
 	pfree(so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 08eaf1d..7a02e94 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 3e461ad..f4408ab 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,108 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			   ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			   Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more, try to
+	 * read data from next or prev page based on the scan direction. Before
+	 * moving to the next or prev page make sure that we deal with all the
+	 * killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on the bucket page for the whole
+				 * scan operation, so release the additional pin we have
+				 * acquired here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +276,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos,
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +418,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found valid tuples */
 	return true;
 }
 
@@ -575,3 +635,305 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if it satisfies the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum;
+	uint16		itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		BlockNumber prev_blkno = InvalidBlockNumber;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the next page. Before leaving the current page, also deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			/*
+			 * We remember prev and next block number along with current block
+			 * number so that if fetching the tuples using cursor we know the
+			 * page from where to begin. This is for the case where we have
+			 * reached the end of bucket chain without finding any matching
+			 * tuples.
+			 */
+			if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+				prev_blkno = opaque->hasho_prevblkno;
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found. Return FALSE indicating
+				 * the same.
+				 */
+				so->currPos.prevPage = prev_blkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = opaque->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = opaque->hasho_prevblkno;
+			so->currPos.nextPage = opaque->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		BlockNumber next_blkno = InvalidBlockNumber;
+
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == MaxIndexTuplesPerPage)
+		{
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the prev page. Before leaving the current page, also deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			/*
+			 * We remember prev and next block number along with current block
+			 * number so that if fetching the tuples using cursor we know the
+			 * page from where to begin. This is for the case where we have
+			 * reached the bucket page without finding any matching tuples.
+			 */
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				next_blkno = opaque->hasho_nextblkno;
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found. Return FALSE indicating
+				 * the same.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = next_blkno;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = opaque->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = opaque->hasho_prevblkno;
+			so->currPos.nextPage = opaque->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple	itup;
+	int			itemIndex;
+	OffsetNumber maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist in this page. so, exit while
+				 * loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist in this page. so, exit while
+				 * loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9b803af..bbc4296 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,28 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from the page.
+ *
+ * See _bt_killitems() for more details.
  */
 void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
@@ -536,9 +551,11 @@ _hash_kill_items(IndexScanDesc scan)
 	int			numKilled = so->numKilled;
 	int			i;
 	bool		killedsomething = false;
+	bool		havePin = false;
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same items on other
@@ -546,20 +563,60 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		HashScanPosIsPinned(so->currPos))
+	{
+		/*
+		 * We already have pin on this buffer, so, all we need to do is
+		 * acquire lock on it. The pin would have prevented re-use of any TID
+		 * on the page, so there is no need to check the LSN.
+		 */
+		havePin = true;
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+	}
+	else
+	{
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+		/*
+		 * If page LSN differs it means that the page was modified since the
+		 * last read. killedItems might no longer be valid, so applying
+		 * LP_DEAD hints is not safe.
+		 */
+		page = BufferGetPage(buf);
+		if (PageGetLSN(page) != so->currPos.lsn)
+		{
+			_hash_relbuf(rel, buf);
+			return;
+		}
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int			itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -578,6 +635,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 72fce30..3e90b89 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,53 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
+}			HashScanPosData;
+
+#define HashScanPosIsPinned(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BufferIsValid((scanpos).buf) \
+)
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +192,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem *killedItems;	/* tids and offset numbers of killed items */
+	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them in
+	 * HashScanPosData
+	 */
+	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-hash-function-_hash_step-and-do-som.patchtext/x-patch; charset=US-ASCII; name=0002-Remove-redundant-hash-function-_hash_step-and-do-som.patchDownload
From ef4180ffcaea44054d5b4894240be804c3970c6d Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Mon, 7 Aug 2017 16:22:19 +0530
Subject: [PATCH] Remove redundant hash function _hash_step and do some code
 cleanup.

Remove the redundant function _hash_step() and some of the unused members
of HashScanOpaqueData. The function _hash_step(), which used to find the
next qualifying tuple in the index page, is no longer required, as the new
hash index scan works page at a time, which means it reads all the
qualifying tuples in a page at once with the help of _hash_readpage().

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index f4408ab..58eb108 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -431,212 +431,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3e90b89..19fb147 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -159,14 +159,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -177,12 +169,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -432,7 +418,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patchtext/x-patch; charset=US-ASCII; name=0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patchDownload
From e09ccbce2aa3388db37ec72e3c02e9593bbed4f9 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Sun, 30 Jul 2017 12:37:24 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index v4

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README     |  2 +-
 src/backend/access/hash/hash.c     | 21 ++++++++++-----------
 src/backend/access/hash/hashovfl.c |  4 +---
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index eef7d66..34a84ce 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 2b858f0..3d68af5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -663,11 +663,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be. But,
+ * we do not have to bother about it, as the hash index scan works in
+ * page-at-a-time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -836,19 +834,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode, vacuum can
+		 * release the lock on the previous page before acquiring a lock on
+		 * the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e70..3a7011d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -790,9 +790,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here.
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.3.1

#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#27)
Re: Page Scan Mode in Hash Index

On Mon, Aug 7, 2017 at 5:50 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Fri, Aug 4, 2017 at 7:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jul 30, 2017 at 2:07 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi,

On Wed, May 10, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

While doing code coverage testing of the v7 patch shared in [1], I
found that there are a few lines of code in _hash_next() that are
redundant and need to be removed. I basically came to know this while
testing the scenario where a hash index scan starts when a split is in
progress. I have removed those lines, and attached is the newer version
of the patch.

Please find the new version of the patches for page-at-a-time scan in hash
index, rebased on top of the latest commit in the master branch. Also, I have
run the pgindent script with pg_bsd_indent version 2.0 on all the modified
files. Thanks.

Few comments:

Thanks for reviewing the patch.

Comments on the latest patch:

1.
/*
+ * We remember prev and next block number along with current block
+ * number so that if fetching the tuples using cursor we know the
+ * page from where to begin. This is for the case where we have
+ * reached the end of bucket chain without finding any matching
+ * tuples.
+ */
+ if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+ prev_blkno = opaque->hasho_prevblkno;

This doesn't seem to be the right place for this comment, as you are
not saving the next or current block number here. I am not sure whether
you really need this comment at all. Would it be better to move this
to the else part of the loop and reword it as:

"Remember next and previous block numbers for scrollable cursors to
know the start position."

2. The code in _hash_readpage looks quite bloated.  I think we can
make it better if we do something like below.
a.
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
..
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ so->currPos.nextPage = opaque->hasho_nextblkno;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ so->currPos.prevPage = opaque->hasho_prevblkno;
+ so->currPos.nextPage = opaque->hasho_nextblkno;
+ _hash_relbuf(rel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
..
}

This code chunk is the same for both forward and backward scans. I think
you can have a single copy of this code by moving it out of the if-else
block.

b.
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
..
+ /* new page, locate starting position by binary search */
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+ while (itemIndex == 0)
+ {
+ /*
+ * Could not find any matching tuples in the current page, move to
+ * the next page. Before leaving the current page, also deal with
+ * any killed items.
+ */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
+ /*
+ * We remember prev and next block number along with current block
+ * number so that if fetching the tuples using cursor we know the
+ * page from where to begin. This is for the case where we have
+ * reached the end of bucket chain without finding any matching
+ * tuples.
+ */
+ if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+ prev_blkno = opaque->hasho_prevblkno;
+
+ _hash_readnext(scan, &buf, &page, &opaque);
+ if (BufferIsValid(buf))
+ {
+ so->currPos.buf = buf;
+ so->currPos.currPage = BufferGetBlockNumber(buf);
+ so->currPos.lsn = PageGetLSN(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+ itemIndex = _hash_load_qualified_items(scan, page,
+   offnum, dir);
..
}

Have just one copy of the code that searches for the offset and loads the
qualified items. Change the above while loop to a do..while loop, with a
check in between that breaks out of the loop when itemIndex is not zero.
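
Roughly something like the following (just a sketch reusing the names from
the quoted code; the exit path for the no-more-pages case is elided):

	do
	{
		/* new page, locate starting position by binary search */
		offnum = _hash_binsearch(page, so->hashso_sk_hash);

		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);

		/* stop as soon as this page yields at least one match */
		if (itemIndex != 0)
			break;

		/* deal with any killed items before leaving the current page */
		if (so->numKilled > 0)
			_hash_kill_items(scan);

		_hash_readnext(scan, &buf, &page, &opaque);
		if (BufferIsValid(buf))
		{
			so->currPos.buf = buf;
			so->currPos.currPage = BufferGetBlockNumber(buf);
			so->currPos.lsn = PageGetLSN(page);
		}
	} while (BufferIsValid(buf));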

3.
- * Find the first item in the index that
- * satisfies the qualification associated with the scan descriptor. On
- * success, the page containing the current index tuple is read locked
- * and pinned, and the scan's opaque data entry is updated to
- * include the buffer.
+ * We find the first item (or, if backward scan, the last item) in
+ * the index that satisfies the qualification associated with the
+ * scan descriptor. On success, the page containing the current
+ * index tuple is read locked and pinned, and data about the
+ * matching tuple(s) on the page has been loaded into so->currPos,
+ * scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ * If there are no matching items in the index, we return FALSE,
+ * with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)

I don't think the lock or pin is held anymore on success; after this
patch, the only pin retained is the one on the bucket page. Also, there
is no need to modify the part of the comment which is not related to
the change in this patch.

I don't see scan->xs_ctup.t_self getting set anywhere in this
function. I think you are setting it in hashgettuple. It is better
to move that assignment from hashgettuple to _hash_first so as to be
consistent with _hash_next.
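
That is, once _hash_readpage() has filled so->currPos, _hash_first() can end
with roughly the same few lines _hash_next() uses (sketch only):

	/* OK, itemIndex says what to return */
	currItem = &so->currPos.items[so->currPos.itemIndex];
	scan->xs_ctup.t_self = currItem->heapTid;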

4.
+ * On successful exit, scan->xs_ctup.t_self is set to the TID
+ * of the next heap tuple, and if requested, scan->xs_itup
+ * points to a copy of the index tuple.  so->currPos is updated
+ * as needed.
+ *
+ * On failure exit (no more tuples), we return FALSE with no
+ * pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)

I don't see scan->xs_itup being used in this function. It seems to me
that you have copied the comment from the btree code and forgot to remove
the part which is not relevant to a hash index.

5.
@@ -514,19 +422,14 @@ hashendscan(IndexScanDesc scan)
{
..
+ if (HashScanPosIsValid(so->currPos))
  {
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
- _hash_kill_items(scan);
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+ _hash_dropscanbuf(rel, so);
  }

- _hash_dropscanbuf(rel, so);
-
..
}

I don't think it is a good idea to move _hash_dropscanbuf inside that
check, as the check only verifies that the current buffer is valid. It
doesn't say anything about the other buffers saved in HashScanOpaque. A
similar change is required in hashrescan.
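
In other words, keep _hash_dropscanbuf() outside the check, roughly like
(sketch only):

	if (HashScanPosIsValid(so->currPos))
	{
		/* Before leaving current page, deal with any killed items */
		if (so->numKilled > 0)
			_hash_kill_items(scan);
	}

	_hash_dropscanbuf(rel, so);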

6.
_hash_kill_items(IndexScanDesc scan)
{
..
+ if (so->hashso_bucket_buf == so->currPos.buf ||
+ HashScanPosIsPinned(so->currPos))
..
}

Isn't the second check enough? It should indirectly cover the first test.

7.
_hash_kill_items(IndexScanDesc scan)
{
..
+ /*
+ * If page LSN differs it means that the page was modified since the
+ * last read. killedItems could be not valid so LP_DEAD hints apply-
+ * ing is not safe.
+ */
+ page = BufferGetPage(buf);
+ if (PageGetLSN(page) != so->currPos.lsn)
+ {
+ _hash_relbuf(rel, buf);
+ return;
+ }
..
}

How does this check cover the case of unlogged tables?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#28)
Re: Page Scan Mode in Hash Index

On Wed, Aug 9, 2017 at 2:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Aug 7, 2017 at 5:50 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

7.
_hash_kill_items(IndexScanDesc scan)
{
..
+ /*
+ * If page LSN differs it means that the page was modified since the
+ * last read. killedItems could be not valid so LP_DEAD hints apply-
+ * ing is not safe.
+ */
+ page = BufferGetPage(buf);
+ if (PageGetLSN(page) != so->currPos.lsn)
+ {
+ _hash_relbuf(rel, buf);
+ return;
+ }
..
}

How does this check cover the case of unlogged tables?

I have thought about it, and I think we can't use the technique btree
uses (not releasing the pin on the page) to handle unlogged or temporary
relations. It works for btree because btree takes a cleanup lock on each
page before removing items from it, whereas in a hash index we take a
cleanup lock only on the primary bucket page. Now, one thing we could do
is start taking a cleanup lock on every page of the bucket (the primary
bucket page as well as the overflow pages), but I think that could turn
out to be worse than the current locking strategy. Another idea is to
preserve the current locking strategy (lock the next bucket page before
releasing the lock on the current page) during vacuum of an unlogged
hash index. That would ensure vacuum can't remove the TIDs which we are
going to mark as dead.
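
In hashbucketcleanup() that would look roughly like the below (just a
sketch; whether RelationNeedsWAL() is the right test to distinguish the
WAL-logged case from unlogged/temporary relations needs verification):

		if (RelationNeedsWAL(rel))
		{
			/*
			 * Logged relation: page-at-a-time scans do not chase the
			 * cleanup scan, so the prior page can be released before
			 * locking the next one.
			 */
			if (retain_pin)
				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
			else
				_hash_relbuf(rel, buf);

			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
												  LH_OVERFLOW_PAGE,
												  bstrategy);
		}
		else
		{
			/*
			 * Unlogged/temporary relation: keep the old lock chaining so
			 * that a concurrent scan cannot pass the cleanup scan and
			 * later apply stale LP_DEAD hints.
			 */
			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
												  LH_OVERFLOW_PAGE,
												  bstrategy);

			if (retain_pin)
				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
			else
				_hash_relbuf(rel, buf);
		}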

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#30Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#28)
3 attachment(s)
Re: Page Scan Mode in Hash Index

Comments on the latest patch:

Thank you once again for reviewing my patches.

1.
/*
+ * We remember prev and next block number along with current block
+ * number so that if fetching the tuples using cursor we know the
+ * page from where to begin. This is for the case where we have
+ * reached the end of bucket chain without finding any matching
+ * tuples.
+ */
+ if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+ prev_blkno = opaque->hasho_prevblkno;

This doesn't seem to be the right place for this comment, as you are
not saving the next or current block number here. I am not sure whether
you really need this comment at all. Would it be better to move this
to the else part of the loop and reword it as:

"Remember next and previous block numbers for scrollable cursors to
know the start position."

Shifted the comments to else part of the loop.

2. The code in _hash_readpage looks quite bloated.  I think we can
make it better if we do something like below.
a.
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
..
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ so->currPos.nextPage = opaque->hasho_nextblkno;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ so->currPos.prevPage = opaque->hasho_prevblkno;
+ so->currPos.nextPage = opaque->hasho_nextblkno;
+ _hash_relbuf(rel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
..
}

This code chunk is the same for both forward and backward scans. I think
you can have a single copy of this code by moving it out of the if-else
block.

Done.

b.
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
..
+ /* new page, locate starting position by binary search */
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+ while (itemIndex == 0)
+ {
+ /*
+ * Could not find any matching tuples in the current page, move to
+ * the next page. Before leaving the current page, also deal with
+ * any killed items.
+ */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
+ /*
+ * We remember prev and next block number along with current block
+ * number so that if fetching the tuples using cursor we know the
+ * page from where to begin. This is for the case where we have
+ * reached the end of bucket chain without finding any matching
+ * tuples.
+ */
+ if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+ prev_blkno = opaque->hasho_prevblkno;
+
+ _hash_readnext(scan, &buf, &page, &opaque);
+ if (BufferIsValid(buf))
+ {
+ so->currPos.buf = buf;
+ so->currPos.currPage = BufferGetBlockNumber(buf);
+ so->currPos.lsn = PageGetLSN(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+ itemIndex = _hash_load_qualified_items(scan, page,
+   offnum, dir);
..
}

Have just one copy of the code that searches for the offset and loads the
qualified items. Change the above while loop to a do..while loop, with a
check in between that breaks out of the loop when itemIndex is not zero.

Done that way.

3.
- * Find the first item in the index that
- * satisfies the qualification associated with the scan descriptor. On
- * success, the page containing the current index tuple is read locked
- * and pinned, and the scan's opaque data entry is updated to
- * include the buffer.
+ * We find the first item (or, if backward scan, the last item) in
+ * the index that satisfies the qualification associated with the
+ * scan descriptor. On success, the page containing the current
+ * index tuple is read locked and pinned, and data about the
+ * matching tuple(s) on the page has been loaded into so->currPos,
+ * scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ * If there are no matching items in the index, we return FALSE,
+ * with no pins or locks held.
*/
bool
_hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)

I don't think the lock or pin is held anymore on success; after this
patch, the only pin retained is the one on the bucket page. Also, there
is no need to modify the part of the comment which is not related to
the change in this patch.

Corrected.

I don't see scan->xs_ctup.t_self getting set anywhere in this
function. I think you are setting it in hashgettuple. It is better
to move that assignment from hashgettuple to _hash_first so as to be
consistent with _hash_next.

Done that way.

4.
+ * On successful exit, scan->xs_ctup.t_self is set to the TID
+ * of the next heap tuple, and if requested, scan->xs_itup
+ * points to a copy of the index tuple.  so->currPos is updated
+ * as needed.
+ *
+ * On failure exit (no more tuples), we return FALSE with no
+ * pins or locks held.
*/
bool
_hash_next(IndexScanDesc scan, ScanDirection dir)

I don't see scan->xs_itup being used in this function. It seems to me
that you have copied the comment from the btree code and forgot to remove
the part which is not relevant to a hash index.

Corrected.

5.
@@ -514,19 +422,14 @@ hashendscan(IndexScanDesc scan)
{
..
+ if (HashScanPosIsValid(so->currPos))
{
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
- _hash_kill_items(scan);
- LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+ _hash_dropscanbuf(rel, so);
}

- _hash_dropscanbuf(rel, so);
-
..
}

I don't think it is a good idea to move _hash_dropscanbuf inside that
check, as the check only verifies that the current buffer is valid. It
doesn't say anything about the other buffers saved in HashScanOpaque. A
similar change is required in hashrescan.

Done.

6.
_hash_kill_items(IndexScanDesc scan)
{
..
+ if (so->hashso_bucket_buf == so->currPos.buf ||
+ HashScanPosIsPinned(so->currPos))
..
}

Isn't the second check enough? It should indirectly cover the first test.

Yes, one check should be fine. Corrected it.

7.
_hash_kill_items(IndexScanDesc scan)
{
..
+ /*
+ * If page LSN differs it means that the page was modified since the
+ * last read. killedItems could be not valid so LP_DEAD hints apply-
+ * ing is not safe.
+ */
+ page = BufferGetPage(buf);
+ if (PageGetLSN(page) != so->currPos.lsn)
+ {
+ _hash_relbuf(rel, buf);
+ return;
+ }
..
}

How does this check cover the case of unlogged tables?

Thanks for raising that point. The check indeed doesn't cover the case of
unlogged tables. As you suggested in one of your emails on this thread, I am
now not allowing vacuum to release the lock on the current page before
acquiring the lock on the next page for unlogged tables. This ensures that a
scan always stays behind vacuum if they are running on the same bucket
simultaneously. Therefore, there is no danger in marking tuples as dead on
unlogged pages even though they do not carry a valid LSN.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v11.patch (text/x-patch)
From 818cb83a05ab114712e035f9d33bd3072352ff7d Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Fri, 11 Aug 2017 17:02:29 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a time.

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 146 ++----------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 426 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  69 +++++-
 src/include/access/hash.h            |  55 ++++-
 6 files changed, 544 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..eef7d66 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from the current index page into an items array
+    release pin and content lock (but if it is the primary bucket page,
+    retain its pin till the end of the scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage. The heap tuple IDs are then processed while
+not holding any page lock within the index, thereby allowing concurrent
+insertions to happen on the same index page without the reader needing to
+re-find its current scan position. We do continue to hold a pin on the bucket
+page, to protect against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d89c192..45a3a5a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,21 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +296,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-					ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +308,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-
-	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
 
 	return res;
 }
@@ -396,35 +322,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns dead tuples. Therefore,
+		 * we can always add the tuples into TIDBitmap without checking if a
+		 * tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +360,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,22 +385,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -514,15 +418,11 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 08eaf1d..7a02e94 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 3e461ad..c3a1514 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,107 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			   ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			   Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple. so->currPos is updated as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with a
+ *		pin held on the bucket page but no pins or locks held on the
+ *		overflow page.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more, try to
+	 * read data from next or prev page based on the scan direction. Before
+	 * moving to the next or prev page make sure that we deal with all the
+	 * killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on bucket page for whole scan
+				 * operation, so releasing the additional pin we have acquired
+				 * here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +275,18 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in the
+ *		index that satisfies the qualification associated with the scan
+ *		descriptor.
+ *
+ *		On successful exit, if the page containing the current index tuple
+ *		is an overflow page, both its pin and lock are released, whereas a
+ *		bucket page is kept pinned but not locked.  Data about the matching
+ *		tuple(s) on the page has been loaded into so->currPos, and
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		On failure exit (no more tuples), we return FALSE, with pin held on
+ *		bucket page but no pins or locks held on overflow page.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +299,10 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
+	HashScanPosItem *currItem;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +421,19 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
+
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
 		return false;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
+	/* if we're here, _hash_readpage found valid tuples */
 	return true;
 }
 
@@ -575,3 +642,280 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save those that
+ *	satisfy the qualification into so->currPos. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in the bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum;
+	uint16		itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		BlockNumber prev_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != 0)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the next page. Before leaving the current page, also deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+				prev_blkno = opaque->hasho_prevblkno;
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				continue;
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = prev_blkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		BlockNumber next_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != MaxIndexTuplesPerPage)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the prev page. Before leaving the current page, also deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				next_blkno = opaque->hasho_nextblkno;
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				continue;
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = next_blkno;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	if (so->currPos.buf == so->hashso_bucket_buf ||
+		so->currPos.buf == so->hashso_split_bucket_buf)
+	{
+		so->currPos.prevPage = InvalidBlockNumber;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	}
+	else
+	{
+		so->currPos.prevPage = opaque->hasho_prevblkno;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		_hash_relbuf(rel, so->currPos.buf);
+		so->currPos.buf = InvalidBuffer;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple	itup;
+	int			itemIndex;
+	OffsetNumber maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9b803af..2b8da90 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,28 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from the page.
+ *
+ * See _bt_killitems() for more details.
  */
 void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
@@ -536,9 +551,11 @@ _hash_kill_items(IndexScanDesc scan)
 	int			numKilled = so->numKilled;
 	int			i;
 	bool		killedsomething = false;
+	bool		havePin = false;
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same items on other
@@ -546,20 +563,58 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (HashScanPosIsPinned(so->currPos))
+	{
+		/*
+		 * We already have pin on this buffer, so, all we need to do is
+		 * acquire lock on it.
+		 */
+		havePin = true;
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+	}
+	else
+	{
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+		/*
+		 * If the page LSN differs, it means the page was modified since the
+		 * last read.  killedItems may no longer be valid, so applying
+		 * LP_DEAD hints is not safe.
+		 */
+		page = BufferGetPage(buf);
+		if (PageGetLSN(page) != so->currPos.lsn)
+		{
+			_hash_relbuf(rel, buf);
+			return;
+		}
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int			itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -578,6 +633,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 72fce30..3e90b89 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,53 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
+}			HashScanPosData;
+
+#define HashScanPosIsPinned(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BufferIsValid((scanpos).buf) \
+)
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +192,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem *killedItems;	/* tids and offset numbers of killed items */
+	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them in
+	 * HashScanPosData
+	 */
+	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-hash-function-_hash_step-and-do-som.patch (text/x-patch)
From ef4180ffcaea44054d5b4894240be804c3970c6d Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Mon, 7 Aug 2017 16:22:19 +0530
Subject: [PATCH] Remove redundant hash function _hash_step and do some code
 cleanup.

Remove the redundant function _hash_step() and some of the unused members
of HashScanOpaqueData. The function _hash_step(), which used to find the
next qualifying tuple in the index page, is no longer required because the
hash index now works page at a time, i.e. it reads all the qualifying
tuples in a page at once with the help of _hash_readpage().

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index f4408ab..58eb108 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -431,212 +431,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3e90b89..19fb147 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -159,14 +159,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -177,12 +169,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -432,7 +418,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v5.patch (text/x-patch)
From 99c1c7846b97446c48bb1ca262218d83b5dd5bf1 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Fri, 11 Aug 2017 17:59:47 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index for
 regular tables.

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README     |  2 +-
 src/backend/access/hash/hash.c     | 44 ++++++++++++++++++++++++++------------
 src/backend/access/hash/hashovfl.c |  4 +---
 3 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index eef7d66..34a84ce 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 45a3a5a..012e00f 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -660,11 +660,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be. But,
+ * we do not have to bother about it, as the hash index scan work in page
+ * at a time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -833,18 +831,36 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode, vacuum can
+		 * release the lock on previous page before acquiring lock on the next
+		 * page for regular tables, but, for unlogged tables, we avoid this as
+		 * we do not want scan to cross vacuum when both are running on the
+		 * same bucket page. This is to ensure that, we are safe during dead
+		 * marking of index tuples in _hash_kill_items().
 		 */
-		if (retain_pin)
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		if (RelationNeedsWAL(rel))
+		{
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+		}
 		else
-			_hash_relbuf(rel, buf);
+		{
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+		}
 
 		buf = next_buf;
 	}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e70..3a7011d 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -790,9 +790,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here.
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.3.1

#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#30)
Re: Page Scan Mode in Hash Index

On Fri, Aug 11, 2017 at 6:51 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

7.
_hash_kill_items(IndexScanDesc scan)
{
..
+ /*
+ * If page LSN differs it means that the page was modified since the
+ * last read. killedItems could be not valid so LP_DEAD hints apply-
+ * ing is not safe.
+ */
+ page = BufferGetPage(buf);
+ if (PageGetLSN(page) != so->currPos.lsn)
+ {
+ _hash_relbuf(rel, buf);
+ return;
+ }
..
}

How does this check cover the case of unlogged tables?

Thanks for putting that point. It doesn't cover the case for unlogged
tables. As suggested by you in one of your email in this mailing list, i am
now not allowing vacuum to release lock on current page before acquiring
lock on next page for unlogged tables. This will ensure that scan is always
behind vacuum if they are running on the same bucket simultaneously.
Therefore, there is danger in marking tuples as dead for unlogged pages even
if they are not having any lsn.

In the last line, I guess you wanted to say "there is *no* danger
..."? Today, while again thinking about this part of the patch
(_hash_kill_items) it occurred to me that we can't rely on a pin on an
overflow page to guarantee that it is not modified by Vacuum.
Consider a case where vacuum started vacuuming the bucket before the
scan and then in-between scan overtakes it. Now, it is possible that
even if a scan has a pin on a page (and no lock), vacuum might clean
that page; if that happens, then we can't prevent the reuse of TID.
What do you think?

Few other comments:

1.
+ * On failure exit (no more tuples), we return FALSE with pin
+ * pin held on bucket page but no pins or locks held on overflow
+ * page.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)

In the above part of comment 'pin' is used twice.

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v5

2.
- * not at all by the rearrangement we are performing here.  To prevent
- * any concurrent scan to cross the squeeze scan we use lock chaining
- * similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ * not at all by the rearrangement we are performing here.

In _hash_squeezebucket, we still use lock chaining, so removing the
above comment doesn't seem like a good idea. I think you should copy
part of a comment from hasbucketcleanup starting from "There can't be
any concurrent .."

3.
_hash_freeovflpage()
{
..

* Concurrency issues are avoided by using lock chaining as
* described atop hashbucketcleanup.
..
}

After fixing #2, you need to change the function name in above comment.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#32Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#31)
Re: Page Scan Mode in Hash Index

On Sat, Aug 19, 2017 at 11:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 11, 2017 at 6:51 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

7.
_hash_kill_items(IndexScanDesc scan)
{
..
+ /*
+ * If page LSN differs it means that the page was modified since the
+ * last read. killedItems could be not valid so LP_DEAD hints apply-
+ * ing is not safe.
+ */
+ page = BufferGetPage(buf);
+ if (PageGetLSN(page) != so->currPos.lsn)
+ {
+ _hash_relbuf(rel, buf);
+ return;
+ }
..
}

How does this check cover the case of unlogged tables?

Thanks for putting that point. It doesn't cover the case for unlogged
tables. As suggested by you in one of your email in this mailing list, i am
now not allowing vacuum to release lock on current page before acquiring
lock on next page for unlogged tables. This will ensure that scan is always
behind vacuum if they are running on the same bucket simultaneously.
Therefore, there is danger in marking tuples as dead for unlogged pages even
if they are not having any lsn.

Once again, Thank you for reviewing my patches.

In the last line, I guess you wanted to say "there is *no* danger
..."?

Yes, I meant that, because it ensures that the scan will always be following VACUUM.

Today, while again thinking about this part of the patch

(_hash_kill_items) it occurred to me that we can't rely on a pin on an
overflow page to guarantee that it is not modified by Vacuum.
Consider a case where vacuum started vacuuming the bucket before the
scan and then in-between scan overtakes it. Now, it is possible that
even if a scan has a pin on a page (and no lock), vacuum might clean
that page; if that happens, then we can't prevent the reuse of TID.
What do you think?

I think, you are talking about non-mvcc scan case, because in case of
mvcc scans, even if we have released both pin and lock on a page,
VACUUM can't remove tuples from a page if it is visible to some
concurrently running transactions (mvcc scan in our case). So, I don't
think it can happen in case of MVCC scans; however, it can happen for
non-MVCC scans, and to handle that, I think it is better that we drop
the idea of allowing the scan to overtake VACUUM (done by
0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v5.patch).

However, B-Tree has handled this in _bt_drop_lock_and_maybe_pin(),
where it releases both pin and lock on a page if the scan uses an MVCC
snapshot, and otherwise releases just the lock.
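
For reference, that B-Tree helper boils down to something like the
sketch below, written here against the currPos structure from the
attached patch; the hash-side function name is only illustrative and
is not part of any patch in this thread:

/*
 * Sketch only: analogous to B-Tree's _bt_drop_lock_and_maybe_pin().
 * Always give up the content lock; additionally drop the pin when the
 * snapshot is MVCC, since vacuum cannot remove heap tuples that are
 * still visible to an MVCC snapshot, so their TIDs cannot be reused
 * under us, and the page-LSN check in _hash_kill_items() detects any
 * later modification of the page.
 */
static void
_hash_drop_lock_and_maybe_pin(IndexScanDesc scan, HashScanOpaque so)
{
	LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);

	if (IsMVCCSnapshot(scan->xs_snapshot) &&
		RelationNeedsWAL(scan->indexRelation) &&
		!scan->xs_want_itup)
	{
		ReleaseBuffer(so->currPos.buf);
		so->currPos.buf = InvalidBuffer;
	}
}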

Few other comments:

1.
+ * On failure exit (no more tuples), we return FALSE with pin
+ * pin held on bucket page but no pins or locks held on overflow
+ * page.
*/
bool
_hash_next(IndexScanDesc scan, ScanDirection dir)

In the above part of comment 'pin' is used twice.

Okay, I will remove the extra 'pin' (from the comment) in my next version of the patch.

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v5

2.
- * not at all by the rearrangement we are performing here.  To prevent
- * any concurrent scan to cross the squeeze scan we use lock chaining
- * similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ * not at all by the rearrangement we are performing here.

In _hash_squeezebucket, we still use lock chaining, so removing the
above comment doesn't seem like a good idea. I think you should copy
part of a comment from hasbucketcleanup starting from "There can't be
any concurrent .."

Okay, I will correct it in my next version of the patch.

3.
_hash_freeovflpage()
{
..

* Concurrency issues are avoided by using lock chaining as
* described atop hashbucketcleanup.
..
}

After fixing #2, you need to change the function name in above comment.

Sure, I will correct that in my next version of the patch.

With Regards,
Ashutosh Sharma.
EnterpriseDB: http://www.enterprisedb.com


#33Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#32)
Re: Page Scan Mode in Hash Index

On Tue, Aug 22, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Sat, Aug 19, 2017 at 11:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 11, 2017 at 6:51 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

7.
_hash_kill_items(IndexScanDesc scan)
{
..
+ /*
+ * If page LSN differs it means that the page was modified since the
+ * last read. killedItems could be not valid so LP_DEAD hints apply-
+ * ing is not safe.
+ */
+ page = BufferGetPage(buf);
+ if (PageGetLSN(page) != so->currPos.lsn)
+ {
+ _hash_relbuf(rel, buf);
+ return;
+ }
..
}

How does this check cover the case of unlogged tables?

Thanks for putting that point. It doesn't cover the case for unlogged
tables. As suggested by you in one of your email in this mailing list, i am
now not allowing vacuum to release lock on current page before acquiring
lock on next page for unlogged tables. This will ensure that scan is always
behind vacuum if they are running on the same bucket simultaneously.
Therefore, there is danger in marking tuples as dead for unlogged pages even
if they are not having any lsn.

Once again, Thank you for reviewing my patches.

In the last line, I guess you wanted to say "there is *no* danger
..."?

Yes, I meant that, because it ensures that the scan will always be following VACUUM.

Today, while again thinking about this part of the patch

(_hash_kill_items) it occurred to me that we can't rely on a pin on an
overflow page to guarantee that it is not modified by Vacuum.
Consider a case where vacuum started vacuuming the bucket before the
scan and then in-between scan overtakes it. Now, it is possible that
even if a scan has a pin on a page (and no lock), vacuum might clean
that page, if that happens, then we can't prevent the reuse of TID.
What do you think?

I think, you are talking about non-mvcc scan case, because in case of
mvcc scans, even if we have released both pin and lock on a page,
VACUUM can't remove tuples from a page if it is visible to some
concurrently running transactions (mvcc scan in our case).

I am talking about tuples that are marked as dead in heap. It has
nothing to do with the visibility of tuple or type of scan (mvcc or
non-mvcc).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#34Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#33)
3 attachment(s)
Re: Page Scan Mode in Hash Index

On Tue, Aug 22, 2017 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 22, 2017 at 2:28 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Sat, Aug 19, 2017 at 11:37 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Aug 11, 2017 at 6:51 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

7.
_hash_kill_items(IndexScanDesc scan)
{
..
+ /*
+ * If page LSN differs it means that the page was modified since the
+ * last read. killedItems could be not valid so LP_DEAD hints apply-
+ * ing is not safe.
+ */
+ page = BufferGetPage(buf);
+ if (PageGetLSN(page) != so->currPos.lsn)
+ {
+ _hash_relbuf(rel, buf);
+ return;
+ }
..
}

How does this check cover the case of unlogged tables?

Thanks for putting that point. It doesn't cover the case for unlogged
tables. As suggested by you in one of your email in this mailing list, i am
now not allowing vacuum to release lock on current page before acquiring
lock on next page for unlogged tables. This will ensure that scan is always
behind vacuum if they are running on the same bucket simultaneously.
Therefore, there is danger in marking tuples as dead for unlogged pages even
if they are not having any lsn.

Once again, Thank you for reviewing my patches.

In the last line, I guess you wanted to say "there is *no* danger
..."?

Yes, I meant that, because it ensures that the scan will always be following VACUUM.

Today, while again thinking about this part of the patch

(_hash_kill_items) it occurred to me that we can't rely on a pin on an
overflow page to guarantee that it is not modified by Vacuum.
Consider a case where vacuum started vacuuming the bucket before the
scan and then in-between scan overtakes it. Now, it is possible that
even if a scan has a pin on a page (and no lock), vacuum might clean
that page; if that happens, then we can't prevent the reuse of TID.
What do you think?

I think, you are talking about non-mvcc scan case, because in case of
mvcc scans, even if we have released both pin and lock on a page,
VACUUM can't remove tuples from a page if it is visible to some
concurrently running transactions (mvcc scan in our case).

I am talking about tuples that are marked as dead in heap. It has
nothing to do with the visibility of tuple or type of scan (mvcc or
non-mvcc).

Okay, I got your point now. I think that currently, in _hash_kill_items(),
if an overflow page is pinned we do not check whether it got modified
since the last read. Hence, if vacuum runs on an overflow page that is
pinned and also has some dead tuples in it, it could create a problem for
the scan when the scan attempts to mark the killed items as dead. To get
rid of such a problem, I think that even if an overflow page is pinned we
should check whether it got modified since the last read was performed on
the page. If yes, then we should not allow the scan to mark the killed
items as dead. Attached is the newer version with these changes, along
with some other cosmetic changes mentioned in your earlier email. Thanks.
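
To make the resulting change easier to spot, the guard in
_hash_kill_items() essentially amounts to the following (a simplified
excerpt of the attached v12 patch; havePin records whether the scan
still held a pin on the page or had to re-read the buffer):

	/*
	 * If the page LSN differs from the one remembered in so->currPos,
	 * the page was modified since we read it, so the saved offsets may
	 * be stale and applying LP_DEAD hints is not safe.
	 */
	page = BufferGetPage(buf);
	if (PageGetLSN(page) != so->currPos.lsn)
	{
		if (havePin)
			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
		else
			_hash_relbuf(rel, buf);
		return;
	}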

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v12.patch (text/x-patch)
From cb724914ebb56c9b525e165b0117bfa70aac7692 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Tue, 22 Aug 2017 18:53:47 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a time.

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 146 ++----------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 426 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  72 +++++-
 src/include/access/hash.h            |  55 ++++-
 6 files changed, 547 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..eef7d66 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from current index page into an items array
+    release pin and content lock (but if it is the primary bucket page,
+    retain its pin till the end of the scan)
+    get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage. The heap tuple IDs are then processed while
+not holding any page lock within the index, thereby allowing concurrent
+insertions to happen on the same index page without any need to re-find the
+current scan position for the reader. We do continue to hold a pin on the
+bucket page, to protect against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d89c192..45a3a5a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,21 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +296,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-					ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +308,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-
-	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
 
 	return res;
 }
@@ -396,35 +322,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never return dead tuples. Therefore,
+		 * we can always add the tuples into TIDBitmap without checking if a
+		 * tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +360,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,22 +385,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -514,15 +418,11 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 7b2906b..e050a2a 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 3e461ad..8065fa8 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,107 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			   ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			   Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple. so->currPos is updated as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with pin
+ *		held on bucket page but no pins or locks held on overflow
+ *		page.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more, try to
+	 * read data from next or prev page based on the scan direction. Before
+	 * moving to the next or prev page make sure that we deal with all the
+	 * killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on bucket page for whole scan
+				 * operation, so releasing the additional pin we have acquired
+				 * here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +275,18 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in the
+ *		index that satisfies the qualification associated with the scan
+ *		descriptor.
+ *
+ *		On successful exit, if the page containing current index tuple is an
+ *		overflow page, both pin and lock are released whereas if it is a bucket
+ *		page then it is pinned but not locked and data about the matching
+ *		tuple(s) on the page has been loaded into so->currPos,
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		On failure exit (no more tuples), we return FALSE, with pin held on
+ *		bucket page but no pins or locks held on overflow page.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +299,10 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
+	HashScanPosItem *currItem;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +421,19 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
+
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
 		return false;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
+	/* if we're here, _hash_readpage found a valid tuple */
 	return true;
 }
 
@@ -575,3 +642,280 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if they satisfy the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum;
+	uint16		itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		BlockNumber prev_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != 0)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the next page. Before leaving the current page, also deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+				prev_blkno = opaque->hasho_prevblkno;
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				continue;
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = prev_blkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		BlockNumber next_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != MaxIndexTuplesPerPage)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the prev page. Before leaving the current page, also deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				next_blkno = opaque->hasho_nextblkno;
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				continue;
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = next_blkno;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	if (so->currPos.buf == so->hashso_bucket_buf ||
+		so->currPos.buf == so->hashso_split_bucket_buf)
+	{
+		so->currPos.prevPage = InvalidBlockNumber;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	}
+	else
+	{
+		so->currPos.prevPage = opaque->hasho_prevblkno;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		_hash_relbuf(rel, so->currPos.buf);
+		so->currPos.buf = InvalidBuffer;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple	itup;
+	int			itemIndex;
+	OffsetNumber maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist in this page. so, exit while
+				 * loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist in this page. so, exit while
+				 * loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9b803af..b0add62 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,28 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from the page.
+ *
+ * See _bt_killitems() for more details.
  */
 void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
@@ -536,9 +551,11 @@ _hash_kill_items(IndexScanDesc scan)
 	int			numKilled = so->numKilled;
 	int			i;
 	bool		killedsomething = false;
+	bool		havePin = false;
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same items on other
@@ -546,20 +563,61 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (HashScanPosIsPinned(so->currPos))
+	{
+		/*
+		 * We already have pin on this buffer, so, all we need to do is
+		 * acquire lock on it.
+		 */
+		havePin = true;
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+	{
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+	}
+
+	/*
+	 * If page LSN differs it means that the page was modified since the last
+	 * read.  killedItems might not be valid, so applying LP_DEAD hints is
+	 * not safe.
+	 */
+	page = BufferGetPage(buf);
+	if (PageGetLSN(page) != so->currPos.lsn)
+	{
+		if (havePin)
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		else
+			_hash_relbuf(rel, buf);
+		return;
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int			itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -578,6 +636,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 72fce30..3e90b89 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,53 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
+}			HashScanPosData;
+
+#define HashScanPosIsPinned(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BufferIsValid((scanpos).buf) \
+)
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +192,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem *killedItems;	/* tids and offset numbers of killed items */
+	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them in
+	 * HashScanPosData
+	 */
+	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-hash-function-_hash_step-and-do-som.patch (text/x-patch)
From ef4180ffcaea44054d5b4894240be804c3970c6d Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Mon, 7 Aug 2017 16:22:19 +0530
Subject: [PATCH] Remove redundant hash function _hash_step and do some code
 cleanup.

Remove redundant function _hash_step() and some of the unused members
of HashScanOpaqueData. The function _hash_step(), used to find the next
qualifying tuple in the index page, is no longer required as the new hash
index scan works page at a time, which means it reads all the qualifying
tuples in a page at once with the help of _hash_readpage().

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index f4408ab..58eb108 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -431,212 +431,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3e90b89..19fb147 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -159,14 +159,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -177,12 +169,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -432,7 +418,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v6.patch (text/x-patch)
From 35f5e8116654a1dd63844c0ff0800f9a9ce07fec Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Tue, 22 Aug 2017 18:40:31 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index for
 regular tables.

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/README     |  2 +-
 src/backend/access/hash/hash.c     | 44 ++++++++++++++++++++++++++------------
 src/backend/access/hash/hashovfl.c | 13 +++++++----
 3 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index eef7d66..34a84ce 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 45a3a5a..012e00f 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -660,11 +660,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be.  We
+ * do not need to worry about that, though, because hash index scans now
+ * work in page-at-a-time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -833,18 +831,36 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode, vacuum can
+		 * release the lock on the previous page before acquiring a lock on
+		 * the next page for regular tables.  For unlogged tables we avoid
+		 * this, as we do not want a scan to cross vacuum when both are
+		 * running on the same bucket page; that keeps the dead-marking of
+		 * index tuples in _hash_kill_items() safe.
 		 */
-		if (retain_pin)
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		if (RelationNeedsWAL(rel))
+		{
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+		}
 		else
-			_hash_relbuf(rel, buf);
+		{
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+		}
 
 		buf = next_buf;
 	}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e70..b41afbb 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -524,7 +524,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
-	 * described atop hashbucketcleanup.
+	 * described atop _hash_squeezebucket.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
@@ -790,9 +790,14 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here. This means there
+ *	can't be any concurrent scans in progress when we first enter this
+ *	function because of the cleanup lock we hold on the primary bucket page,
+ *	but as soon as we release that lock, there might be. To prevent any
+ *	concurrent scan to cross the squeeze scan we use lock chaining i.e.
+ *	we lock the next page in the bucket chain before releasing the lock on
+ *	the previous page. (This type of lock chaining is not ideal, so we might
+ *	want to look for a better solution at some point.)
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.3.1

#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#34)
3 attachment(s)
Re: Page Scan Mode in Hash Index

On Tue, Aug 22, 2017 at 7:24 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Tue, Aug 22, 2017 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I got your point now. I think that, currently in
_hash_kill_items(), if an overflow page is pinned we do not check
whether it got modified since the last read. Hence, if vacuum runs on
a pinned overflow page that also has some dead tuples in it, it could
create a problem for the scan when the scan attempts to mark the
killed items as dead. To get rid of this problem, I think that even if
an overflow page is pinned we should check whether it got modified
since the last read was performed on the page. If it was, do not allow
the scan to mark the killed items as dead. Attached is the newer
version with these changes, along with some other cosmetic changes
mentioned in your earlier email. Thanks.
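
In short, the check described here compares the page LSN remembered
when the page was read (so->currPos.lsn) with the page's current LSN
before applying any LP_DEAD hints. A condensed sketch of that guard,
taken from the _hash_kill_items() hunk in the attached v13 patch (pin
and lock bookkeeping simplified; not standalone code):

	page = BufferGetPage(buf);
	if (PageGetLSN(page) != so->currPos.lsn)
	{
		/*
		 * The page was modified since we read it, so the remembered
		 * killedItems offsets may be stale; skip LP_DEAD marking.
		 */
		if (havePin)
			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
		else
			_hash_relbuf(rel, buf);
		return;
	}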

Thanks for the new version. I again looked at the patches and fixed
quite a few comments in the code and the README. You had forgotten to
update the README for the changes in the vacuum patch
(0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7). I
don't have anything more to add. If you are okay with the changes, we
can move it to Ready For Committer unless someone else has more
comments.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v13.patch (application/octet-stream)
From a7d2c4da29cf3e07f6a928ea45a58fb69ed42590 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Wed, 23 Aug 2017 15:58:14 +0530
Subject: [PATCH 1/3] Rewrite hash index scan to work page at a time.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 146 ++----------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 426 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  74 +++++-
 src/include/access/hash.h            |  55 ++++-
 6 files changed, 549 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..4465605 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+	the pin on the primary bucket throughout the scan)
+	save all the matching tuples from the current index page into an items array
+	release pin and content lock (but if it is the primary bucket page, retain
+	its pin till the end of the scan)
+	get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap
+tuple IDs into backend-local storage. The heap tuple IDs are then processed
+while not holding any page lock within the index, thereby allowing concurrent
+insertions to happen on the same index page without requiring the reader to
+re-find its current scan position. We do continue to hold a pin on the bucket
+page, to protect against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d89c192..45a3a5a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,21 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +296,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-					ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +308,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-
-	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
 
 	return res;
 }
@@ -396,35 +322,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns dead tuples. Therefore,
+		 * we can always add the tuples into TIDBitmap without checking if a
+		 * tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +360,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,22 +385,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -514,15 +418,11 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 7b2906b..e050a2a 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 3e461ad..0c3338a 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,107 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			   ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			   Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple. so->currPos is updated as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with pin
+ *		held on bucket page but no pins or locks held on overflow
+ *		page.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more, try to
+	 * read data from next or previous page based on the scan direction.
+	 * Before moving to the next or previous page make sure that we deal with
+	 * all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on the bucket page for the whole
+				 * scan operation, so release the additional pin we have
+				 * acquired here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +275,18 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in the
+ *		index that satisfies the qualification associated with the scan
+ *		descriptor.
+ *
+ *		On successful exit, data about the matching tuple(s) on the page has
+ *		been loaded into so->currPos and scan->xs_ctup.t_self is set to the
+ *		heap TID of the current tuple.  If the page containing the current
+ *		index tuple is an overflow page, its pin and lock are both released;
+ *		if it is a bucket page, it remains pinned but not locked.
+ *
+ *		On failure exit (no more tuples), we return FALSE, with pin held on
+ *		bucket page but no pins or locks held on overflow page.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +299,10 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
+	HashScanPosItem *currItem;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +421,19 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
+
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
 		return false;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
+	/* if we're here, _hash_readpage found valid tuples */
 	return true;
 }
 
@@ -575,3 +642,280 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if they satisfy the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found, else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum;
+	uint16		itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		BlockNumber prev_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != 0)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the next page. Before leaving the current page, deal with any
+			 * killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+				prev_blkno = opaque->hasho_prevblkno;
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				continue;
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = prev_blkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		BlockNumber next_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != MaxIndexTuplesPerPage)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the previous page. Before leaving the current page, deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				next_blkno = opaque->hasho_nextblkno;
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				continue;
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = next_blkno;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	if (so->currPos.buf == so->hashso_bucket_buf ||
+		so->currPos.buf == so->hashso_split_bucket_buf)
+	{
+		so->currPos.prevPage = InvalidBlockNumber;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	}
+	else
+	{
+		so->currPos.prevPage = opaque->hasho_prevblkno;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		_hash_relbuf(rel, so->currPos.buf);
+		so->currPos.buf = InvalidBuffer;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from the current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple	itup;
+	int			itemIndex;
+	OffsetNumber maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist in this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist in this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9b803af..3bf1b97 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,30 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep the pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from that page.  However,
+ * having pin on the overflow page doesn't guarantee that vacuum won't delete
+ * any items.
+ *
+ * See _bt_killitems() for more details.
  */
 void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
@@ -536,9 +553,11 @@ _hash_kill_items(IndexScanDesc scan)
 	int			numKilled = so->numKilled;
 	int			i;
 	bool		killedsomething = false;
+	bool		havePin = false;
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same items on other
@@ -546,20 +565,61 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (HashScanPosIsPinned(so->currPos))
+	{
+		/*
+		 * We already have pin on this buffer, so, all we need to do is
+		 * acquire lock on it.
+		 */
+		havePin = true;
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+	{
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+	}
+
+	/*
+	 * If page LSN differs it means that the page was modified since the last
+	 * read. killedItems could be not valid so applying LP_DEAD hints is not
+	 * safe.
+	 */
+	page = BufferGetPage(buf);
+	if (PageGetLSN(page) != so->currPos.lsn)
+	{
+		if (havePin)
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		else
+			_hash_relbuf(rel, buf);
+		return;
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int			itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -578,6 +638,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 72fce30..3e90b89 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,53 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
+}			HashScanPosData;
+
+#define HashScanPosIsPinned(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BufferIsValid((scanpos).buf) \
+)
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +192,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem *killedItems;	/* tids and offset numbers of killed items */
+	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them in
+	 * HashScanPosData
+	 */
+	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.4.msysgit.0
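
The net effect of the rewrite above is that hashgettuple() only decides
whether to start or to continue the scan, while all page access now
happens inside _hash_first()/_hash_next(). A simplified sketch of the
resulting control flow, condensed from the hunk above (the killedItems
allocation and other details are omitted):

	if (!HashScanPosIsValid(so->currPos))
		res = _hash_first(scan, dir);	/* load a page of matches into so->currPos */
	else
	{
		/* remember the previously returned item if it should be killed */
		if (scan->kill_prior_tuple && so->numKilled < MaxIndexTuplesPerPage)
			so->killedItems[so->numKilled++] = so->currPos.itemIndex;

		res = _hash_next(scan, dir);	/* advance itemIndex; read next page if needed */
	}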

0002-Remove-redundant-hash-function-_hash_step-and-do-som.patch (application/octet-stream)
From 668cd50fb686dce91c6c68576dcd4d0c5d6b2b28 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Wed, 23 Aug 2017 16:01:41 +0530
Subject: [PATCH 2/3] Remove redundant hash function _hash_step and do some
 code cleanup.

Remove the redundant function _hash_step() and some of the unused members
of HashScanOpaqueData. _hash_step(), which used to find the next qualifying
tuple in the index page, is no longer required because the new hash index
scan works page at a time, i.e. it reads all the qualifying tuples in a
page at once with the help of _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 0c3338a..c379a3b 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -438,212 +438,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3e90b89..19fb147 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -159,14 +159,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -177,12 +169,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -432,7 +418,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.4.msysgit.0

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7.patch (application/octet-stream)
From 7f557dc99190469218f769042a695b777a40bc93 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Wed, 23 Aug 2017 16:15:49 +0530
Subject: [PATCH 3/3] Improve locking startegy during VACUUM in Hash Index for
 regular tables.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README     | 28 +++++++++++-------------
 src/backend/access/hash/hash.c     | 44 ++++++++++++++++++++++++++------------
 src/backend/access/hash/hashovfl.c | 13 +++++++----
 3 files changed, 51 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 4465605..8921c78 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
@@ -415,21 +415,17 @@ The fourth operation is garbage collection (bulk deletion):
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
 scan, but that does no harm.  As we release the lock on bucket page during
-cleanup scan of a bucket, it will allow concurrent scan to start on a bucket
-and ensures that scan will always be behind cleanup.  It is must to keep scans
-behind cleanup, else vacuum could decrease the TIDs that are required to
-complete the scan.  Now, as the scan that returns multiple tuples from the
-same bucket page always expect next valid TID to be greater than or equal to
-the current TID, it might miss the tuples.  This holds true for backward scans
-as well (backward scans first traverse each bucket starting from first bucket
-to last overflow page in the chain).  We must be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count if
-the stored tuple count and number of buckets did *not* change at any time
-during the scan.  This provides a way of correcting the stored tuple count if
-it gets out of sync for some reason.  But if a split or insertion does occur
-concurrently, the scan count is untrustworthy; instead, subtract the number of
-tuples deleted from the stored tuple count and use that.
+cleanup scan of a bucket, it allows a concurrent scan to start on the bucket.
+It is quite possible that scans get ahead of vacuum and that vacuum removes
+some items from the current page being scanned, but that does no harm, as we
+always copy all the matching items from a page at once into a backend-local array.
+We must be careful about the statistics reported by the VACUUM operation.  What
+we can do is count the number of tuples scanned, and believe this in preference
+to the stored tuple count if the stored tuple count and number of buckets did
+*not* change at any time during the scan.  This provides a way of correcting the
+stored tuple count if it gets out of sync for some reason.  But if a split or
+insertion does occur concurrently, the scan count is untrustworthy; instead,
+subtract the number of tuples deleted from the stored tuple count and use that.
 
 
 Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 45a3a5a..012e00f 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -660,11 +660,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be.  We
+ * do not need to worry about that, though, because hash index scans now
+ * work in page-at-a-time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -833,18 +831,36 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode, vacuum can
+		 * release the lock on the previous page before acquiring the lock on
+		 * the next page for regular tables.  For unlogged tables we avoid
+		 * this, as we do not want a scan to overtake vacuum while both are
+		 * running on the same bucket page; that keeps the dead-marking of
+		 * index tuples in _hash_kill_items() safe.
 		 */
-		if (retain_pin)
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		if (RelationNeedsWAL(rel))
+		{
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+		}
 		else
-			_hash_relbuf(rel, buf);
+		{
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+		}
 
 		buf = next_buf;
 	}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e70..b41afbb 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -524,7 +524,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
-	 * described atop hashbucketcleanup.
+	 * described atop _hash_squeezebucket.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
@@ -790,9 +790,14 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here. This means there
+ *	can't be any concurrent scans in progress when we first enter this
+ *	function because of the cleanup lock we hold on the primary bucket page,
+ *	but as soon as we release that lock, there might be.  To prevent any
+ *	concurrent scan from crossing the squeeze scan we use lock chaining, i.e.
+ *	we lock the next page in the bucket chain before releasing the lock on
+ *	the previous page. (This type of lock chaining is not ideal, so we might
+ *	want to look for a better solution at some point.)
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.4.msysgit.0

#36Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#35)
Re: Page Scan Mode in Hash Index

On 08/23/2017 07:38 AM, Amit Kapila wrote:

Thanks for the new version. I again looked at the patches and fixed
quite a few comments in the code and ReadMe. You have forgotten to
update README for the changes in vacuum patch
(0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7). I
don't have anything more to add. If you are okay with changes, then
we can move it to Ready For Committer unless someone else has some
more comments.

Just some minor comments.

README:
+ it's pin till the end of scan)

its pin till the end of the scan)

+To minimize lock/unlock traffic, hash index scan always searches entire
hash

To minimize lock/unlock traffic, hash index scan always searches the
entire hash

hashsearch.c:

+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);

There are other instances of "inline" in the code base, so I guess that
this is ok.

+ * Advance to next tuple on current page; or if there's no more, try to

Advance to the next tuple on the current page; or if done, try to

Best regards,
Jesper


#37Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#35)
Re: Page Scan Mode in Hash Index

Hi Amit,

On Wed, Aug 23, 2017 at 5:08 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 22, 2017 at 7:24 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

On Tue, Aug 22, 2017 at 3:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I got your point now.  Currently, in _hash_kill_items(), if an
overflow page is pinned we do not check whether it has been modified since
the last read.  Hence, if vacuum runs on a pinned overflow page that also
has some dead tuples in it, it could create a problem for the scan when the
scan later attempts to mark the killed items as dead.  To get rid of this
problem, I think that even if an overflow page is pinned we should check
whether it has been modified since the last read was performed on the page.
If it has, the scan should not be allowed to mark the killed items as dead.
Attached is the newer version with these changes along with some other
cosmetic changes mentioned in your earlier email.  Thanks.

Thanks for the new version. I again looked at the patches and fixed
quite a few comments in the code and ReadMe. You have forgotten to
update README for the changes in vacuum patch
(0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7). I
don't have anything more to add. If you are okay with changes, then
we can move it to Ready For Committer unless someone else has some
more comments.

Thanks for reviewing my patches.  I've gone through the changes you made in
the README file and the few changes in the code comments, and they look
valid to me.  However, it seems there are some more minor review comments
from Jesper, which I will fix before sharing the new set of patches shortly.
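
In rough terms, the modification check described above amounts to the
following sketch (the real code is in the v14 patch attached downthread;
this is only the shape of the guard, with names taken from that patch, not
the exact patch text):

	/*
	 * Remember the page LSN when the page is read into so->currPos, and
	 * refuse to apply LP_DEAD hints later if the LSN has moved, since the
	 * remembered killed-item offsets may no longer be valid.
	 */
	page = BufferGetPage(buf);
	if (PageGetLSN(page) != so->currPos.lsn)
	{
		if (havePin)
			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
		else
			_hash_relbuf(rel, buf);
		return;
	}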

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


#38Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Jesper Pedersen (#36)
3 attachment(s)
Re: Page Scan Mode in Hash Index

On Wed, Aug 23, 2017 at 7:39 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 08/23/2017 07:38 AM, Amit Kapila wrote:

Thanks for the new version. I again looked at the patches and fixed
quite a few comments in the code and ReadMe. You have forgotten to
update README for the changes in vacuum patch
(0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7). I
don't have anything more to add. If you are okay with changes, then
we can move it to Ready For Committer unless someone else has some
more comments.

Just some minor comments.

Thanks for the review.

README:
+ it's pin till the end of scan)

its pin till the end of the scan)

Corrected.

+To minimize lock/unlock traffic, hash index scan always searches entire
hash

To minimize lock/unlock traffic, hash index scan always searches the entire
hash

Done.

hashsearch.c:

+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+                          OffsetNumber offnum, IndexTuple itup);

There are other instances of "inline" in the code base, so I guess that this
is ok.

+ * Advance to next tuple on current page; or if there's no more, try
to

Advance to the next tuple on the current page; or if done, try to

Done.

Attached are the patches with the above changes.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v14.patch (text/x-patch)
From 501d48ef3b566569c687a9a4ac4b239b2278c789 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Thu, 24 Aug 2017 10:19:58 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a time.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 146 ++----------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 426 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  74 +++++-
 src/include/access/hash.h            |  55 ++++-
 6 files changed, 549 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..3b1f719 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+	the pin on the primary bucket throughout the scan)
+	save all the matching tuples from the current index page into an items array
+	release pin and content lock (but if it is the primary bucket page, retain
+	its pin till the end of the scan)
+	get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage.  The heap tuple IDs are then processed while
+not holding any page lock within the index, thereby allowing concurrent
+insertions to happen on the same index page without any need to re-find the
+current scan position for the reader.  We do continue to hold a pin on the
+bucket page, to protect against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d89c192..45a3a5a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,21 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +296,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-					ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +308,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-
-	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
 
 	return res;
 }
@@ -396,35 +322,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns dead tuples. Therefore,
+		 * we can always add the tuples into TIDBitmap without checking if a
+		 * tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +360,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,22 +385,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -514,15 +418,11 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 7b2906b..e050a2a 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 3e461ad..a42661a 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,107 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			   ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			   Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple. so->currPos is updated as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with pin
+ *		held on bucket page but no pins or locks held on overflow
+ *		page.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to the next tuple on the current page; or if done, try to read
+	 * data from the next or previous page based on the scan direction. Before
+	 * moving to the next or previous page make sure that we deal with all the
+	 * killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on the bucket page for the whole
+				 * scan, so release the additional pin we have acquired here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +275,18 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in the
+ *		index that satisfies the qualification associated with the scan
+ *		descriptor.
+ *
+ *		On successful exit, if the page containing the current index tuple is
+ *		an overflow page, both its pin and lock are released; if it is a bucket
+ *		page, it stays pinned but not locked.  In either case the data about
+ *		the matching tuple(s) on the page has been loaded into so->currPos and
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		On failure exit (no more tuples), we return FALSE, with pin held on
+ *		bucket page but no pins or locks held on overflow page.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +299,10 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
+	HashScanPosItem *currItem;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +421,19 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
+
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
 		return false;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
+	/* if we're here, _hash_readpage found at least one valid tuple */
 	return true;
 }
 
@@ -575,3 +642,280 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save those that
+ *	satisfy the qualification into so->currPos.  If no matching items are
+ *	found in the current page, we move to the next or previous page in the
+ *	bucket chain as indicated by the scan direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum;
+	uint16		itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		BlockNumber prev_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != 0)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the next page. Before leaving the current page, deal with any
+			 * killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+				prev_blkno = opaque->hasho_prevblkno;
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				continue;
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = prev_blkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		BlockNumber next_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != MaxIndexTuplesPerPage)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the previous page. Before leaving the current page, deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				next_blkno = opaque->hasho_nextblkno;
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				continue;
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = next_blkno;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	if (so->currPos.buf == so->hashso_bucket_buf ||
+		so->currPos.buf == so->hashso_split_bucket_buf)
+	{
+		so->currPos.prevPage = InvalidBlockNumber;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	}
+	else
+	{
+		so->currPos.prevPage = opaque->hasho_prevblkno;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		_hash_relbuf(rel, so->currPos.buf);
+		so->currPos.buf = InvalidBuffer;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from the current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple	itup;
+	int			itemIndex;
+	OffsetNumber maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist in this page. so, exit while
+				 * No more matching tuples exist in this page, so exit the
+				 * while loop.
+				break;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist in this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9b803af..3bf1b97 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,30 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep the pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from that page.  However,
+ * having pin on the overflow page doesn't guarantee that vacuum won't delete
+ * any items.
+ *
+ * See _bt_killitems() for more details.
  */
 void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
@@ -536,9 +553,11 @@ _hash_kill_items(IndexScanDesc scan)
 	int			numKilled = so->numKilled;
 	int			i;
 	bool		killedsomething = false;
+	bool		havePin = false;
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same items on other
@@ -546,20 +565,61 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (HashScanPosIsPinned(so->currPos))
+	{
+		/*
+		 * We already have pin on this buffer, so, all we need to do is
+		 * acquire lock on it.
+		 */
+		havePin = true;
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+	{
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+	}
+
+	/*
+	 * If the page LSN differs, the page was modified since we last read it;
+	 * killedItems may no longer be valid, so applying LP_DEAD hints is not
+	 * safe.
+	 */
+	page = BufferGetPage(buf);
+	if (PageGetLSN(page) != so->currPos.lsn)
+	{
+		if (havePin)
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		else
+			_hash_relbuf(rel, buf);
+		return;
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int			itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -578,6 +638,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 72fce30..3e90b89 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,53 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
+}			HashScanPosData;
+
+#define HashScanPosIsPinned(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BufferIsValid((scanpos).buf) \
+)
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
	} while (0)
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +192,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem *killedItems;	/* tids and offset numbers of killed items */
+	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them in
+	 * HashScanPosData
+	 */
+	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-hash-function-_hash_step-and-do-som.patch (text/x-patch)
From ef4180ffcaea44054d5b4894240be804c3970c6d Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Mon, 7 Aug 2017 16:22:19 +0530
Subject: [PATCH] Remove redundant hash function _hash_step and do some code
 cleanup.

Remove the redundant function _hash_step() and some of the unused members
of HashScanOpaqueData. The function _hash_step(), which was used to find the
next qualifying tuple in the index page, is no longer required because the
new hash index scan works page at a time, reading all the qualifying tuples
in a page at once with the help of _hash_readpage().

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index f4408ab..58eb108 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -431,212 +431,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3e90b89..19fb147 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -159,14 +159,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -177,12 +169,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -432,7 +418,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7.patch (text/x-patch)
From 7f557dc99190469218f769042a695b777a40bc93 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Wed, 23 Aug 2017 16:15:49 +0530
Subject: [PATCH 3/3] Improve locking startegy during VACUUM in Hash Index for
 regular tables.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README     | 28 +++++++++++-------------
 src/backend/access/hash/hash.c     | 44 ++++++++++++++++++++++++++------------
 src/backend/access/hash/hashovfl.c | 13 +++++++----
 3 files changed, 51 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 4465605..8921c78 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
@@ -415,21 +415,17 @@ The fourth operation is garbage collection (bulk deletion):
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
 scan, but that does no harm.  As we release the lock on bucket page during
-cleanup scan of a bucket, it will allow concurrent scan to start on a bucket
-and ensures that scan will always be behind cleanup.  It is must to keep scans
-behind cleanup, else vacuum could decrease the TIDs that are required to
-complete the scan.  Now, as the scan that returns multiple tuples from the
-same bucket page always expect next valid TID to be greater than or equal to
-the current TID, it might miss the tuples.  This holds true for backward scans
-as well (backward scans first traverse each bucket starting from first bucket
-to last overflow page in the chain).  We must be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count if
-the stored tuple count and number of buckets did *not* change at any time
-during the scan.  This provides a way of correcting the stored tuple count if
-it gets out of sync for some reason.  But if a split or insertion does occur
-concurrently, the scan count is untrustworthy; instead, subtract the number of
-tuples deleted from the stored tuple count and use that.
+cleanup scan of a bucket, it will allow a concurrent scan to start on the
+bucket.  It is quite possible that scans get ahead of vacuum and that vacuum
+removes some items from the page currently being scanned, but that does no
+harm as we always copy all the matching items from a page at once into a
+backend-local array.
+We must be careful about the statistics reported by the VACUUM operation.  What
+we can do is count the number of tuples scanned, and believe this in preference
+to the stored tuple count if the stored tuple count and number of buckets did
+*not* change at any time during the scan.  This provides a way of correcting the
+stored tuple count if it gets out of sync for some reason.  But if a split or
+insertion does occur concurrently, the scan count is untrustworthy; instead,
+subtract the number of tuples deleted from the stored tuple count and use that.
 
 
 Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 45a3a5a..012e00f 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -660,11 +660,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be.  We
+ * need not worry about that, however, because hash index scans work in
+ * page-at-a-time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -833,18 +831,36 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode, vacuum can
+		 * release the lock on the previous page before acquiring the lock on
+		 * the next page for regular tables.  For unlogged tables we avoid
+		 * this, as we do not want a scan to overtake vacuum while both are
+		 * running on the same bucket page; that keeps the dead-marking of
+		 * index tuples in _hash_kill_items() safe.
 		 */
-		if (retain_pin)
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		if (RelationNeedsWAL(rel))
+		{
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+		}
 		else
-			_hash_relbuf(rel, buf);
+		{
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+		}
 
 		buf = next_buf;
 	}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e70..b41afbb 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -524,7 +524,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
-	 * described atop hashbucketcleanup.
+	 * described atop _hash_squeezebucket.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
@@ -790,9 +790,14 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here. This means there
+ *	can't be any concurrent scans in progress when we first enter this
+ *	function because of the cleanup lock we hold on the primary bucket page,
+ *	but as soon as we release that lock, there might be.  To prevent any
+ *	concurrent scan from crossing the squeeze scan, we use lock chaining,
+ *	i.e. we lock the next page in the bucket chain before releasing the
+ *	lock on the previous page.  (This type of lock chaining is not ideal,
+ *	so we might want to look for a better solution at some point.)
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.4.msysgit.0

#39Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Ashutosh Sharma (#38)
Re: Page Scan Mode in Hash Index

On 08/24/2017 01:21 AM, Ashutosh Sharma wrote:

Done.

Attached are the patches with above changes.

Thanks !

Based on the feedback in this thread, I have moved the patch to "Ready
for Committer".

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Robert Haas
robertmhaas@gmail.com
In reply to: Jesper Pedersen (#39)
Re: Page Scan Mode in Hash Index

On Thu, Aug 24, 2017 at 11:26 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

Based on the feedback in this thread, I have moved the patch to "Ready for
Committer".

Reviewing 0001:

_hash_readpage gets the page LSN to see if we can apply LP_DEAD hints,
but if the table is unlogged or temporary, the LSN will never change,
so the test in _hash_kill_items will always think that it's OK to
apply the hints. (This seems like it might be a pretty serious
problem, because I'm not sure what would be a viable workaround.)
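
To make the concern concrete, the hint-application guard is essentially of
this shape (a rough sketch, not the patch's exact code; the helper name is
made up):

    page = BufferGetPage(buf);
    if (PageGetLSN(page) == so->currPos.lsn)
    {
        /*
         * For a WAL-logged relation, any modification since we read the
         * page bumps its LSN, so equality means the remembered offsets are
         * still trustworthy.  For an unlogged or temporary relation the
         * page LSN never changes, so this test passes even if the page has
         * been rearranged underneath us.
         */
        _hash_mark_killed_items(so, page);  /* hypothetical helper */
    }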

The logic that tries to ensure that so->currPos.{buf,currPage,lsn} get
updated is not, to my eyes, obviously correct. Generally, the logic
for this stuff seems unnaturally spread out to me. For example,
_hash_next() updates currPos.buf, but leaves it to _hash_readpage to
set currPage and lsn. That function also sets all three fields when
it advances to the next page by calling _hash_readnext(), but when it
tries to advance to the next page and doesn't find one it sets buf but
not currPage or lsn. It seems to me that this logic should be more
centralized somehow. Can we arrange things so that at least buf,
currPage, and lsn, and maybe also nextPage and prevPage, get updated
at the same time and as soon after reading the buffer as possible?
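
For example, a small helper along these lines (just a sketch to illustrate
the idea; _hash_savepos is not a function in the patch) would let every call
site set those fields together, immediately after reading the buffer:

    static inline void
    _hash_savepos(HashScanOpaque so, Buffer buf)
    {
        Page    page = BufferGetPage(buf);

        /* record the buffer, block number and LSN of the page just read */
        so->currPos.buf = buf;
        so->currPos.currPage = BufferGetBlockNumber(buf);
        so->currPos.lsn = PageGetLSN(page);
    }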

It would be bad if a primary bucket page's hasho_prevblkno field got
copied into so->currPos.prevpage, because the value that appears for
the primary bucket is not a real block number. But _hash_readpage
seems like it can bring about this exact situation, because of this
code:

+            if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+                prev_blkno = opaque->hasho_prevblkno;
...
+                so->currPos.prevPage = prev_blkno;

If we're reading the primary bucket page and there are no overflow
pages, hasho_nextblkno will not be valid and hasho_prevblkno won't be
a real block number.

Incidentally, the "if" statement in the above block of code is
probably not saving anything; I would suggest for clarity that you do
the assignment unconditionally (but this is just a matter of style, so
I don't feel super-strongly about it).
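
Either way, the guard that actually matters is whether the page we just
scanned is the primary bucket page, roughly like this (sketch only):

    opaque = (HashPageOpaque) PageGetSpecialPointer(page);

    /*
     * On the primary bucket page, hasho_prevblkno stores split-related
     * bookkeeping rather than a block number, so it must not be copied
     * into the scan position.
     */
    if (opaque->hasho_flag & LH_BUCKET_PAGE)
        so->currPos.prevPage = InvalidBlockNumber;
    else
        so->currPos.prevPage = opaque->hasho_prevblkno;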

+ return (so->currPos.firstItem <= so->currPos.lastItem);

Can this ever return false? It looks to me like we should never reach
this code unless that condition holds, and that it would be a bug if
we did. If that's correct, maybe this should be an Assert() and the
return statement should just return true.
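
That is, something like (sketch):

    /* by this point we must have saved at least one matching item */
    Assert(so->currPos.firstItem <= so->currPos.lastItem);

    return true;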

+        buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+        /* It might not exist anymore; in which case we can't hint it. */
+        if (!BufferIsValid(buf))
+            return;

This is dead code, because _hash_getbuf always returns a valid buffer.
If there's actually a risk of the buffer disappearing, then some other
handling is needed for this case. But I suspect that, since a scan
always holds a pin on the primary bucket, there's actually no way for
this to happen and this is just dead code.

The comment in hashgetbitmap claims that _hash_first() or _hash_next()
never returns dead tuples. If that were true, it would be a bug,
because then scans started during recovery would return wrong answers.
A more accurate statement would be something like: _hash_first and
_hash_next handle eliminating dead index entries whenever
scan->ignore_killed_tuples is true. Therefore, there's nothing to do
here except add the results to the TIDBitmap.

_hash_readpage contains unnecessary "continue" statements inside the
loops. The reason that they're unnecessary is that there's no code
below that in the loop anyway, so the loop is already going to go back
around to the top. Whether to change this is a matter of style, so I
won't complain too much if you want to leave it the way it is, but
personally I'd favor removing them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#40)
Re: Page Scan Mode in Hash Index

On Tue, Sep 19, 2017 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 24, 2017 at 11:26 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

Based on the feedback in this thread, I have moved the patch to "Ready for
Committer".

Reviewing 0001:

_hash_readpage gets the page LSN to see if we can apply LP_DEAD hints,
but if the table is unlogged or temporary, the LSN will never change,
so the test in _hash_kill_items will always think that it's OK to
apply the hints. (This seems like it might be a pretty serious
problem, because I'm not sure what would be a viable workaround.)

This point has been discussed above [1], and to avoid this problem we
are keeping the scan always behind vacuum for unlogged and temporary
tables, as we do without this patch. That will ensure vacuum won't be
able to remove the TIDs which we are going to mark as dead. This has
been taken care of in patch 0003. I understand that this is slightly
ugly, but the other alternative (as mentioned in the email [1]) ...

[1]: /messages/by-id/CAA4eK1J6xiJUOidBaOt0iPsAdS0+p5PoKFf1R2yVjTwrY_4snA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Robert Haas (#40)
3 attachment(s)
Re: Page Scan Mode in Hash Index

Thanks for all your review comments. Please find my comments in-line.

On Tue, Sep 19, 2017 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 24, 2017 at 11:26 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

Based on the feedback in this thread, I have moved the patch to "Ready for
Committer".

Reviewing 0001:

_hash_readpage gets the page LSN to see if we can apply LP_DEAD hints,
but if the table is unlogged or temporary, the LSN will never change,
so the test in _hash_kill_items will always think that it's OK to
apply the hints. (This seems like it might be a pretty serious
problem, because I'm not sure what would be a viable workaround.)

Amit has already replied to this query up-thread.

The logic that tries to ensure that so->currPos.{buf,currPage,lsn} get
updated is not, to my eyes, obviously correct. Generally, the logic
for this stuff seems unnaturally spread out to me. For example,
_hash_next() updates currPos.buf, but leaves it to _hash_readpage to
set currPage and lsn. That function also sets all three fields when
it advances to the next page by calling _hash_readnext(), but when it
tries to advance to the next page and doesn't find one it sets buf but
not currPage or lsn. It seems to me that this logic should be more
centralized somehow. Can we arrange things so that at least buf,
currPage, and lsn, and maybe also nextPage and prevPage, get updated
at the same time and as soon after reading the buffer as possible?

Okay, I have tried to update currPos.{buf, currPage, lsn} in
_hash_readpage() at the same time. Please have a look into the
attached 0001*.patch.

When _hash_readpage() doesn't find any qualifying tuples, i.e. when
_hash_readnext() returns an invalid buffer, we update only prevPage,
nextPage and buf in currPos (not currPage or lsn), because currPage and
lsn should keep pointing to the last page in the hash bucket so that we
can mark the killed items as dead at the end of the scan (with the help
of _hash_kill_items). Hence, we keep currPage and lsn as they are if no
more valid hash pages are found.

It would be bad if a primary bucket page's hasho_prevblkno field got
copied into so->currPos.prevpage, because the value that appears for
the primary bucket is not a real block number. But _hash_readpage
seems like it can bring about this exact situation, because of this
code:

+            if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+                prev_blkno = opaque->hasho_prevblkno;
...
+                so->currPos.prevPage = prev_blkno;

If we're reading the primary bucket page and there are no overflow
pages, hasho_nextblkno will not be valid and hasho_prevblkno won't be
a real block number.

Fixed. Thanks for putting that point.

Incidentally, the "if" statement in the above block of code is
probably not saving anything; I would suggest for clarity that you do
the assignment unconditionally (but this is just a matter of style, so
I don't feel super-strongly about it).

+ return (so->currPos.firstItem <= so->currPos.lastItem);

Can this ever return false? It looks to me like we should never reach
this code unless that condition holds, and that it would be a bug if
we did. If that's correct, maybe this should be an Assert() and the
return statement should just return true.

No, it will never return FALSE. I have changed it to an Assert statement.

+        buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+        /* It might not exist anymore; in which case we can't hint it. */
+        if (!BufferIsValid(buf))
+            return;

This is dead code, because _hash_getbuf always returns a valid buffer.
If there's actually a risk of the buffer disappearing, then some other
handling is needed for this case. But I suspect that, since a scan
always holds a pin on the primary bucket, there's actually no way for
this to happen and this is just dead code.

Removed the redundant code.

The comment in hashgetbitmap claims that _hash_first() or _hash_next()
never returns dead tuples. If that were true, it would be a bug,
because then scans started during recovery would return wrong answers.
A more accurate statement would be something like: _hash_first and
_hash_next handle eliminating dead index entries whenever
scan->ignore_killed_tuples is true. Therefore, there's nothing to do
here except add the results to the TIDBitmap.

Corrected.

_hash_readpage contains unnecessary "continue" statements inside the
loops. The reason that they're unnecessary is that there's no code
below that in the loop anyway, so the loop is already going to go back
around to the top. Whether to change this is a matter of style, so I
won't complain too much if you want to leave it the way it is, but
personally I'd favor removing them.

Ohh no, that's a silly mistake. I have corrected it.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v15.patch (text/x-patch; charset=US-ASCII)
From 38d41fd97f2f83f1e8aa7c2f62665c5b9279cbaa Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Wed, 20 Sep 2017 13:52:21 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a time.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 146 ++----------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 430 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  67 +++++-
 src/include/access/hash.h            |  55 ++++-
 6 files changed, 546 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..3b1f719 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+	the pin on the primary bucket throughout the scan)
+	save all the matching tuples from the current index page into an items array
+	release pin and content lock (but if it is the primary bucket page, retain
+	its pin till the end of the scan)
+	get tuples from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage.  The heap tuple IDs are then processed while
+not holding any page lock within the index, thereby allowing concurrent
+insertions into the same index page without any need to re-find the reader's
+current scan position.  We do continue to hold a pin on the bucket page, to
+protect against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d89c192..8550218 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,21 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +296,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-					ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +308,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-
-	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
 
 	return res;
 }
@@ -396,35 +322,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first and _hash_next handle eliminating dead index entries
+		 * whenever scan->ignore_killed_tuples is true.  Therefore, there's
+		 * nothing to do here except add the results to the TIDBitmap.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +360,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,22 +385,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -514,15 +418,11 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 0579841..f279dce 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 3e461ad..e3db672 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,105 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			   ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			   Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple. so->currPos is updated as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with a pin
+ *		held on the bucket page but no pins or locks held on overflow
+ *		pages.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to the next tuple on the current page; or if done, try to read
+	 * data from the next or previous page based on the scan direction. Before
+	 * moving to the next or previous page make sure that we deal with all the
+	 * killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on the bucket page for the whole
+				 * scan operation, so release the additional pin we have
+				 * acquired here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +273,18 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in the
+ *		index that satisfies the qualification associated with the scan
+ *		descriptor.
+ *
+ *		On successful exit, if the page containing the current index tuple is
+ *		an overflow page, both pin and lock are released, whereas if it is a
+ *		bucket page it remains pinned but not locked.  Data about the matching
+ *		tuple(s) on the page has been loaded into so->currPos, and
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		On failure exit (no more tuples), we return FALSE, with a pin held on
+ *		the bucket page but no pins or locks held on overflow pages.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +297,10 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
+	HashScanPosItem *currItem;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +419,19 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
+
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
 		return false;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
+	/* if we're here, _hash_readpage found at least one valid tuple */
 	return true;
 }
 
@@ -575,3 +640,286 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if it satisfies the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum;
+	uint16		itemIndex;
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	so->currPos.buf = buf;
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+	so->currPos.currPage = BufferGetBlockNumber(buf);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		BlockNumber prev_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != 0)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the next page. Before leaving the current page, deal with any
+			 * killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+			{
+				if (so->currPos.buf == so->hashso_bucket_buf ||
+					so->currPos.buf == so->hashso_split_bucket_buf)
+					prev_blkno = InvalidBlockNumber;
+				else
+					prev_blkno = opaque->hasho_prevblkno;
+			}
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = prev_blkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		BlockNumber next_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != MaxIndexTuplesPerPage)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the previous page. Before leaving the current page, deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				next_blkno = opaque->hasho_nextblkno;
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = next_blkno;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	if (so->currPos.buf == so->hashso_bucket_buf ||
+		so->currPos.buf == so->hashso_split_bucket_buf)
+	{
+		so->currPos.prevPage = InvalidBlockNumber;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	}
+	else
+	{
+		so->currPos.prevPage = opaque->hasho_prevblkno;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		_hash_relbuf(rel, so->currPos.buf);
+		so->currPos.buf = InvalidBuffer;
+	}
+
+	Assert(so->currPos.firstItem <= so->currPos.lastItem);
+	return true;
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple	itup;
+	int			itemIndex;
+	OffsetNumber maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * Skip tuples that were moved by a split operation if the scan
+			 * started while the split was in progress.  Also, skip tuples
+			 * that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * Skip tuples that were moved by a split operation if the scan
+			 * started while the split was in progress.  Also, skip tuples
+			 * that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 869cbc1..a825b82 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,30 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep the pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from that page.  However,
+ * having a pin on an overflow page doesn't guarantee that vacuum won't
+ * delete any items from it.
+ *
+ * See _bt_killitems() for more details.
  */
 void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
@@ -536,9 +553,11 @@ _hash_kill_items(IndexScanDesc scan)
 	int			numKilled = so->numKilled;
 	int			i;
 	bool		killedsomething = false;
+	bool		havePin = false;
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same items on other
@@ -546,20 +565,54 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (HashScanPosIsPinned(so->currPos))
+	{
+		/*
+		 * We already have pin on this buffer, so, all we need to do is
+		 * acquire lock on it.
+		 */
+		havePin = true;
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+	/*
+	 * If the page LSN differs, the page was modified since we last read it.
+	 * In that case killedItems may no longer be valid, so applying LP_DEAD
+	 * hints is not safe.
+	 */
+	page = BufferGetPage(buf);
+	if (PageGetLSN(page) != so->currPos.lsn)
+	{
+		if (havePin)
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		else
+			_hash_relbuf(rel, buf);
+		return;
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int			itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -578,6 +631,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index c06dcb2..ce2124a 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -114,6 +114,53 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
+}			HashScanPosData;
+
+#define HashScanPosIsPinned(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BufferIsValid((scanpos).buf) \
+)
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -156,8 +203,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem *killedItems;	/* tids and offset numbers of killed items */
+	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them in
+	 * HashScanPosData
+	 */
+	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-hash-function-_hash_step-and-do-som.patch (text/x-patch; charset=US-ASCII)
From ef4180ffcaea44054d5b4894240be804c3970c6d Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Mon, 7 Aug 2017 16:22:19 +0530
Subject: [PATCH] Remove redundant hash function _hash_step and do some code
 cleanup.

Remove redundant function _hash_step() and some of the unused members
of HashScanOpaqueData. The function _hash_step(), used to find the next
qualifying tuple in the index page, is no longer required because the
new hash index scan works page at a time, i.e. it reads all the
qualifying tuples in a page at once with the help of _hash_readpage().

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index f4408ab..58eb108 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -431,212 +431,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3e90b89..19fb147 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -159,14 +159,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -177,12 +169,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -432,7 +418,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7.patch (text/x-patch; charset=US-ASCII)
From 7f557dc99190469218f769042a695b777a40bc93 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Wed, 23 Aug 2017 16:15:49 +0530
Subject: [PATCH 3/3] Improve locking strategy during VACUUM in Hash Index for
 regular tables.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README     | 28 +++++++++++-------------
 src/backend/access/hash/hash.c     | 44 ++++++++++++++++++++++++++------------
 src/backend/access/hash/hashovfl.c | 13 +++++++----
 3 files changed, 51 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 4465605..8921c78 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
@@ -415,21 +415,17 @@ The fourth operation is garbage collection (bulk deletion):
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
 scan, but that does no harm.  As we release the lock on bucket page during
-cleanup scan of a bucket, it will allow concurrent scan to start on a bucket
-and ensures that scan will always be behind cleanup.  It is must to keep scans
-behind cleanup, else vacuum could decrease the TIDs that are required to
-complete the scan.  Now, as the scan that returns multiple tuples from the
-same bucket page always expect next valid TID to be greater than or equal to
-the current TID, it might miss the tuples.  This holds true for backward scans
-as well (backward scans first traverse each bucket starting from first bucket
-to last overflow page in the chain).  We must be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count if
-the stored tuple count and number of buckets did *not* change at any time
-during the scan.  This provides a way of correcting the stored tuple count if
-it gets out of sync for some reason.  But if a split or insertion does occur
-concurrently, the scan count is untrustworthy; instead, subtract the number of
-tuples deleted from the stored tuple count and use that.
+cleanup scan of a bucket, a concurrent scan is allowed to start on that bucket.
+It is quite possible that a scan gets ahead of vacuum and vacuum removes some
+items from the current page being scanned, but that does no harm as we always
+copy all the matching items from a page at once into a backend-local array.
+We must be careful about the statistics reported by the VACUUM operation.  What
+we can do is count the number of tuples scanned, and believe this in preference
+to the stored tuple count if the stored tuple count and number of buckets did
+*not* change at any time during the scan.  This provides a way of correcting the
+stored tuple count if it gets out of sync for some reason.  But if a split or
+insertion does occur concurrently, the scan count is untrustworthy; instead,
+subtract the number of tuples deleted from the stored tuple count and use that.
 
 
 Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 45a3a5a..012e00f 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -660,11 +660,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be.  But
+ * that is not a problem, because hash index scans now work in
+ * page-at-a-time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -833,18 +831,36 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As hash index scans now work in page-at-a-time mode, vacuum can
+		 * release the lock on the previous page before acquiring the lock on
+		 * the next page for regular tables.  For unlogged and temporary
+		 * tables, however, we avoid this, as we do not want a scan to get
+		 * ahead of vacuum on the same bucket; that keeps the dead-marking of
+		 * index tuples in _hash_kill_items() safe.
 		 */
-		if (retain_pin)
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		if (RelationNeedsWAL(rel))
+		{
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+		}
 		else
-			_hash_relbuf(rel, buf);
+		{
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+		}
 
 		buf = next_buf;
 	}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e70..b41afbb 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -524,7 +524,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
-	 * described atop hashbucketcleanup.
+	 * described atop _hash_squeezebucket.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
@@ -790,9 +790,14 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here.  This means there
+ *	can't be any concurrent scans in progress when we first enter this
+ *	function because of the cleanup lock we hold on the primary bucket page,
+ *	but as soon as we release that lock, there might be.  To prevent any
+ *	concurrent scan from crossing the squeeze scan, we use lock chaining,
+ *	i.e. we lock the next page in the bucket chain before releasing the lock
+ *	on the previous page.  (This type of lock chaining is not ideal, so we
+ *	might want to look for a better solution at some point.)
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.4.msysgit.0

#43Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#41)
Re: Page Scan Mode in Hash Index

On Tue, Sep 19, 2017 at 11:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

This point has been discussed above [1] and to avoid this problem we
are keeping the scan always behind vacuum for unlogged and temporary
tables as we are doing without this patch. That will ensure vacuum
won't be able to remove the TIDs which we are going to mark as dead.
This has been taken care in patch 0003. I understand that this is
slightly ugly, but the other alternative (as mentioned in the email
[1]) is much worse.

Hmm. So if I understand correctly, you're saying that the LSN check
in patch 0001 is actually completely unnecessary if we only apply
0001, but is needed in preparation for 0003, after which it will
really be doing something?

In more detail, I suppose the idea is: a TID cannot be reused until a
VACUUM has intervened; VACUUM always visits every data page in the
index; we won't allow a scan to pass VACUUM (and thus possibly have
one of its TIDs get reused) except when the LSN check is actually
sufficient to guarantee no TID reuse (i.e. table is not unlogged).
Page-at-a-time index vacuum as in _hash_vacuum_one_page doesn't matter
because such an operation doesn't allow TIDs to be reused.

Did I get it right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#43)
Re: Page Scan Mode in Hash Index

On Wed, Sep 20, 2017 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 19, 2017 at 11:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

This point has been discussed above [1] and to avoid this problem we
are keeping the scan always behind vacuum for unlogged and temporary
tables as we are doing without this patch. That will ensure vacuum
won't be able to remove the TIDs which we are going to mark as dead.
This has been taken care in patch 0003. I understand that this is
slightly ugly, but the other alternative (as mentioned in the email
[1]) is much worse.

Hmm. So if I understand correctly, you're saying that the LSN check
in patch 0001 is actually completely unnecessary if we only apply
0001, but is needed in preparation for 0003, after which it will
really be doing something?

Right.

In more detail, I suppose the idea is: a TID cannot be reused until a
VACUUM has intervened; VACUUM always visits every data page in the
index; we won't allow a scan to pass VACUUM (and thus possibly have
one of its TIDs get reused) except when the LSN check is actually
sufficient to guarantee no TID reuse (i.e. table is not unlogged).

Right.

Page-at-a-time index vacuum as in _hash_vacuum_one_page doesn't matter
because such an operation doesn't allow TIDs to be reused.

Page-at-a-time index vacuum also allows TIDs to be reused but this is
done only for already marked dead items whereas vacuum can make the
non-dead entries to be removed. We don't have a problem with
page-at-a-time vacuum as it won't remove any items which the scan is
going to mark as dead.

Did I get it right?

I think so.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#45Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#44)
Re: Page Scan Mode in Hash Index

On Wed, Sep 20, 2017 at 7:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Page-at-a-time index vacuum as in _hash_vacuum_one_page doesn't matter
because such an operation doesn't allow TIDs to be reused.

Page-at-a-time index vacuum also allows TIDs to be reused but this is
done only for already marked dead items whereas vacuum can make the
non-dead entries to be removed. We don't have a problem with
page-at-a-time vacuum as it won't remove any items which the scan is
going to mark as dead.

I don't think page-at-a-time index vacuum allows heap TIDs to be
reused. To reuse a heap TID, we have to know that there are no index
entries pointing to it. There's no way for the heap to know that a
page-at-a-time index vacuum has even happened, let alone which TIDs
were affected and that all other indexes have also removed all index
entries for those TIDs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#45)
Re: Page Scan Mode in Hash Index

On Wed, Sep 20, 2017 at 4:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 20, 2017 at 7:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Page-at-a-time index vacuum as in _hash_vacuum_one_page doesn't matter
because such an operation doesn't allow TIDs to be reused.

Page-at-a-time index vacuum also allows TIDs to be reused but this is
done only for already marked dead items whereas vacuum can make the
non-dead entries to be removed. We don't have a problem with
page-at-a-time vacuum as it won't remove any items which the scan is
going to mark as dead.

I don't think page-at-a-time index vacuum allows heap TIDs to be
reused.

Right, I was thinking from the perspective of the index entry. Before
marking index entry as dead, we do check for heaptid. So, as heaptid
can't be reused via Page-at-a-time index vacuum, scan won't mark index
entry as dead.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#47Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#46)
Re: Page Scan Mode in Hash Index

On Wed, Sep 20, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I was thinking from the perspective of the index entry. Before
marking index entry as dead, we do check for heaptid. So, as heaptid
can't be reused via Page-at-a-time index vacuum, scan won't mark index
entry as dead.

It can mark index entries dead, but if it does, they correspond to
heap TIDs that are still dead, as opposed to heap TIDs that have been
resurrected by being reused for an unrelated tuple.

In other words, the danger scenario is this:

1. A page-at-a-time scan records all the TIDs on a page.
2. VACUUM processes the page, removing some of those TIDs.
3. VACUUM finishes, changing the heap TIDs from dead to unused.
4. Somebody inserts a new tuple at one of the existing TIDs, and the
index tuple gets put on the page scanned in step 1.
5. The page-at-a-time scan resumes and kills the tuple added in step 4
by mistake, when it really only intended to kill a tuple removed in
step 2.

What prevents this is:

A. To begin scanning a bucket, VACUUM needs a cleanup lock on the
primary bucket page. Therefore, there are no scans in progress at the
time that VACUUM begins scanning the bucket.

B. If a scan begins scanning the bucket, it can't pass VACUUM, because
VACUUM doesn't release the page lock on one page before taking the one
for the next page.

C. After 0003, it becomes possible for a scan to pass VACUUM if the
table is permanent, but it won't be a problem because of the LSN
check.
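
To make (C) concrete: the guard that 0001 adds to _hash_kill_items() boils
down to roughly the following (a simplified sketch of the corresponding hunk
in the 0001 patch, surrounding code omitted):

    /*
     * If the page LSN differs, the page was modified since we last read it,
     * so the saved item offsets may be stale and applying LP_DEAD hints is
     * not safe.
     */
    page = BufferGetPage(buf);
    if (PageGetLSN(page) != so->currPos.lsn)
    {
        if (havePin)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        else
            _hash_relbuf(rel, buf);
        return;
    }

For relations that are not WAL-logged the page LSN does not advance, so this
check proves nothing there; that is why 0003 keeps the scan behind vacuum for
unlogged and temporary tables instead of relying on it.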

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#47)
Re: Page Scan Mode in Hash Index

On Wed, Sep 20, 2017 at 6:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 20, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I was thinking from the perspective of the index entry. Before
marking index entry as dead, we do check for heaptid. So, as heaptid
can't be reused via Page-at-a-time index vacuum, scan won't mark index
entry as dead.

It can mark index entries dead, but if it does, they correspond to
heap TIDs that are still dead, as opposed to heap TIDs that have been
resurrected by being reused for an unrelated tuple.

In other words, the danger scenario is this:

1. A page-at-a-time scan records all the TIDs on a page.
2. VACUUM processes the page, removing some of those TIDs.
3. VACUUM finishes, changing the heap TIDs from dead to unused.
4. Somebody inserts a new tuple at one of the existing TIDs, and the
index tuple gets put on the page scanned in step 1.
5. The page-at-a-time scan resumes and kills the tuple added in step 4
by mistake, when it really only intended to kill a tuple removed in
step 2.

What prevents this is:

A. To begin scanning a bucket, VACUUM needs a cleanup lock on the
primary bucket page. Therefore, there are no scans in progress at the
time that VACUUM begins scanning the bucket.

B. If a scan begins scanning the bucket, it can't pass VACUUM, because
VACUUM doesn't release the page lock on one page before taking the one
for the next page.

C. After 0003, it becomes possible for a scan to pass VACUUM if the
table is permanent, but it won't be a problem because of the LSN
check.

That's right. So, in short, this patch handles the problematic scenario.
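
Concretely, the part of 0003 that lets a scan pass vacuum only for permanent
tables is the reordering in hashbucketcleanup(), roughly as below (condensed
from the 0003 patch):

    if (RelationNeedsWAL(rel))
    {
        /* permanent table: release the prior page before locking the next */
        if (retain_pin)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        else
            _hash_relbuf(rel, buf);

        next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
                                              LH_OVERFLOW_PAGE, bstrategy);
    }
    else
    {
        /* unlogged/temp table: keep lock chaining so scans stay behind vacuum */
        next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
                                              LH_OVERFLOW_PAGE, bstrategy);

        if (retain_pin)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        else
            _hash_relbuf(rel, buf);
    }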

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#49Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Sharma (#42)
Re: Page Scan Mode in Hash Index

On Wed, Sep 20, 2017 at 5:37 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Thanks for all your review comments. Please find my comments in-line.

+            if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+            {
+                if (so->currPos.buf == so->hashso_bucket_buf ||
+                    so->currPos.buf == so->hashso_split_bucket_buf)
+                    prev_blkno = InvalidBlockNumber;
+                else
+                    prev_blkno = opaque->hasho_prevblkno;
+            }

1. Why not remove the outer "if" statement?

2. How about adding a comment, like /* If this is a primary bucket
page, hasho_prevblkno is not a real block number. */

When _hash_readpage() doesn't find any qualifying tuples i.e. when
_hash_readnext() returns Invalid buffer, we just update prevPage,
nextPage and buf in
currPos (not currPage or lsn) as currPage and lsn should point to last
page in the hash bucket so that we can mark the killed items as dead
at the end of scan (with the help of _hash_kill_items). Hence, we keep
the currpage and lsn as it is if no more valid hash pages are found.

How about adding a comment about this, by extending this comment:

+                 * Remember next and previous block numbers for scrollable
+                 * cursors to know the start position and return FALSE
+                 * indicating that no more matching tuples were found.

e.g. (Don't reset currPage or lsn, because we expect _hash_kill_items
to be called for the old page after this function returns.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#50Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Robert Haas (#49)
3 attachment(s)
Re: Page Scan Mode in Hash Index

On Wed, Sep 20, 2017 at 8:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 20, 2017 at 5:37 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Thanks for all your review comments. Please find my comments in-line.

+            if (!BlockNumberIsValid(opaque->hasho_nextblkno))
+            {
+                if (so->currPos.buf == so->hashso_bucket_buf ||
+                    so->currPos.buf == so->hashso_split_bucket_buf)
+                    prev_blkno = InvalidBlockNumber;
+                else
+                    prev_blkno = opaque->hasho_prevblkno;
+            }

1. Why not remove the outer "if" statement?

Yes, the outer if statement is not required. I just missed removing
that in my earlier patch.

2. How about adding a comment, like /* If this is a primary bucket
page, hasho_prevblkno is not a real block number. */

Added.
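
With both changes, the relevant part of _hash_readpage() in the attached v16
patch now reads:

    /*
     * If this is a primary bucket page, hasho_prevblkno is not a real
     * block number.
     */
    if (so->currPos.buf == so->hashso_bucket_buf ||
        so->currPos.buf == so->hashso_split_bucket_buf)
        prev_blkno = InvalidBlockNumber;
    else
        prev_blkno = opaque->hasho_prevblkno;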

When _hash_readpage() doesn't find any qualifying tuples i.e. when
_hash_readnext() returns Invalid buffer, we just update prevPage,
nextPage and buf in
currPos (not currPage or lsn) as currPage and lsn should point to last
page in the hash bucket so that we can mark the killed items as dead
at the end of scan (with the help of _hash_kill_items). Hence, we keep
the currpage and lsn as it is if no more valid hash pages are found.

How about adding a comment about this, by extending this comment:

+                 * Remember next and previous block numbers for scrollable
+                 * cursors to know the start position and return FALSE
+                 * indicating that no more matching tuples were found.

e.g. (Don't reset currPage or lsn, because we expect _hash_kill_items
to be called for the old page after this function returns.)

Added.

Attached are the patches with above changes. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v7.patch (text/x-patch)
From 7f557dc99190469218f769042a695b777a40bc93 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Wed, 23 Aug 2017 16:15:49 +0530
Subject: [PATCH 3/3] Improve locking startegy during VACUUM in Hash Index for
 regular tables.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README     | 28 +++++++++++-------------
 src/backend/access/hash/hash.c     | 44 ++++++++++++++++++++++++++------------
 src/backend/access/hash/hashovfl.c | 13 +++++++----
 3 files changed, 51 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 4465605..8921c78 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
@@ -415,21 +415,17 @@ The fourth operation is garbage collection (bulk deletion):
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
 scan, but that does no harm.  As we release the lock on bucket page during
-cleanup scan of a bucket, it will allow concurrent scan to start on a bucket
-and ensures that scan will always be behind cleanup.  It is must to keep scans
-behind cleanup, else vacuum could decrease the TIDs that are required to
-complete the scan.  Now, as the scan that returns multiple tuples from the
-same bucket page always expect next valid TID to be greater than or equal to
-the current TID, it might miss the tuples.  This holds true for backward scans
-as well (backward scans first traverse each bucket starting from first bucket
-to last overflow page in the chain).  We must be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count if
-the stored tuple count and number of buckets did *not* change at any time
-during the scan.  This provides a way of correcting the stored tuple count if
-it gets out of sync for some reason.  But if a split or insertion does occur
-concurrently, the scan count is untrustworthy; instead, subtract the number of
-tuples deleted from the stored tuple count and use that.
+cleanup scan of a bucket, a concurrent scan is allowed to start on that bucket.
+It is quite possible for a scan to get ahead of vacuum and for vacuum to remove
+items from the page currently being scanned, but that does no harm because we
+always copy all the matching items from a page at once into a backend-local array.
+We must be careful about the statistics reported by the VACUUM operation.  What
+we can do is count the number of tuples scanned, and believe this in preference
+to the stored tuple count if the stored tuple count and number of buckets did
+*not* change at any time during the scan.  This provides a way of correcting the
+stored tuple count if it gets out of sync for some reason.  But if a split or
+insertion does occur concurrently, the scan count is untrustworthy; instead,
+subtract the number of tuples deleted from the stored tuple count and use that.
 
 
 Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 45a3a5a..012e00f 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -660,11 +660,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be.  But
+ * that is not a problem, because hash index scans now work in
+ * page-at-a-time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -833,18 +831,36 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As hash index scans now work in page-at-a-time mode, vacuum can
+		 * release the lock on the previous page before acquiring the lock on
+		 * the next page for regular tables.  For unlogged and temporary
+		 * tables, however, we avoid this, as we do not want a scan to get
+		 * ahead of vacuum on the same bucket; that keeps the dead-marking of
+		 * index tuples in _hash_kill_items() safe.
 		 */
-		if (retain_pin)
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		if (RelationNeedsWAL(rel))
+		{
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+		}
 		else
-			_hash_relbuf(rel, buf);
+		{
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+		}
 
 		buf = next_buf;
 	}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e70..b41afbb 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -524,7 +524,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
-	 * described atop hashbucketcleanup.
+	 * described atop _hash_squeezebucket.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
@@ -790,9 +790,14 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here.  This means there
+ *	can't be any concurrent scans in progress when we first enter this
+ *	function because of the cleanup lock we hold on the primary bucket page,
+ *	but as soon as we release that lock, there might be.  To prevent any
+ *	concurrent scan from crossing the squeeze scan, we use lock chaining,
+ *	i.e. we lock the next page in the bucket chain before releasing the lock
+ *	on the previous page.  (This type of lock chaining is not ideal, so we
+ *	might want to look for a better solution at some point.)
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.4.msysgit.0

0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v16.patch (text/x-patch)
From 485627178cfaf73a908498e79229e4db04a99648 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Wed, 20 Sep 2017 20:53:06 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a time.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 146 ++----------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 437 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  67 +++++-
 src/include/access/hash.h            |  55 ++++-
 6 files changed, 553 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..3b1f719 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+	the pin on the primary bucket throughout the scan)
+	save all the matching tuples from the current index page into an items array
+	release pin and content lock (but if it is the primary bucket page, retain
+	its pin till the end of the scan)
+	get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage.  The heap tuple IDs are then processed while
+not holding any page lock within the index, thereby allowing concurrent
+insertions into the same index page without the reader having to re-find its
+current scan position.  We do continue to hold a pin on the bucket page, to
+protect against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d89c192..8550218 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,21 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +296,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-					ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +308,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-
-	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
 
 	return res;
 }
@@ -396,35 +322,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first and _hash_next handle eliminating dead index entries
+		 * whenever scan->ignore_killed_tuples is true.  Therefore, there's
+		 * nothing to do here except add the results to the TIDBitmap.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +360,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,22 +385,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -514,15 +418,11 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 0579841..f279dce 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 3e461ad..55cb651 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,105 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			   ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			   Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple. so->currPos is updated as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with pin
+ *		held on bucket page but no pins or locks held on overflow
+ *		page.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to the next tuple on the current page; or if done, try to read
+	 * data from the next or previous page based on the scan direction. Before
+	 * moving to the next or previous page make sure that we deal with all the
+	 * killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on the bucket page for the whole
+				 * scan operation, so release the additional pin we have
+				 * acquired here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +273,18 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in the
+ *		index that satisfies the qualification associated with the scan
+ *		descriptor.
+ *
+ *		On successful exit, if the page containing the current index tuple is
+ *		an overflow page, both its pin and lock are released, whereas a bucket
+ *		page is kept pinned but not locked.  Data about the matching tuple(s)
+ *		on the page has been loaded into so->currPos, and
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		On failure exit (no more tuples), we return FALSE, with pin held on
+ *		bucket page but no pins or locks held on overflow page.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +297,10 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
+	HashScanPosItem *currItem;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +419,19 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
+
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
 		return false;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
+	/* if we're here, _hash_readpage found at least one valid tuple */
 	return true;
 }
 
@@ -575,3 +640,293 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items on the current index page and save the ones that
+ *	satisfy the qualification into so->currPos.  If no matching items are
+ *	found on the current page, we move to the next or previous page in the
+ *	bucket chain, as indicated by the scan direction.
+ *
+ *	Return true if any matching items are found, else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum;
+	uint16		itemIndex;
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	so->currPos.buf = buf;
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+	so->currPos.currPage = BufferGetBlockNumber(buf);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		BlockNumber prev_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != 0)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the next page. Before leaving the current page, deal with any
+			 * killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			/*
+			 * If this is a primary bucket page, hasho_prevblkno is not a real
+			 * block number.
+			 */
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				prev_blkno = InvalidBlockNumber;
+			else
+				prev_blkno = opaque->hasho_prevblkno;
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found. Also,
+				 * don't reset currPage or lsn, because we expect
+				 * _hash_kill_items to be called for the old page after this
+				 * function returns.
+				 */
+				so->currPos.prevPage = prev_blkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		BlockNumber next_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != MaxIndexTuplesPerPage)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the previous page. Before leaving the current page, deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				next_blkno = opaque->hasho_nextblkno;
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found. Also,
+				 * don't reset currPage or lsn, because we expect
+				 * _hash_kill_items to be called for the old page after this
+				 * function returns.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = next_blkno;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	if (so->currPos.buf == so->hashso_bucket_buf ||
+		so->currPos.buf == so->hashso_split_bucket_buf)
+	{
+		so->currPos.prevPage = InvalidBlockNumber;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	}
+	else
+	{
+		so->currPos.prevPage = opaque->hasho_prevblkno;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		_hash_relbuf(rel, so->currPos.buf);
+		so->currPos.buf = InvalidBuffer;
+	}
+
+	Assert(so->currPos.firstItem <= so->currPos.lastItem);
+	return true;
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple	itup;
+	int			itemIndex;
+	OffsetNumber maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * Skip tuples that were moved by a split operation if this scan
+			 * started while the split was in progress.  Also, skip tuples
+			 * that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * Skip tuples that were moved by a split operation if this scan
+			 * started while the split was in progress.  Also, skip tuples
+			 * that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 869cbc1..a825b82 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,30 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must hold a pin on the bucket buffer, but may or may not hold
+ * a pin on the overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep the pin on the bucket page throughout the scan.  Hence,
+ * there is no chance of VACUUM deleting any items from that page.  However,
+ * holding a pin on an overflow page doesn't guarantee that vacuum won't
+ * delete any items from it.
+ *
+ * See _bt_killitems() for more details.
  */
 void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
@@ -536,9 +553,11 @@ _hash_kill_items(IndexScanDesc scan)
 	int			numKilled = so->numKilled;
 	int			i;
 	bool		killedsomething = false;
+	bool		havePin = false;
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same items on other
@@ -546,20 +565,54 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (HashScanPosIsPinned(so->currPos))
+	{
+		/*
+		 * We already have a pin on this buffer, so all we need to do is
+		 * acquire a lock on it.
+		 */
+		havePin = true;
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+	/*
+	 * If the page LSN differs, the page was modified since we last read it,
+	 * so our killedItems entries may no longer be valid and applying LP_DEAD
+	 * hints is not safe.
+	 */
+	page = BufferGetPage(buf);
+	if (PageGetLSN(page) != so->currPos.lsn)
+	{
+		if (havePin)
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		else
+			_hash_relbuf(rel, buf);
+		return;
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int			itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -578,6 +631,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index c06dcb2..ce2124a 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -114,6 +114,53 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
+}			HashScanPosData;
+
+#define HashScanPosIsPinned(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BufferIsValid((scanpos).buf) \
+)
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -156,8 +203,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem *killedItems;	/* tids and offset numbers of killed items */
+	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them in
+	 * HashScanPosData
+	 */
+	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1

0002-Remove-redundant-hash-function-_hash_step-and-do-som.patchtext/x-patch; charset=US-ASCII; name=0002-Remove-redundant-hash-function-_hash_step-and-do-som.patchDownload
From ef4180ffcaea44054d5b4894240be804c3970c6d Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Mon, 7 Aug 2017 16:22:19 +0530
Subject: [PATCH] Remove redundant hash function _hash_step and do some code
 cleanup.

Remove the redundant function _hash_step() and some of the unused members
of HashScanOpaqueData. The function _hash_step(), which used to find the
next qualifying tuple in the index page, is no longer required because the
new hash index scan works page at a time, i.e. it reads all the qualifying
tuples in a page at once with the help of _hash_readpage().

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index f4408ab..58eb108 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -431,212 +431,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3e90b89..19fb147 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -159,14 +159,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -177,12 +169,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -432,7 +418,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1
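
The page-at-a-time behaviour implemented by _hash_readpage() and
_hash_saveitem() in the patches above boils down to: take the page lock once,
copy every matching TID into backend-local memory, release the lock, and only
then hand tuples back one by one. The following minimal standalone sketch
illustrates that pattern; all types and names in it are invented for
illustration and are not PostgreSQL internals.

/*
 * Toy illustration of page-at-a-time scanning.  Every identifier here is
 * made up for the example; this is not server code.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_MAX_ITEMS 8

typedef struct ToyItem
{
	uint32_t	hashkey;	/* hash of the indexed value */
	uint32_t	heap_tid;	/* stand-in for the heap TID */
} ToyItem;

typedef struct ToyPage
{
	pthread_mutex_t lock;	/* stand-in for the buffer content lock */
	int			nitems;
	ToyItem		items[PAGE_MAX_ITEMS];
} ToyPage;

/*
 * Copy every item matching 'key' into caller-local storage while holding the
 * page lock exactly once, which is the essence of _hash_readpage().
 */
static int
toy_readpage(ToyPage *page, uint32_t key, ToyItem *out)
{
	int			nmatched = 0;

	pthread_mutex_lock(&page->lock);
	for (int i = 0; i < page->nitems; i++)
	{
		if (page->items[i].hashkey == key)
			out[nmatched++] = page->items[i];	/* _hash_saveitem() analogue */
	}
	pthread_mutex_unlock(&page->lock);

	return nmatched;
}

int
main(void)
{
	ToyPage		page = {PTHREAD_MUTEX_INITIALIZER, 4,
		{{42, 1001}, {7, 1002}, {42, 1003}, {42, 1004}}};
	ToyItem		local[PAGE_MAX_ITEMS];
	int			n = toy_readpage(&page, 42, local);

	/* The saved tuples are handed back with no page lock held. */
	for (int i = 0; i < n; i++)
		printf("matched heap tid %u\n", local[i].heap_tid);
	return 0;
}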

#51Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Sharma (#50)
Re: Page Scan Mode in Hash Index

On Wed, Sep 20, 2017 at 11:43 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Attached are the patches with above changes. Thanks.

Thanks. I think that the comments and README changes in 0003 need
significantly more work. In several places, they fail to note the
unlogged vs. logged differences, and the header comment for
hashbucketcleanup still says that scans depend on increasing-TID order
(really, 0001 should change that text somehow).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#52Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Robert Haas (#51)
Re: Page Scan Mode in Hash Index

On Thu, Sep 21, 2017 at 9:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 20, 2017 at 11:43 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Attached are the patches with above changes. Thanks.

Thanks. I think that the comments and README changes in 0003 need
significantly more work. In several places, they fail to note the
unlogged vs. logged differences, and the header comment for
hashbucketcleanup still says that scans depend on increasing-TID order
(really, 0001 should change that text somehow).

Thanks for pointing that out. I will correct the comments in
hashbucketcleanup(), mention the handling done for logged and unlogged
tables in the README, and submit the updated patch asap.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


#53Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Robert Haas (#51)
3 attachment(s)
Re: Page Scan Mode in Hash Index

On Thu, Sep 21, 2017 at 9:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 20, 2017 at 11:43 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Attached are the patches with above changes. Thanks.

Thanks. I think that the comments and README changes in 0003 need
significantly more work. In several places, they fail to note the
unlogged vs. logged differences, and the header comment for
hashbucketcleanup still says that scans depend on increasing-TID order
(really, 0001 should change that text somehow).

I have added a note on the handling of logged and unlogged tables to the
README file and also corrected the header comment for
hashbucketcleanup(). Please find attached the updated 0003*.patch with
these changes. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Attachments:

0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v16.patchtext/x-patch; charset=US-ASCII; name=0001-Rewrite-hash-index-scan-to-work-page-at-a-time_v16.patchDownload
From 485627178cfaf73a908498e79229e4db04a99648 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Wed, 20 Sep 2017 20:53:06 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a time.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 146 ++----------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 437 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  67 +++++-
 src/include/access/hash.h            |  55 ++++-
 6 files changed, 553 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..3b1f719 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+	the pin on the primary bucket throughout the scan)
+	save all the matching tuples from the current index page into an items array
+	release pin and content lock (but if it is the primary bucket page,
+	retain its pin till the end of the scan)
+	get tuple from the items array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage. The heap tuple IDs are then processed while
+not holding any lock on the index page, thereby allowing concurrent insertions
+into the same index page without requiring the reader to re-find its current
+scan position. We do continue to hold a pin on the bucket page, to protect
+against concurrent deletions and bucket splits.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d89c192..8550218 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,21 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +296,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-					ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +308,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-
-	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
 
 	return res;
 }
@@ -396,35 +322,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first and _hash_next eliminate dead index entries whenever
+		 * scan->ignore_killed_tuples is true.  Therefore, there's nothing to
+		 * do here except add the results to the TIDBitmap.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +360,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,22 +385,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -514,15 +418,11 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items. Also, ensure
-	 * that we acquire lock on current page before calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan);
 	}
 
 	_hash_dropscanbuf(rel, so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 0579841..f279dce 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 3e461ad..55cb651 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,105 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			   ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			   Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple. so->currPos is updated as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with pin
+ *		held on bucket page but no pins or locks held on overflow
+ *		page.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to the next tuple on the current page; or if done, try to read
+	 * data from the next or previous page based on the scan direction. Before
+	 * moving to the next or previous page make sure that we deal with all the
+	 * killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on the bucket page for the
+				 * whole scan, so release the additional pin we have
+				 * acquired here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +273,18 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in the
+ *		index that satisfies the qualification associated with the scan
+ *		descriptor.
+ *
+ *		On successful exit, if the page containing the current index tuple is
+ *		an overflow page, both its pin and lock are released; if it is a
+ *		bucket page, it remains pinned but not locked.  Data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos, and
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		On failure exit (no more tuples), we return FALSE, with pin held on
+ *		bucket page but no pins or locks held on overflow page.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +297,10 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
+	HashScanPosItem *currItem;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +419,19 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
+
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
 		return false;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
+	/* if we're here, _hash_readpage found valid tuples */
 	return true;
 }
 
@@ -575,3 +640,293 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if they satisfy the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque opaque;
+	OffsetNumber offnum;
+	uint16		itemIndex;
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	so->currPos.buf = buf;
+
+	/*
+	 * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+	so->currPos.currPage = BufferGetBlockNumber(buf);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		BlockNumber prev_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != 0)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the next page. Before leaving the current page, deal with any
+			 * killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			/*
+			 * If this is a primary bucket page, hasho_prevblkno is not a real
+			 * block number.
+			 */
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				prev_blkno = InvalidBlockNumber;
+			else
+				prev_blkno = opaque->hasho_prevblkno;
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found. Also,
+				 * don't reset currPage or lsn, because we expect
+				 * _hash_kill_items to be called for the old page after this
+				 * function returns.
+				 */
+				so->currPos.prevPage = prev_blkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		BlockNumber next_blkno = InvalidBlockNumber;
+
+		for (;;)
+		{
+			/* new page, locate starting position by binary search */
+			offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+			itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+			if (itemIndex != MaxIndexTuplesPerPage)
+				break;
+
+			/*
+			 * Could not find any matching tuples in the current page, move to
+			 * the previous page. Before leaving the current page, deal with
+			 * any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan);
+
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+				next_blkno = opaque->hasho_nextblkno;
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+			}
+			else
+			{
+				/*
+				 * Remember next and previous block numbers for scrollable
+				 * cursors to know the start position and return FALSE
+				 * indicating that no more matching tuples were found. Also,
+				 * don't reset currPage or lsn, because we expect
+				 * _hash_kill_items to be called for the old page after this
+				 * function returns.
+				 */
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = next_blkno;
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	if (so->currPos.buf == so->hashso_bucket_buf ||
+		so->currPos.buf == so->hashso_split_bucket_buf)
+	{
+		so->currPos.prevPage = InvalidBlockNumber;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	}
+	else
+	{
+		so->currPos.prevPage = opaque->hasho_prevblkno;
+		so->currPos.nextPage = opaque->hasho_nextblkno;
+		_hash_relbuf(rel, so->currPos.buf);
+		so->currPos.buf = InvalidBuffer;
+	}
+
+	Assert(so->currPos.firstItem <= so->currPos.lastItem);
+	return true;
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple	itup;
+	int			itemIndex;
+	OffsetNumber maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+			{
+				/*
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation for the scan
+			 * that has started when split was in progress. Also, skip the
+			 * tuples that are marked as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				 (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				 (ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+			{
+				 * No more matching tuples exist on this page, so exit the
+				 * while loop.
+				 * loop.
+				 */
+				break;
+			}
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 869cbc1..a825b82 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,30 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos).
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep the pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from that page.  However,
+ * having pin on the overflow page doesn't guarantee that vacuum won't delete
+ * any items.
+ *
+ * See _bt_killitems() for more details.
  */
 void
 _hash_kill_items(IndexScanDesc scan)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Relation	rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
 	OffsetNumber offnum,
@@ -536,9 +553,11 @@ _hash_kill_items(IndexScanDesc scan)
 	int			numKilled = so->numKilled;
 	int			i;
 	bool		killedsomething = false;
+	bool		havePin = false;
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same items on other
@@ -546,20 +565,54 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (HashScanPosIsPinned(so->currPos))
+	{
+		/*
+		 * We already have a pin on this buffer, so all we need to do is
+		 * acquire a lock on it.
+		 */
+		havePin = true;
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+	}
+	else
+		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+
+	/*
+	 * If the page LSN differs, the page was modified since we last read it.
+	 * The killedItems information may no longer be valid, so applying LP_DEAD
+	 * hints is not safe.
+	 */
+	page = BufferGetPage(buf);
+	if (PageGetLSN(page) != so->currPos.lsn)
+	{
+		if (havePin)
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		else
+			_hash_relbuf(rel, buf);
+		return;
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int			itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -578,6 +631,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index c06dcb2..ce2124a 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -114,6 +114,53 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
+}			HashScanPosData;
+
+#define HashScanPosIsPinned(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BufferIsValid((scanpos).buf) \
+)
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -156,8 +203,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem *killedItems;	/* tids and offset numbers of killed items */
+	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them in
+	 * HashScanPosData
+	 */
+	HashScanPosData currPos;	/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
-- 
1.8.3.1
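
The safety check added to _hash_kill_items() above hinges on one idea:
remember the page's LSN at the time the matching items were collected, and
refuse to apply LP_DEAD hints later if the page has been modified in the
meantime. The short standalone sketch below illustrates that guard; every
identifier in it is invented for the example and none of it is PostgreSQL
code.

/*
 * Toy illustration of the "recheck the page LSN before applying hints"
 * guard; all names here are made up for the example.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_MAX_ITEMS 8

typedef struct ToyHintPage
{
	uint64_t	lsn;					/* bumped on every modification */
	bool		dead[TOY_MAX_ITEMS];	/* stand-in for LP_DEAD flags */
} ToyHintPage;

typedef struct ToyScanPos
{
	uint64_t	lsn_at_read;			/* LSN remembered when items were read */
	int			nkilled;
	int			killed[TOY_MAX_ITEMS];	/* offsets the scan wants to mark dead */
} ToyScanPos;

/*
 * Apply the remembered kills only if the page is unchanged since it was
 * read, mirroring the PageGetLSN(page) != so->currPos.lsn check in the patch.
 */
static bool
toy_kill_items(ToyHintPage *page, const ToyScanPos *pos)
{
	if (page->lsn != pos->lsn_at_read)
		return false;			/* page changed; hints may be stale, skip */

	for (int i = 0; i < pos->nkilled; i++)
		page->dead[pos->killed[i]] = true;
	return true;
}

int
main(void)
{
	ToyHintPage page = {.lsn = 100};
	ToyScanPos	pos = {.lsn_at_read = 100, .nkilled = 2, .killed = {1, 3}};

	printf("first attempt applied: %d\n", toy_kill_items(&page, &pos));

	page.lsn = 101;				/* simulate a concurrent modification */
	printf("second attempt applied: %d\n", toy_kill_items(&page, &pos));
	return 0;
}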

0002-Remove-redundant-hash-function-_hash_step-and-do-som.patchtext/x-patch; charset=US-ASCII; name=0002-Remove-redundant-hash-function-_hash_step-and-do-som.patchDownload
From ef4180ffcaea44054d5b4894240be804c3970c6d Mon Sep 17 00:00:00 2001
From: ashu <ashutosh12.1@example.com>
Date: Mon, 7 Aug 2017 16:22:19 +0530
Subject: [PATCH] Remove redundant hash function _hash_step and do some code
 cleanup.

Remove the redundant function _hash_step() and some of the unused members
of HashScanOpaqueData. The function _hash_step(), which used to find the
next qualifying tuple in the index page, is no longer required because the
new hash index scan works page at a time, i.e. it reads all the qualifying
tuples in a page at once with the help of _hash_readpage().

Patch by Ashutosh Sharma <ashu.coek88@gmail.com>
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index f4408ab..58eb108 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -431,212 +431,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;	/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3e90b89..19fb147 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -159,14 +159,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -177,12 +169,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -432,7 +418,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v8.patchtext/x-patch; charset=US-ASCII; name=0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index_v8.patchDownload
From c3bd06eb05fa600d70223acbd6a319cf53b990f0 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sharma@enterprisedb.com>
Date: Thu, 21 Sep 2017 12:24:04 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index for
 regular tables.

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/README     | 33 +++++++++++-----------
 src/backend/access/hash/hash.c     | 58 +++++++++++++++++++++++++-------------
 src/backend/access/hash/hashovfl.c | 13 ++++++---
 3 files changed, 65 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 3b1f719..a77e45d 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
@@ -415,21 +415,22 @@ The fourth operation is garbage collection (bulk deletion):
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
 scan, but that does no harm.  As we release the lock on bucket page during
-cleanup scan of a bucket, it will allow concurrent scan to start on a bucket
-and ensures that scan will always be behind cleanup.  It is must to keep scans
-behind cleanup, else vacuum could decrease the TIDs that are required to
-complete the scan.  Now, as the scan that returns multiple tuples from the
-same bucket page always expect next valid TID to be greater than or equal to
-the current TID, it might miss the tuples.  This holds true for backward scans
-as well (backward scans first traverse each bucket starting from first bucket
-to last overflow page in the chain).  We must be careful about the statistics
-reported by the VACUUM operation.  What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count if
-the stored tuple count and number of buckets did *not* change at any time
-during the scan.  This provides a way of correcting the stored tuple count if
-it gets out of sync for some reason.  But if a split or insertion does occur
-concurrently, the scan count is untrustworthy; instead, subtract the number of
-tuples deleted from the stored tuple count and use that.
+cleanup scan of a bucket, it will allow a concurrent scan to start on a bucket.
+It is quite possible for scans on a regular table to get ahead of vacuum and for
+vacuum to remove some items from the current page being scanned, but that does
+no harm because we always copy all the matching items from a page at once into
+the backend's local array and also check the page's LSN before marking a tuple
+in a page as dead.  However, this does not hold for unlogged or temporary tables,
+for which the LSN is not meaningful; if a scan overtook vacuum there, it might
+mark some valid tuple as dead.  Hence, for unlogged or temporary tables we
+always ensure that the scan follows VACUUM.  We must
+be careful about the statistics reported by the VACUUM operation.  What we can
+do is count the number of tuples scanned, and believe this in preference to the
+stored tuple count if the stored tuple count and number of buckets did *not*
+change at any time during the scan.  This provides a way of correcting the stored
+tuple count if it gets out of sync for some reason.  But if a split or insertion
+does occur concurrently, the scan count is untrustworthy; instead, subtract the
+number of tuples deleted from the stored tuple count and use that.
 
 
 Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 8550218..76474f3 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -655,16 +655,18 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * primary bucket page.  The lock won't necessarily be held continuously,
  * though, because we'll release it when visiting overflow pages.
  *
- * It would be very bad if this function cleaned a page while some other
- * backend was in the midst of scanning it, because hashgettuple assumes
- * that the next valid TID will be greater than or equal to the current
- * valid TID.  There can't be any concurrent scans in progress when we first
- * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * It is possible for this function to clean a page while some other backend
+ * is in the midst of scanning it, but that won't affect the concurrent scan:
+ * since the scan works in page-at-a-time mode, the hash page being scanned
+ * is no longer locked/unlocked at the tuple level, and so hashgettuple does
+ * not need to refind the TID of the next valid tuple in the index page on
+ * the assumption that a concurrent insert might have added a new tuple to
+ * the page.  However, we do such validation in _hash_kill_items to ensure
+ * that we are marking the correct index tuple as dead.  There can't be any
+ * concurrent scans in progress when we first enter this function because of
+ * the cleanup lock we hold on the primary bucket page, but as soon as we
+ * release that lock, there might be.  We need not worry about that either,
+ * because the hash index scan works in page-at-a-time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -833,18 +835,36 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode, vacuum can
+		 * release the lock on the previous page before acquiring the lock on
+		 * the next page for regular tables.  For unlogged tables, we avoid
+		 * this because we do not want a scan to cross vacuum when both are
+		 * running on the same bucket page; that keeps the dead-marking of
+		 * index tuples in _hash_kill_items() safe.
 		 */
-		if (retain_pin)
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		if (RelationNeedsWAL(rel))
+		{
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+		}
 		else
-			_hash_relbuf(rel, buf);
+		{
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+		}
 
 		buf = next_buf;
 	}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e70..b41afbb 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -524,7 +524,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
-	 * described atop hashbucketcleanup.
+	 * described atop _hash_squeezebucket.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
@@ -790,9 +790,14 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here. This means there
+ *	can't be any concurrent scans in progress when we first enter this
+ *	function because of the cleanup lock we hold on the primary bucket page,
+ *	but as soon as we release that lock, there might be. To prevent any
+ *	concurrent scan to cross the squeeze scan we use lock chaining i.e.
+ *	we lock the next page in the bucket chain before releasing the lock on
+ *	the previous page. (This type of lock chaining is not ideal, so we might
+ *	want to look for a better solution at some point.)
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.3.1

#54Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Sharma (#53)
2 attachment(s)
Re: Page Scan Mode in Hash Index

On Thu, Sep 21, 2017 at 3:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I have added a note for handling of logged and unlogged tables in
README file and also corrected the header comment for
hashbucketcleanup(). Please find the attached 0003*.patch having these
changes. Thanks.

I committed 0001 and 0002 with some additional edits as
7c75ef571579a3ad7a1d3ee909f11dba5e0b9440. I also rebased 0003 and
edited it a bit; see attached hash-cleanup-changes.patch.

I'm not entirely sold on 0003. An alternative would be to rip the lsn
stuff back out of HashScanPosData, and I think we ought to consider
that. Basically, 0003 is betting that getting rid of the
lock-chaining in hash index vacuum is more valuable than being able to
kill dead items more aggressively. I bet that's a bad bet.
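
To make the trade-off concrete, the lock-chaining being given up is just an
ordering choice in hashbucketcleanup()'s walk of the overflow chain; roughly
the sketch below, condensed from the attached hash-cleanup-changes.patch, with
variable names as in that code:

    /* Lock chaining (today): lock the next overflow page before letting
     * go of the current one, so no concurrent scan can pass the cleanup. */
    next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
                                          LH_OVERFLOW_PAGE, bstrategy);
    if (retain_pin)
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    else
        _hash_relbuf(rel, buf);

    /* 0003, for WAL-logged relations only: release first, then lock the
     * next page, so a suspended scan may overtake the cleanup scan. */
    if (retain_pin)
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    else
        _hash_relbuf(rel, buf);
    next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
                                          LH_OVERFLOW_PAGE, bstrategy);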

In the case of btree indexes, since
2ed5b87f96d473962ec5230fd820abfeaccb2069, page-at-a-time scanning
allows most btree index scans to avoid holding buffer pins when the
scan is suspended, but we gain no such advantage here. We always have
to hold a pin on the primary bucket page anyway, so even with this
patch cleanup is going to block when it hits a bucket containing a
suspended scan. 0003 helps if (a) the relation is permanent, (b) the
bucket has overflow pages, and (c) the scan is moving faster than
vacuum and can overtake it instead of waiting. But that doesn't seem
like it will happen very often at all, whereas the LSN check will
probably fail frequently and cause us to skip cleanup that we could
usefully have done. So I propose the attached hashscan-no-lsn.patch
as an alternative.
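
For reference, the LSN check that the no-lsn patch rips back out amounts to
the following guard in _hash_kill_items (simplified; the real function also
locates the buffer and then walks the killedItems array):

    /*
     * If the page LSN differs from the one remembered in so->currPos,
     * the page was modified since we copied its items, so the saved
     * offsets may be stale and applying LP_DEAD hints is not safe.
     */
    page = BufferGetPage(buf);
    if (PageGetLSN(page) != so->currPos.lsn)
    {
        if (havePin)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        else
            _hash_relbuf(rel, buf);
        return;
    }
    /* otherwise, mark the remembered killed items LP_DEAD as usual */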

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

hash-cleanup-changes.patch (application/octet-stream)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 5827389a70..c2c4863c02 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -396,8 +396,9 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
-			release prior lock and pin (except keep pin on primary bucket page)
+			release lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page (unless !RelationNeedsWAL, in which case
+              this is instead prior to releasing the lock and pin)
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
@@ -449,8 +450,13 @@ for a scan to start after VACUUM has released the cleanup lock on the bucket
 but before it has processed the entire bucket and then overtake the cleanup
 operation.
 
-Currently, we prevent this using lock chaining: cleanup locks the next page
-in the chain before releasing the lock and pin on the page just processed.
+For temporary and unlogged relations, we prevent this using lock chaining:
+cleanup locks the next page in the chain before releasing the lock and pin
+on the page just processed.  For permanent relations, we use a different
+solution: when a scan is about to kill items, it checks whether the page LSN
+has changed since the page was initially examined and, if so, skips trying
+to kill items.  This is considered a better solution because lock chaining is
+generally undesirable, but it also has the downside of postponing cleanup.
 
 Free Space Management
 ---------------------
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0fef60a858..3420775b0a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -662,8 +662,10 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * wake up only after VACUUM has completed and the TID has been recycled for
  * an unrelated tuple.  To avoid that calamity, we prevent scans from passing
  * our cleanup scan by locking the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is not
- * ideal, so we might want to look for a better solution at some point.)
+ * releasing the lock on the previous page.  However, we only need to worry
+ * about this for scans of temporary or unlogged tables; permanent tables
+ * won't have this problem, because _hash_kill_items will notice that the
+ * page LSN has changed and skip cleanup.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -832,18 +834,36 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan works in page-at-a-time mode, vacuum can
+		 * release the lock on the previous page before acquiring the lock on
+		 * the next page for regular tables.  For unlogged tables, we avoid
+		 * this because we do not want a scan to cross vacuum when both are
+		 * running on the same bucket page; that keeps the dead-marking of
+		 * index tuples in _hash_kill_items() safe.
 		 */
-		if (retain_pin)
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		if (RelationNeedsWAL(rel))
+		{
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+		}
 		else
-			_hash_relbuf(rel, buf);
+		{
+			next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+												  LH_OVERFLOW_PAGE,
+												  bstrategy);
+
+			if (retain_pin)
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, buf);
+		}
 
 		buf = next_buf;
 	}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index c206e704d4..b41afbb416 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -524,7 +524,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
-	 * described atop hashbucketcleanup.
+	 * described atop _hash_squeezebucket.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
@@ -790,9 +790,14 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here. This means there
+ *	can't be any concurrent scans in progress when we first enter this
+ *	function because of the cleanup lock we hold on the primary bucket page,
+ *	but as soon as we release that lock, there might be. To prevent any
+ *	concurrent scan to cross the squeeze scan we use lock chaining i.e.
+ *	we lock the next page in the bucket chain before releasing the lock on
+ *	the previous page. (This type of lock chaining is not ideal, so we might
+ *	want to look for a better solution at some point.)
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
hashscan-no-lsn.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index ce5515dbcb..81a206eeb7 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -463,12 +463,6 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
 	so->currPos.buf = buf;
-
-	/*
-	 * We save the LSN of the page as we read it, so that we know whether it
-	 * is safe to apply LP_DEAD hints to the page later.
-	 */
-	so->currPos.lsn = PageGetLSN(page);
 	so->currPos.currPage = BufferGetBlockNumber(buf);
 
 	if (ScanDirectionIsForward(dir))
@@ -508,7 +502,6 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 			{
 				so->currPos.buf = buf;
 				so->currPos.currPage = BufferGetBlockNumber(buf);
-				so->currPos.lsn = PageGetLSN(page);
 			}
 			else
 			{
@@ -562,7 +555,6 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 			{
 				so->currPos.buf = buf;
 				so->currPos.currPage = BufferGetBlockNumber(buf);
-				so->currPos.lsn = PageGetLSN(page);
 			}
 			else
 			{
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index a825b82706..df77252267 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -579,21 +579,7 @@ _hash_kill_items(IndexScanDesc scan)
 	else
 		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
 
-	/*
-	 * If page LSN differs it means that the page was modified since the last
-	 * read. killedItems could be not valid so applying LP_DEAD hints is not
-	 * safe.
-	 */
 	page = BufferGetPage(buf);
-	if (PageGetLSN(page) != so->currPos.lsn)
-	{
-		if (havePin)
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-		else
-			_hash_relbuf(rel, buf);
-		return;
-	}
-
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 0e0f3e17a7..e3135c1738 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -117,7 +117,6 @@ typedef struct HashScanPosItem	/* what we remember about each match */
 typedef struct HashScanPosData
 {
 	Buffer		buf;			/* if valid, the buffer is pinned */
-	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
 	BlockNumber currPage;		/* current hash index page */
 	BlockNumber nextPage;		/* next overflow page */
 	BlockNumber prevPage;		/* prev overflow or bucket page */
@@ -153,7 +152,6 @@ typedef struct HashScanPosData
 #define HashScanPosInvalidate(scanpos) \
 	do { \
 		(scanpos).buf = InvalidBuffer; \
-		(scanpos).lsn = InvalidXLogRecPtr; \
 		(scanpos).currPage = InvalidBlockNumber; \
 		(scanpos).nextPage = InvalidBlockNumber; \
 		(scanpos).prevPage = InvalidBlockNumber; \
#55Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Robert Haas (#54)
Re: Page Scan Mode in Hash Index

On Fri, Sep 22, 2017 at 11:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 21, 2017 at 3:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I have added a note for handling of logged and unlogged tables in
README file and also corrected the header comment for
hashbucketcleanup(). Please find the attached 0003*.patch having these
changes. Thanks.

I committed 0001 and 0002 with some additional edits as
7c75ef571579a3ad7a1d3ee909f11dba5e0b9440. I also rebased 0003 and
edited it a bit; see attached hash-cleanup-changes.patch.

Thanks for the commit. I had put a lot of effort into this and am very
happy that it got committed. Thanks to Amit too for the detailed review.

I'm not entirely sold on 0003. An alternative would be to rip the lsn
stuff back out of HashScanPosData, and I think we ought to consider
that. Basically, 0003 is betting that getting rid of the
lock-chaining in hash index vacuum is more valuable than being able to
kill dead items more aggressively. I bet that's a bad bet.

In the case of btree indexes, since
2ed5b87f96d473962ec5230fd820abfeaccb2069, page-at-a-time scanning
allows most btree index scans to avoid holding buffer pins when the
scan is suspended, but we gain no such advantage here. We always have
to hold a pin on the primary bucket page anyway, so even with this
patch cleanup is going to block when it hits a bucket containing a
suspended scan. 0003 helps if (a) the relation is permanent, (b) the
bucket has overflow pages, and (c) the scan is moving faster than
vacuum and can overtake it instead of waiting. But that doesn't seem
like it will happen very often at all, whereas the LSN check will
probably fail frequently and cause us to skip cleanup that we could
usefully have done. So I propose the attached hashscan-no-lsn.patch
as an alternative.

Thoughts?

--

Yes, I too feel that the 0003 patch won't help much. The chances of a
scan overtaking vacuum would be very rare, and hash indexes are normally
meant for unique values (that is when a hash index is quite dominant
over other indexes), which means the chances of overflow pages in a
hash index won't be high. Therefore, I feel the 0003 patch won't be
very beneficial. Honestly speaking, the code changes in 0003 look a bit
ugly as well. So, yes, hashscan-no-lsn.patch looks like a better option
to me. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


#56Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#54)
1 attachment(s)
Re: Page Scan Mode in Hash Index

On Fri, Sep 22, 2017 at 11:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 21, 2017 at 3:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I have added a note for handling of logged and unlogged tables in
README file and also corrected the header comment for
hashbucketcleanup(). Please find the attached 0003*.patch having these
changes. Thanks.

I committed 0001 and 0002 with some additional edits as
7c75ef571579a3ad7a1d3ee909f11dba5e0b9440.

I have noticed a typo in that commit (in the README) and a patch for
the same is attached.

I also rebased 0003 and
edited it a bit; see attached hash-cleanup-changes.patch.

I'm not entirely sold on 0003. An alternative would be to rip the lsn
stuff back out of HashScanPosData, and I think we ought to consider
that. Basically, 0003 is betting that getting rid of the
lock-chaining in hash index vacuum is more valuable than being able to
kill dead items more aggressively. I bet that's a bad bet.

In the case of btree indexes, since
2ed5b87f96d473962ec5230fd820abfeaccb2069, page-at-a-time scanning
allows most btree index scans to avoid holding buffer pins when the
scan is suspended, but we gain no such advantage here. We always have
to hold a pin on the primary bucket page anyway, so even with this
patch cleanup is going to block when it hits a bucket containing a
suspended scan. 0003 helps if (a) the relation is permanent, (b) the
bucket has overflow pages, and (c) the scan is moving faster than
vacuum and can overtake it instead of waiting. But that doesn't seem
like it will happen very often at all, whereas the LSN check will
probably fail frequently and cause us to skip cleanup that we could
usefully have done. So I propose the attached hashscan-no-lsn.patch
as an alternative.

I think your proposal makes sense. Your patch looks good, but you
might want to tweak the comments atop _hash_kill_items ("However,
having pin on the overflow page doesn't guarantee that vacuum won't
delete any items."). That part of the comment was written to
indicate that we have to check the LSN in this function unconditionally.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

typo_hash_readme_v1.patch (application/octet-stream)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 5827389..bb90722 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -434,7 +434,7 @@ concurrent scan could start in that bucket before we've finished vacuuming it.
 If a scan gets ahead of cleanup, we could have the following problem: (1) the
 scan sees heap TIDs that are about to be removed before they are processed by
 VACUUM, (2) the scan decides that one or more of those TIDs are dead, (3)
-VACUUM completes, (3) one or more of the TIDs the scan decided were dead are
+VACUUM completes, (4) one or more of the TIDs the scan decided were dead are
 reused for an unrelated tuple, and finally (5) the scan wakes up and
 erroneously kills the new tuple.
 
#57Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#56)
Re: Page Scan Mode in Hash Index

On Mon, Sep 25, 2017 at 12:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think your proposal makes sense. Your patch looks good but you
might want to tweak the comments atop _hash_kill_items ("However,
having pin on the overflow page doesn't guarantee that vacuum won't
delete any items.). That part of the comment has been written to
indicate that we have to check LSN in this function unconditonally.

OK, committed.

And I think that's the last of the hash index work. Thanks to Amit
Kapila, Ashutosh Sharma, Mithun Cy, Kuntal Ghosh, Jesper Pedersen, and
Dilip Kumar for all the patches and reviews and to Jeff Janes and
others for additional code review and testing!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
